Troubleshooting Upgrade Failures

Collect Support Bundles

You can collect support bundles on registered cluster and fabric nodes and download the bundles to your machine or upload them to a file server.

If you choose to download the bundles to your machine, you get a single archive file consisting of a manifest file and support bundles for each node. If you choose to upload the bundles to a file server, the manifest file and the individual bundles are uploaded to the file server separately.

Procedure

From your browser, log in as a local admin user to an NSX Manager at https://nsx-manager-ip-address/login.jsp?local=true.
Select System > Support Bundle.
Select the target nodes.
- The available types of nodes are Management Nodes, Edges, Hosts, and Public Cloud Gateways.
(Optional) Specify log age in days to exclude logs that are older than the specified number of days.
(Optional) Toggle the switch that indicates whether to include or exclude core files and audit logs.
(Optional) Select the check box to upload the bundles to a remote file server.
Click Start Bundle Collection to start collecting support bundles.
- Depending on how many log files exist, each node might take several minutes.
Monitor the status of the collection process.
- The status tab shows the progress of collecting support bundles.
If you did not select the option to upload the bundles to a remote file server, click Download to download the bundle.
- The bundle collection may fail for a manager node if there is not enough disk space. If you encounter an error, check whether older support bundles are present on the failed node. Log in to the NSX Manager UI of the failed manager node using its IP address and initiate the bundle collection from that node. When prompted by the NSX Manager, either download the older bundle or delete it.
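Because bundle collection can fail on a manager node that is low on disk space, you might want to check free space on the node before starting. The following Python sketch is illustrative only. It assumes you have a shell on the manager node with Python 3 available. It also assumes that the root filesystem ("/") is the one that matters, because this section does not say where bundles are written. The 1 GiB threshold is a hypothetical value, not a documented NSX requirement.

```python
import shutil

# Hypothetical threshold chosen for illustration; NSX does not document a minimum.
MIN_FREE_BYTES = 1024**3  # 1 GiB


def has_room_for_bundle(path: str = "/", min_free: int = MIN_FREE_BYTES) -> bool:
    """Return True if the filesystem that holds `path` has at least `min_free` bytes free."""
    usage = shutil.disk_usage(path)  # named tuple (total, used, free), in bytes
    return usage.free >= min_free


if __name__ == "__main__":
    # Assumed target: the root filesystem of the manager node.
    if has_room_for_bundle("/"):
        print("Enough free space to attempt bundle collection.")
    else:
        print("Low on space: download or delete older support bundles first.")
```

A passing check does not guarantee that collection will succeed. It only rules out the most obvious cause described above.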

Upgrade Fails Due to a Timeout

An event during the upgrade process fails and the message from the Upgrade Coordinator indicates a timeout error.

Problem

During the upgrade process, the following events might fail because they do not complete within a specific time. The Upgrade Coordinator reports a timeout error for the event and the upgrade fails.

Event	Timeout Value
Putting a host into maintenance mode	4 hours
Waiting for a host to reboot	32 minutes
Waiting for the NSX service to be running on a host	13 minutes
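The timeout values in the table can be encoded directly so that you can estimate how close a running event is to failing. This is a minimal Python sketch, not part of the Upgrade Coordinator. The event keys are hypothetical names. Treating an event as timed out only once elapsed time strictly exceeds the limit is an assumption, because the exact boundary behavior is not documented.

```python
from datetime import timedelta

# Timeout values from the table above; the dictionary keys are hypothetical names.
UPGRADE_EVENT_TIMEOUTS = {
    "enter_maintenance_mode": timedelta(hours=4),
    "host_reboot": timedelta(minutes=32),
    "nsx_service_running": timedelta(minutes=13),
}


def timed_out(event: str, elapsed: timedelta) -> bool:
    """Return True if `elapsed` exceeds the documented timeout for `event`."""
    return elapsed > UPGRADE_EVENT_TIMEOUTS[event]


def time_remaining(event: str, elapsed: timedelta) -> timedelta:
    """Return the time left before `event` times out, never less than zero."""
    return max(UPGRADE_EVENT_TIMEOUTS[event] - elapsed, timedelta(0))


if __name__ == "__main__":
    # Example: a host has been rebooting for 20 minutes.
    print(time_remaining("host_reboot", timedelta(minutes=20)))  # 0:12:00
```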

Solution

For the maintenance mode issue, log in to VMware vCenter and verify the status of tasks related to the host. Resolve any problems.
For the host reboot issue, check the host to see why it failed to reboot.
For the NSX service issue, log in to the NSX Manager UI, select System > Appliances, and check whether the host has an installation error. If so, you can resolve it from the NSX Manager UI. If the error cannot be resolved, refer to the upgrade logs to determine the cause of the failure.

Upgrade Fails Due to Insufficient Space in Bootbank on ESXi Host

NSX upgrade might fail if there is insufficient space in the bootbank or in the alt-bootbank on an ESXi host.

Problem

Unused VIBs on the ESXi host might be relatively large in size and therefore use up significant disk space. The unused VIBs can result in insufficient space in the bootbank or in the alt-bootbank during upgrade.

Solution

Uninstall the VIBs that are no longer required and free up additional disk space.

For more information on locating and deleting VIBs, see the VMware knowledge base article at https://kb.vmware.com/s/article/74864
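Before and after removing unused VIBs, you can check how much space is free in the boot banks. The following Python sketch only reports space and does not remove anything. Identifying and deleting VIBs is covered by the knowledge base article above. The sketch assumes that a Python 3 interpreter is available in the ESXi Shell and that /bootbank and /altbootbank resolve to the boot bank volumes, as they do on typical installations. The 100 MiB threshold is hypothetical.

```python
import os
import shutil

# Typical ESXi boot bank locations (assumption: both resolve to mounted volumes).
BOOTBANK_PATHS = ("/bootbank", "/altbootbank")


def bootbank_free_mib(paths=BOOTBANK_PATHS) -> dict:
    """Map each path to the free space, in MiB, of the volume it lives on."""
    return {p: shutil.disk_usage(p).free // (1024 * 1024) for p in paths}


def low_space(paths=BOOTBANK_PATHS, min_free_mib: int = 100) -> list:
    """Return the paths whose free space is below `min_free_mib` (hypothetical threshold)."""
    return [p for p, mib in bootbank_free_mib(paths).items() if mib < min_free_mib]


if __name__ == "__main__":
    # Skip paths that do not exist, for example when run outside an ESXi host.
    present = [p for p in BOOTBANK_PATHS if os.path.exists(p)]
    for path in low_space(present):
        print(f"{path} is low on space; remove VIBs that are no longer required.")
```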

Unable to Upgrade Host Placed in NSX Maintenance Mode

A host fails during the upgrade process and the Upgrade Coordinator places the host in NSX maintenance mode. When you restart the upgrade, a host that is still in NSX maintenance mode cannot be upgraded.

Problem

Hosts that fail during upgrade are placed in NSX maintenance mode.

Solution

Manually troubleshoot and fix the problem on the host.
From the NSX Manager UI, select System > Fabric > Hosts.
Locate the host that you fixed and select it.
- The status of the host is maintenance mode.
Evacuate any VMs present on the host and restart the host.
Select Actions > Exit Maintenance Mode.

Failure to Upload the Upgrade Bundle

The upgrade bundle fails to upload because of insufficient disk space.

Solution

In the NSX Manager CLI, delete the unused files located at /image/vmware/nsx/file-store/* and /image/core/*.
- Note: Ensure that you do not delete the /image/upgrade-coordinator-tomcat folder or other folders located at /image.
From your browser, log in as a local admin user to an NSX Manager at https://nsx-manager-ip-address/login.jsp?local=true.
Select System > Support Bundle and delete any unused support bundles.
Reupload the upgrade bundle and continue with the upgrade process.
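The cleanup in the first step can be scripted so that nothing outside the two named directories is touched. The following Python sketch is a hypothetical helper, not an NSX tool. It assumes a root shell on the NSX Manager with Python 3 available, and it assumes that you have already confirmed every entry in those two directories is unused. It deletes the contents of the directories but not the directories themselves, and it defaults to a dry run so that you can review the list first.

```python
import shutil
from pathlib import Path

# Only the two directories named in the procedure are cleaned. Everything else
# under /image, including /image/upgrade-coordinator-tomcat, is left alone.
CLEANUP_DIRS = ("vmware/nsx/file-store", "core")


def clean_image_dirs(image_root: str = "/image", dry_run: bool = True) -> list:
    """Delete the contents (not the directories) of the approved cleanup dirs.

    Returns the sorted list of paths that were removed, or that would be
    removed when `dry_run` is True.
    """
    removed = []
    for rel in CLEANUP_DIRS:
        target = Path(image_root) / rel
        if not target.is_dir():
            continue
        # sorted() materializes the listing before anything is deleted.
        for entry in sorted(target.iterdir()):
            removed.append(str(entry))
            if dry_run:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return sorted(removed)


if __name__ == "__main__":
    for path in clean_image_dirs(dry_run=True):
        print("would remove:", path)
```

The `image_root` parameter exists so that you can try the helper against a scratch directory before pointing it at /image.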

Backup and Restore During Upgrade

The Management Plane stops responding during the upgrade process and you need to restore a backup that was taken while the upgrade was in progress.

Problem

The Upgrade Coordinator has been upgraded and the Management Plane stops responding. You have a backup that was created while the upgrade was in progress.

Solution A

Deploy a Management Plane node with the same IP address as the node from which the backup was created.
Upload the upgrade bundle that you used at the beginning of the upgrade process.
Upgrade the Upgrade Coordinator.
Restore the backup taken during the upgrade process.
Upload a new upgrade bundle if necessary.
Continue with the upgrade process.

Solution B

If you have upgraded from NSX 3.2 or later versions, you can also restore the system from the local backup taken by the upgrade process just before upgrading the first NSX Manager. The local backup is available at /image/backup/<unified_app_version>/cluster-node-backups on all manager nodes. If you want to use the local backup for restore, copy the backup file from NSX Manager to an SFTP location and then perform the following steps to restore the system.

Log in to NSX Manager as a root user.
Run the following command to copy the backup file to an SFTP server.
scp -rp /image/backup/<unified_app_version>/* user@<SFTP server IP address>:/<backup_path>
Run the following command to view the generated passphrase.
cat /image/.backup_keystore/.keyfile
Copy the passphrase and save it in a secure location.

While copying the file, you must maintain the same directory structure on the SFTP location as on NSX Manager.
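The copy and passphrase steps can also be driven from Python on the NSX Manager. This is a minimal sketch, assuming Python 3 is available in the root shell. The SFTP user, host, path, and version string in the example are placeholders. Because subprocess does not expand shell wildcards, the `*` from the documented scp command is expanded with glob.glob, and scp -r copies each directory with its layout intact.

```python
import glob
import subprocess
from pathlib import Path
from typing import List

# Locations named in the procedure above.
BACKUP_ROOT = "/image/backup"
KEYFILE = Path("/image/.backup_keystore/.keyfile")


def build_scp_command(app_version: str, user: str, sftp_host: str,
                      backup_path: str, backup_root: str = BACKUP_ROOT) -> List[str]:
    """Build the scp argv that copies the local backup and keeps its directory layout.

    subprocess does not expand shell wildcards, so the `*` from the documented
    command is expanded here with glob.glob instead.
    """
    sources = sorted(glob.glob(f"{backup_root}/{app_version}/*"))
    if not sources:
        raise FileNotFoundError(f"no local backup under {backup_root}/{app_version}")
    # -r copies directories recursively; -p preserves modification times and modes.
    return ["scp", "-rp", *sources, f"{user}@{sftp_host}:/{backup_path}"]


def read_passphrase(keyfile=KEYFILE) -> str:
    """Return the generated backup passphrase without its trailing newline."""
    return Path(keyfile).read_text().rstrip("\n")


if __name__ == "__main__":
    # Placeholder values: substitute your version string, SFTP user, host, and path.
    try:
        cmd = build_scp_command("UNIFIED_APP_VERSION", "sftpuser", "192.0.2.10", "nsx-backups")
    except FileNotFoundError as err:
        print(err)
    else:
        subprocess.run(cmd, check=True)  # scp prompts for the SFTP password
```

Treat the value returned by read_passphrase like any other secret, and do not print it to shared logs.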

Loss of Controller Connectivity after Host Upgrade

Controller connectivity is lost after you upgrade your hosts.

Problem

After you upgrade a host, the post-checks show that the Node Status has lost connectivity to the controller.

Solution

Open an SSH session to the ESXi host experiencing the issue and confirm that none of the three NSX controllers are in a connected state. Run the nsxcli -c get controllers command.
- In a working configuration, two controllers display the not used status and one controller has the connected status. If the NSX Controller shows connected, refresh the UI and confirm that the status is green. If the controller shows not connected, continue to the next step.
Open an SSH session to one of the NSX Manager nodes as admin and run the get certificate api thumbprint command.
- The command output is an alphanumeric string that is unique to this NSX Manager.
On the ESXi host, push the host certificate to the Management Plane:
- ESXi1> nsxcli -c push host-certificate <NSX Manager IP or FQDN> username admin thumbprint <thumbprint obtained in the previous step>
- When prompted, enter the admin user password for the NSX Manager. See the NSX Command-Line Interface Reference for more information.
Confirm the controller status is connected.
- ESXi1> nsxcli -c get controllers
- Confirm the controller connection state is green on the UI for this Transport Node.
- If this issue continues, restart the following NSX services on the ESXi host:
  - ESXi1> /etc/init.d/nsx-opsagent restart
  - ESXi1> /etc/init.d/nsx-proxy restart
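The decision in the first step depends on the pattern of controller statuses. The following Python sketch encodes that rule using the status strings from the text above. It assumes the statuses have already been read from the output of nsxcli -c get controllers. Parsing that output is left out because its exact format is not described here.

```python
from collections import Counter
from typing import List


def controller_state_ok(statuses: List[str]) -> bool:
    """True for the working pattern described above: exactly one controller is
    "connected" and every other controller is "not used"."""
    counts = Counter(s.strip().lower() for s in statuses)
    return counts["connected"] == 1 and counts["not used"] == len(statuses) - 1


def next_step(statuses: List[str]) -> str:
    """Pick the next action from the procedure based on the controller statuses."""
    if any(s.strip().lower() == "connected" for s in statuses):
        return "refresh the UI and confirm the status is green"
    return "push the host certificate to the Management Plane"


if __name__ == "__main__":
    # Example statuses, assumed to have been read from `nsxcli -c get controllers`.
    observed = ["not used", "not used", "not connected"]
    print(controller_state_ok(observed), next_step(observed))
```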

In-place Upgrade Fails

If an in-place upgrade fails on an ESXi 7.0 host and the failure is not a PSOD (purple screen of death), vMotion the VMs off the host and then reboot the host.

Solution

Log in to VMware vCenter and place the host in maintenance mode.
For an ESXi 7.0 host, use the following command to clear the upgrade status flag on the host:
nsxcli -c set host-switch upgrade-status false
vMotion the VMs out of the host.
Reboot the host and resume the upgrade process.
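The flag-clearing step applies only to ESXi 7.0 hosts. The following Python sketch captures that condition and runs the documented command only when you turn off dry-run mode. It assumes a Python 3 interpreter in the ESXi Shell and that the host is already in maintenance mode. The helper names and the version string in the example are hypothetical. The argument list splits the documented command the same way a shell would.

```python
import subprocess
from typing import List, Optional

# The documented command, split the same way a shell would split it.
CLEAR_FLAG_ARGV = ["nsxcli", "-c", "set", "host-switch", "upgrade-status", "false"]


def clear_flag_command(esxi_version: str) -> Optional[List[str]]:
    """Return the argv for clearing the upgrade status flag, or None when the
    host is not ESXi 7.0 (the procedure only calls for this step on 7.0)."""
    if esxi_version == "7.0" or esxi_version.startswith("7.0."):
        return list(CLEAR_FLAG_ARGV)
    return None


def clear_upgrade_flag(esxi_version: str, dry_run: bool = True) -> Optional[List[str]]:
    """Run the command on the local ESXi host, or by default only return it."""
    cmd = clear_flag_command(esxi_version)
    if cmd is not None and not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical version string; use the real version of the host you are fixing.
    print(clear_upgrade_flag("7.0.3"))
```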

NSX Manager User Interface is Inaccessible During Upgrade

When upgrading from NSX 3.1.x or 3.2, the NSX Manager User Interface may be inaccessible during the Management Plane upgrade.

Problem

The NSX upgrade has been running longer than expected and the NSX Manager user interface is not accessible.

Cause

When upgrading from NSX 3.1.x or 3.2, the NSX Manager user interface is inaccessible during the Management Plane upgrade.

Solution

The inaccessibility of the NSX Manager user interface does not necessarily indicate an upgrade failure. To verify the upgrade status, run the following command from the NSX Manager CLI:

get upgrade progress-status

If you see an upgrade failure, follow the troubleshooting steps that are displayed in the command output.
