VUM Host upgrade fails – Another task is already in progress

I’ve come across this issue a few times already, so I figured it would be good to share my findings, especially since it took me down a rabbit hole.

First off, the error “Another task is already in progress” is kind of obvious, yet it does not provide enough detail. I ran into it while trying to upgrade 4 hosts in a cluster via VUM; no tasks were running when the upgrade started. The upgrade was initiated from VMware Cloud Foundation. Thinking the issue was with VCF, I tried a direct upgrade from VUM, skipping VCF… but received the same error. I also tested with/without vSAN, NFS, etc.

The issue appeared to be a race condition where both the VUM remediation AND this “other task” were trying to run at the same time on that host. From the task console we can see another “install” operation kicking off just as VUM was trying to remediate. The Initiator gave me a hint: vcHms had “cut in line” and run an install operation.

I started looking into the esxupdate log (/var/log/esxupdate.log), and noticed that the host was trying to install a vib. But this vib was already installed, so what gives?
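If you want to watch that log live while a remediation attempt runs, tailing it from the host shell works fine:

# Follow the ESXi update log during the VUM remediation attempt
tail -f /var/log/esxupdate.log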

The name of the initiator and one of the vibs (vr2c-firewall.vib) gave me a clue. I was running vSphere Replication on that particular cluster, so I started digging in. To validate my suspicion, I shut down the VR appliance and attempted to upgrade one host. The upgrade worked as expected with no errors, so I was pretty certain something within vSphere Replication was causing this.

First I needed to SSH into the VR appliance or use the console. You won’t be able to SSH into the appliance until SSH is enabled: log in to the console with the root password and run the following script to enable it.

/usr/bin/enable-sshd.sh

Within the appliance I looked into the config settings for HMS and discovered that there were 2 vibs set to auto-install at host reconnect. So there is a race condition when VUM and VR are both trying to run an install task at the same time (at reconnect). Makes perfect sense now.


The ESXi update logs indicated that vr2c-firewall.vib was the one trying to install. After re-checking the vibs (esxcli software vib list) on all the hosts, I could see this vib was already installed, but the task from VR kept trying to install it at reconnect.
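A quick way to confirm the vib is already present on a host is to filter the vib list from the host shell (the grep filter is just for convenience):

# List installed vibs and filter for the VR firewall vib
esxcli software vib list | grep -i vr2c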

As a workaround, I decided to disable the auto-install of this particular vib by running the following command from the /opt/vmware/hms/bin directory and then restarting the hms service:

./hms-configtool -cmd reconfig -property hms-auto-install-vr2c-vib=false

service hms restart
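For reference, the whole workaround from the appliance shell looks like this (same commands as above, consolidated):

# Disable auto-install of the vr2c firewall vib at host reconnect
cd /opt/vmware/hms/bin
./hms-configtool -cmd reconfig -property hms-auto-install-vr2c-vib=false
# Restart HMS so the new setting takes effect
service hms restart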


This workaround worked like a charm and I was able to upgrade the rest of the cluster using VUM. I did not find an official KB about this, and this is by no means an official workaround/fix.

Disclaimer: If you plan to implement this fix, be aware that this is not an official VMware blog, and changes to products may or may not cause issues and/or affect support.

Devices unavailable for vSAN

I’ve been running into this scenario quite a bit lately, so I figured I’d write something about it; it also helps me refresh my memory.

Lately I’ve been helping out with a lot of VMware Cloud Foundation Proof of Concepts (VCF POCs)… That’s a mouthful!

Upon inspecting the environment I was supposed to work on, I found hosts that had once been part of a vSAN cluster but were not properly cleaned up prior to the ESXi rebuild.

The vSAN clean up process involves the following steps:

  1. Set host in Maintenance Mode
  2. Delete Disk Group(s)
    • This removes vSAN partitions as well
  3. Remove host from vSAN cluster
  4. Clean up network
  5. Remove from vCenter

As previously mentioned, I usually get called in after the vCenter is gone and hosts have been re-imaged. Now what?!?!

The steps to manually clean the hosts involve the following tasks:

Delete Disk Groups via CLI

Since we do not have access to vCenter in this case, we can delete the Disk Group via CLI using esxcli commands.

You can remove the disks by specifying the SSD cache device, which removes that device and all of its backing capacity devices, or you can do it one device at a time by using each device’s UUID.

To remove the devices one at a time, you can use the vdq -q command to list the devices and then run esxcli vsan storage remove -u <uuid> for each one. I prefer the cache-device method since it is a lot faster (see the consolidated sketch after the list below)…

  • List devices by using vdq -i

  • Run esxcli vsan storage remove -s <cache device>
  • In this case we will run this command for naa.55cd2e404b66fcbb and naa.55cd2e404b4393b9
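Putting it together for the two disk groups in this example:

# List disk mappings to identify the cache (SSD) device of each disk group
vdq -i
# Remove each disk group by its cache device; the backing capacity devices are removed with it
esxcli vsan storage remove -s naa.55cd2e404b66fcbb
esxcli vsan storage remove -s naa.55cd2e404b4393b9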

 

Now What?!?!

Now that we have deleted the disk groups, we need to make sure that the host doesn’t think it is still part of a vSAN cluster (the commands are summarized in the short sketch after the list below).

Remove Host from vSAN Cluster via CLI

  • Check vSAN cluster membership
    • esxcli vsan cluster get
  • If it belongs to a cluster, remove it
    • esxcli vsan cluster leave
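Run back to back from the host shell, the check and the removal look like this:

# Check whether the host still claims vSAN cluster membership
esxcli vsan cluster get
# If it reports membership, leave the cluster
esxcli vsan cluster leave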

Clean up Network configuration

Since the hosts were re-imaged, this should already be at the defaults, but in case it was not cleaned up, or was manually re-created, you can clean it up by resetting the configuration.

The following command resets the root password, wipes all network configuration changes, and reboots the host. Don’t do this unless you are planning to reset the host to its defaults… You’ve been warned.

/bin/firmwareConfig.py --reset

 

At this point, you can use the cleaned host to either join a new vSAN cluster, commission a host in VCF SDDC Manager, or use it as part of the VCF bring up process for the Management Domain.

 

@GreatWhiteTec

Lab ESXi err_cert_revoked in Chrome

I recently deployed a new lab and encountered an error from Chrome – err_cert_revoked. Usually I click through the Chrome warnings and accept moving forward in “unsafe” mode.

However, there was no option to continue. The error indicated “You cannot visit <yoursite.com> right now because this certificate has been revoked…”

Since this is an internal lab, I don’t worry much about external certs and whatnot; I just needed to get into my lab to do some work…

Workaround

Rename hosts to the correct names

First I found out all my hosts were named “localhost.mylab.com”, so naturally the first step was to fix the host names. Easy. Go to the DCUI and change the host name on each host.
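If you prefer not to use the DCUI, the same rename can be done from the host shell; a quick sketch, where esxi01 is just an example host name in the mylab.com domain:

# Set the host name and domain (esxi01 is a placeholder)
esxcli system hostname set --host=esxi01 --domain=mylab.com
# Verify the resulting FQDN
esxcli system hostname get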

 

Backup certificates and Generate new certificates

Once I changed all the names, I made a backup of the original certs, just in case, by running the following commands under /etc/vmware/ssl:

mv rui.crt backup.rui.crt

mv rui.key backup.rui.key

Then I generated new certificates by running /sbin/generate-certificates.

Finally, I rebooted my hosts.
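Putting those steps together, the sequence on each host looks like this:

# Back up the existing certificate and key, then generate a new self-signed pair
cd /etc/vmware/ssl
mv rui.crt backup.rui.crt
mv rui.key backup.rui.key
/sbin/generate-certificates
# Reboot (or restart the management agents) so the new certificate is picked up
reboot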

 

Download new certificate

For this step, I opened the ESXi UI in Firefox and, when I got the error, I had the option to download the certificate and keychain. I clicked on PEM (cert) to download the cert.

 

Trust Certificate

Once I downloaded the cert, I opened it on my Mac with Keychain Access. To trust the certificate, I double-clicked it, expanded the Trust section, and changed the “When using this certificate” drop-down from “Use System Defaults” to “Always Trust”.
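If you prefer the Mac’s terminal over the Keychain Access UI, something like the following should do the same job; esxi01.mylab.com is a placeholder for your host’s FQDN:

# Pull the host's certificate in PEM form
openssl s_client -connect esxi01.mylab.com:443 -showcerts </dev/null 2>/dev/null | openssl x509 -outform PEM > esxi01.pem
# Mark it as trusted in the System keychain (equivalent to "Always Trust")
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain esxi01.pem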

 

THIS IS A LAB ENVIRONMENT (internal). DO NOT TRUST sites you are not familiar with. 

 

Unable to power on VM – “vmwarePhoton64Guest is not supported”

I was in the process of deploying vSAN Performance Monitor and came across an error.

The error pops up as “No host is compatible with the virtual machine”.

It further describes the error as “vmwarePhoton64Guest is not supported”.

 

The FIX:

Right-click the vSAN Performance Monitor VM.

Select Compatibility > Upgrade VM Compatibility.

Select Yes for the upgrade. Warning: This step is NOT reversible.

Finally, select the version you want the VM to be compatible with, depending on the vSphere version you are currently running.
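If you would rather make the same change from the host shell, vim-cmd can upgrade the VM’s compatibility as well; a rough sketch, where the VM ID comes from getallvms and vmx-14 is only an example target version:

# Find the VM ID of the vSAN Performance Monitor VM
vim-cmd vmsvc/getallvms
# Upgrade the VM's hardware (compatibility) version
vim-cmd vmsvc/upgrade <vmid> vmx-14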

 

More info about VM Compatibility settings is available here.

What’s new in vSAN Encryption with 6.7 U1?

I’ve written a few blog posts in the past about vSAN Data at Rest Encryption (D@RE). These posts explain how encryption works, and how the keys are handed over to vSphere. Go here for more info.

For vSAN D@RE to work properly, ESXi hosts need to be able to reach the KMS cluster during reboot operations. Yes, hopefully you have a cluster for redundancy, but a single KMS server will still work. This is necessary so that ESXi hosts within the vSAN cluster can obtain both the Host Encryption Key (let’s call this the HEK) and the Key Encryption Key (KEK).

Wait!!! Why do we have to go to KMS again if we already received the keys?!?!

See, the Host Encryption Key and the Key Encryption Key live in a non-persistent state in memory, in the key cache. When a vSAN node (ESXi server) is rebooted, these keys go away (poof… gone). So, when vSAN encryption is enabled and the hosts are rebooted, each host needs to go out to the KMS and get those keys again. You will want to make sure that your hosts can talk to the KMS, and that the KMS has your keys, before you consider rebooting your hosts. Oh yeah, it goes without saying that the KMS should NOT be in the vSAN cluster, and you can see why.
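A quick sanity check before rebooting encrypted hosts is to confirm that each host can actually reach the KMS; for example, using nc from the ESXi shell, where kms01.example.com is a placeholder and 5696 is the default KMIP port:

# Test TCP connectivity from the ESXi host to the KMS on the KMIP port
nc -z kms01.example.com 5696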

Once the HEK is obtained, the host reaches crypto-safe mode, which allows it to get into a good operational state and continue with the boot process, at which point it asks the KMS for the KEK. If the host is not able to obtain these keys from the KMS cluster, it will still continue to boot; however, the disks will not be mounted, because a host that never reached crypto-safe mode and never obtained the KEK cannot unwrap the Data Encryption Key (DEK).

In a scenario where hosts are being updated/upgraded via VUM, in most cases the hosts will do a rolling reboot as part of the VUM process. With vSAN 6.7 and prior, rolling reboots of hosts via VUM were allowed regardless of the state of the connection to the KMS and the availability of keys. As already described, those keys are necessary to properly mount the drives on each host during a reboot.

In vSAN 6.7 Update 1, VMware has added guard rails to prevent the disks of multiple hosts from unmounting due to lack of connectivity with the KMS or accidental key deletion. During an upgrade operation, VUM places a host in Enhanced Maintenance Mode (EMM), performs the updates, reboots, and exits EMM. If, after a reboot, the host is not able to reach crypto-safe mode, it will not exit EMM, stalling the VUM progress. In this case the host’s drives are not mounted because it could not reach crypto-safe mode; if the upgrade were allowed to continue, all the other hosts would be upgraded as well, and all the drives within the vSAN datastore would end up unmounted.

This new guard rail helps prevent losing all vSAN storage due to connectivity issues with the KMS, accidental key changes, or key unavailability. It also highlights the benefits of having an HCI solution embedded in the kernel: the ease of orchestration with other vSphere components and features makes vSAN even more appealing.