I’ve written a few blog posts in the past about vSAN Data at Rest Encryption (D@RE). These posts explain how encryption works, and how the keys are handed over to vSphere. Go here for more info.
For vSAN D@RE to work properly, ESXi hosts need to be able to reach the KMS cluster during reboot operations. Yes, hopefully you have a cluster for redundancy, but a single KMS server will still work. This is necessary in order for ESXi hosts within the vSAN cluster to be able to obtain both the Host Encryption Key (let’s call this HEK), and the Key Encryption Key (KEK).
Wait!!! Why do we have to go to KMS again if we already received the keys?!?!
See, the Host Encryption Key and the Key Encryption Key live in a non-persistent state in memory, in the key cache. When a vSAN node (ESXi server) is rebooted, these keys go away (poof…gone). So, when vSAN encryption is enabled and a host is rebooted, it needs to go out to the KMS and get those keys again. That means you want to make sure that your hosts can talk to KMS, and that KMS has your keys, before you consider rebooting your hosts. Oh yeah, it goes without saying that the KMS should NOT be in the vSAN cluster, and you can see why: a KMS sitting on the encrypted datastore can’t hand out the keys needed to mount that same datastore.
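If you want a quick sanity check before a reboot, something like the sketch below can confirm that the KMS port is at least answering. Keep the assumptions in mind: `kms_reachable` is a helper name I made up, 5696 is the IANA-registered KMIP port (your KMS may listen elsewhere), and the check only proves TCP reachability from wherever you run it. It does not prove that the ESXi hosts themselves can reach the KMS, or that your keys still exist on it.

```python
import socket


def kms_reachable(host: str, port: int = 5696, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only tests that something accepts connections
    on the port. It does not speak KMIP or verify that any key exists.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example usage (the hostname is a placeholder, not a real KMS):
#   if not kms_reachable("kms.example.com"):
#       print("KMS is not answering; hold off on rebooting hosts")
```

Treat a `True` here as necessary but not sufficient; the vSAN health checks in vCenter are still the place to confirm the KMS connection from the hosts’ point of view.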
Once the HEK is obtained, the host reaches crypto-safe mode, which lets it get to a good operational state and continue with the boot process, at which point it asks the KMS for the KEK. If the host is not able to obtain these keys from the KMS cluster, it will continue to boot; however, its disks will not be mounted. The host never reached crypto-safe mode and never obtained the KEK, so it cannot unwrap the Data Encryption Key (DEK).
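To make the sequence easier to picture, here is a toy model of the boot flow just described. This is only a sketch of the logic in this post, not how ESXi is implemented: `FakeKMS`, `BootResult`, and `boot_host` are names I invented, and the "keys" are just placeholder bytes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BootResult:
    crypto_safe: bool    # did the host obtain the HEK?
    disks_mounted: bool  # did the host obtain the KEK and unwrap the DEK?


class FakeKMS:
    """Hypothetical in-memory stand-in for a KMS (not a real KMIP client)."""

    def __init__(self, keys: Dict[str, bytes], reachable: bool = True):
        self.keys = keys
        self.reachable = reachable

    def get_key(self, key_id: str) -> Optional[bytes]:
        # An unreachable KMS and a deleted key look the same to the host.
        if not self.reachable:
            return None
        return self.keys.get(key_id)


def boot_host(kms: FakeKMS) -> BootResult:
    # The key cache is empty after a reboot, so both keys are fetched again.
    hek = kms.get_key("HEK")
    if hek is None:
        # The host still boots, but never reaches crypto-safe mode.
        return BootResult(crypto_safe=False, disks_mounted=False)

    kek = kms.get_key("KEK")
    if kek is None:
        # Without the KEK the DEK cannot be unwrapped, so no disks mount.
        return BootResult(crypto_safe=True, disks_mounted=False)

    return BootResult(crypto_safe=True, disks_mounted=True)
```

With both keys available, `boot_host` reports the disks as mounted; with the KMS unreachable or a key missing, the host "boots" but its disks stay unmounted.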
In a scenario where hosts are being updated/upgraded via VUM, in most cases the hosts will do a rolling reboot as part of the VUM process. In vSAN 6.7 and earlier, rolling reboots of hosts via VUM were allowed regardless of the state of the connection with KMS and the availability of keys. As already described, these keys are necessary in order to properly mount the drives on each host after a reboot.
In vSAN 6.7 Update 1, VMware has added guard rails to prevent the disks of multiple hosts from being unmounted due to lack of connectivity with KMS or accidental key deletion. During an upgrade operation, VUM will place a host in Enhanced Maintenance Mode (EMM), perform updates, reboot, and exit EMM. If, after a reboot, the host is not able to reach crypto-safe mode, the host will not exit EMM, stalling the VUM progress. In this case, the host’s drives are not mounted because it could not reach crypto-safe mode. If the upgrade were allowed to continue, all the other hosts would upgrade too, but every drive in the vSAN datastore would end up unmounted.
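The guard rail boils down to "stop the rolling upgrade at the first host that doesn’t come back healthy." A minimal sketch of that idea, assuming a hypothetical `reaches_crypto_safe` callback in place of whatever check VUM and vSAN actually perform:

```python
from typing import Callable, List, Optional, Tuple


def rolling_upgrade(
    hosts: List[str],
    reaches_crypto_safe: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Upgrade hosts one at a time and stall on the first unhealthy one.

    Returns (hosts that exited maintenance mode, host the run stalled on).
    The second value is None when every host completed.
    """
    completed: List[str] = []
    for host in hosts:
        # Enter maintenance mode, patch, and reboot would happen here.
        if not reaches_crypto_safe(host):
            # The host stays in maintenance mode and the remaining hosts
            # are left untouched, so their disks stay mounted.
            return completed, host
        completed.append(host)
    return completed, None
```

If the KMS is down from the start, only the first host is affected and the rest of the cluster keeps serving storage. Without the check, the loop would happily reboot every host and leave the whole datastore unmounted.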
This new guard rail helps prevent losing all vSAN storage due to connectivity issues, accidental changes to the KMS, or missing keys. This feature also highlights the benefits of having an HCI solution embedded in the kernel: the ease of orchestration with other vSphere components and features makes vSAN even more appealing.