vSAN 6.7 Upgrade Considerations

On April 17, 2018, VMware released vSphere 6.7. This includes vCenter, ESXi, and of course vSAN. A lot of people are looking to upgrade in order to take advantage of the new features and enhancements, primarily the HTML5 client… goodbye, Flash client! Links to more info on what's new for vSphere and vSAN are available on VMware's site.

From an HTML5 client perspective, feature parity is about 95%. For vSAN alone, it is about 99%; the only missing piece is Configuration Assist, which is still available via the Flash client.

I see a lot of people getting confused and still believing that vSAN is a separate upgrade, just like traditional storage. Fortunately, vSAN is in the kernel, so once you upgrade ESXi you have also upgraded vSAN. Boom!!! Even though the version numbers may not match exactly between ESXi and vSAN, they still go hand in hand. With that, it is important to understand the steps necessary for a vSAN upgrade.

Given the nature of vSAN, we need to follow the vSphere upgrade path. This includes checking the VMware Product Interoperability Matrices, not only for your current versions against the versions you are upgrading to, but also for all the other VMware products in the environment, such as NSX, SRM, vROps, etc.


Upgrade Process Overview

From an upgrade process perspective, you have options. You can migrate your Windows vCenter to the vCenter Appliance (recommended). If you already have the vCenter Appliance, you can either do an in-place upgrade, or deploy a new vCenter and migrate your hosts over to the fresh install. Here is more info on vSphere upgrades.

  1. Upgrade PSC
  2. Upgrade, migrate, or deploy a new vCenter
    1. Depends on current version
  3. Upgrade ESXi
    1. This will also upgrade vSAN (easy, right?)
  4. Upgrade the vSAN on-disk format (ODF); a quick version check is shown after this list
  5. Upgrade VMware Tools
  6. Upgrade vDS
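
Once the hosts are on the new ESXi build, you can confirm the current on-disk format version of each claimed disk from the ESXi shell before kicking off the ODF upgrade. A minimal check; the exact field names can vary slightly between releases:

esxcli vsan storage list | grep -i "format version"

The ODF upgrade itself is a rolling, cluster-wide task started from the vSAN Disk Management screen (or with RVC's vsan.ondisk_upgrade command), and it can take a while on larger clusters.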


As previously discussed, you will need to check the Product Interoperability Matrix to make sure all your products can run on vSphere 6.7. Don't shoot from the hip and start upgrading before you have done proper research.


I mentioned the choice of migrating hosts to a new vCenter. This is something I do quite often in my labs, and it is a simple process.

Migration Process Overview

  1. Export vDS configuration (including port groups)
  2. Copy licenses from old vCenter
  3. Configure new vCenter
  4. Create Datacenter in vCenter
  5. Create a Cluster and enable vSAN on it
    1. If you currently have other services enabled, they will have to be enabled on the new vCenter as well prior to moving the hosts.
  6. Add licenses
    1. Assign license to vCenter
    2. Assign vSAN license to cluster asset
  7. Create vDS on the new vCenter
  8. Import configuration and port groups to new vCenter
  9. On the new vCenter, add hosts
    1. There is no need to disconnect the hosts on the old vCenter; they will show as disconnected there once they connect to the new vCenter.
    2. Confirm ESXi license or assign a new one.
  10. Connect the hosts to the vDS (imported)
    1. Make sure you go through and verify the assignment of uplinks, failover order, VMkernel ports, etc. (a quick membership check is shown after this list)
  11. Lastly, you will need to tell vSphere that this new vCenter is now authoritative
    1. You will get an alert about this
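
After the hosts are moved and attached to the imported vDS, a quick sanity check from the ESXi shell confirms that vSAN cluster membership survived the move:

esxcli vsan cluster get

The Sub-Cluster Member Count in the output should match the number of hosts you migrated; if a host shows up in its own single-node cluster, revisit the vSAN VMkernel port and uplink configuration on the imported vDS.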


[Screenshots: vSAN Performance view and the HTML5 client]

Considerations when Enabling vSAN Encryption

In previous posts, I talked about the vSAN Encryption architecture and how to enable the feature. However, aside from the requirements, there are a couple of considerations that should be taken into account prior to enabling vSAN Encryption.

BIOS Settings:

With most deployments, whether vSphere or vSAN, I've noticed that BIOS settings are often overlooked, even though they can help increase performance with a simple change. One of those settings is AES-NI. AES-NI was proposed by Intel some time back, and it is essentially a set of new instructions (NI) for the Advanced Encryption Standard (AES); hence the acronym AES-NI. What AES-NI does is provide hardware acceleration to applications that use AES for encryption and decryption.

Most modern CPUs (Intel and AMD) support AES-NI, and some hardware vendors ship BIOS configurations with AES-NI enabled by default. When considering vSAN Encryption, it is imperative to make sure that AES-NI is enabled in the BIOS, in order to offload those instructions to the CPU and accelerate the execution of AES workloads.

Failure to enable AES-NI while encryption is enabled may result in a dramatic increase in CPU utilization. In recent versions of vSAN, the Health Check UI detects and alerts when AES-NI has not been enabled. If the BIOS does not have an option to enable AES-NI, it is most likely that the feature is always enabled.

Note: This also applies to VM encryption.
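
On vSphere 6.7 you can also pull the vSAN health results from the ESXi shell. This is a minimal sketch; it assumes the esxcli vsan health namespace is present in your build, and the exact wording of the AES-NI check may differ by release, so treat the grep pattern as an assumption:

esxcli vsan health cluster list | grep -i aes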


Available Space

The other consideration is available space. My previous posts discuss the data migration that occurs if vSAN Encryption is enabled after data has already been placed on the vSAN datastore, due to the on-disk reformat task that is required. Although vSAN Encryption does not incur a space overhead during normal operation, keep in mind that there needs to be enough free capacity in the cluster to evacuate an entire disk group during the configuration process. For example (hypothetical numbers): if each disk group holds 10 TB of data, the cluster needs roughly 10 TB of free capacity so that one disk group at a time can be evacuated, reformatted with encryption, and re-created.


Considerations for using LACP with vSAN

I am a firm believer in spending a good amount of time on the design phase of any deployment. Planning ahead and knowing your limitations will make your life easier; maybe just a little bit, but every bit helps.

If you are planning on using LACP for link aggregation with vSAN, I strongly advise you to get familiar with your options and check the vSAN Network Design guide at storagehub.vmware.com. There you will learn about NIC teaming options and LACP requirements such as LAGs and the vDS, as well as the pros and cons (below).

Pros and Cons of dynamic Link Aggregation/LACP (from Storagehub)

Pros

  • Improves performance and bandwidth: one vSAN node or VMkernel port can communicate with many other vSAN nodes using many different load-balancing options.
  • Network adapter redundancy: if a NIC fails and the link state goes down, the remaining NICs in the team continue to pass traffic.
  • Rebalancing of traffic after failures is fast and automatic.

Cons

  • Physical switch configuration: Less flexible and requires that physical switch ports be configured in a port-channel configuration.
  • Complex: Introducing full physical redundancy gets very complex when multiple switches are used, and implementations can become quite vendor-specific.


In addition to this, you should also consider vSphere limitations while using LACP. Some of the limitations include:

  • LACP settings are not available in host profiles
  • No support for port mirroring
  • Does not work with the ESXi dump collector

For a complete list of limitations, visit VMware Docs here.
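
Once a LAG is configured and the hosts are attached to it, you can verify LACP negotiation from the ESXi shell. The LAG and uplink names in the output are whatever you defined on your vDS:

esxcli network vswitch dvs vmware lacp config get
esxcli network vswitch dvs vmware lacp status get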

Be cognizant of some "gotchas" as well. For example, restarting the management agents on an ESXi host running vSAN with LACP via the services.sh script may cause issues: the "services.sh restart" script also restarts the LACP daemon (/etc/init.d/lacp). Instead, use the "/etc/init.d/<module> restart" command to restart individual services. See KB1003490.
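
For example, to bounce only hostd and vpxa without touching the LACP daemon:

/etc/init.d/hostd restart
/etc/init.d/vpxa restart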


As with any other deployment, you should weigh the pros, cons, future plans, and environment dependencies, among other things.

Remember the 7Ps – Proper Prior Planning Prevents Pitiful Poor Performance


Get Inbox drivers for Storage Controllers with vSAN

If you are familiar with vSAN, you already know that the VCG is a "must follow" guide to success. In certain instances, deployments may use ESXi OEM/custom images based on internal policies or even personal preference. However, these types of images contain vendor-specific tools, drivers, etc. One of the results of using such images is that storage controllers end up with async drivers rather than inbox drivers. For the sake of demonstration, we can focus on the PERC H730.

While checking Configuration Assist, you can see a warning stating that the recommended driver per the VCG is not currently installed. In the picture below you can see that we have uncertified drivers, and we need to roll the drivers back to the correct version.

Links to download such drivers are typically found within the VCG or on the ESXi "Drivers" tab on the downloads page. On some occasions, this link may not be present for the version you are running. So how do you get the correct drivers? You certainly don't want to be running drivers that have not been certified for vSAN, do you?!?! Of course NOT.

You can get such drivers from the ESXi offline bundle of the version you are currently running. For example, let's say you are running ESXi 6.5 U1. Go to https://my.vmware.com, log in using your credentials, go to Downloads, and select ESXi 6.5 U1 as the product. Download the Offline Bundle (zip). Once the download completes, unzip the file and navigate to the vib folder until you find the correct driver. In this case we are looking for lsi_mr3 version 6.910.18.00-1vmw.650.0.0.4564106. You can then take that VIB and use your preferred VIB install method to update your drivers.
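
A minimal sketch of that workflow from the ESXi shell, assuming the VIB extracted from the offline bundle has been copied to /tmp on the host (the file name below is illustrative; use the exact one from your bundle):

esxcli software vib list | grep -i lsi

esxcli software vib install -v /tmp/VMware_bootbank_lsi-mr3_6.910.18.00-1vmw.650.0.0.4564106.vib

The first command shows the driver VIB currently installed; the second installs the inbox VIB (an absolute path is required). If the newer async driver blocks the install, remove it first with "esxcli software vib remove -n <driver-name>", and remember that a reboot is needed before the host picks up the replacement driver.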


Replacing vCenter with vSAN Encryption Enabled

In my previous post, I talked about the vSAN Encryption configuration and key re-generation, among other topics. In that post you can see that there is a trust relationship between vCenter and the KMS server/cluster. But what happens if my vCenter dies, gets corrupted, or I simply want to build a new vCenter and migrate my vSAN nodes to it with encryption enabled?

One day, the lab that hosts my vCenter had a power issue and the VCSA became corrupted. I was able to fix the issue, but then discovered that SSO was not working. I figured it was faster to deploy a new VCSA than to keep troubleshooting (yes, I'm impatient), so I deleted the old vCenter and proceeded to deploy a new one.

As I was adding the hosts to the new vCenter, I remembered that vSAN Encryption was enabled. Now what? Sure enough, after all the hosts were moved, the drives within the disk groups were showing as unmounted. I went ahead and created a new trust relationship with the same KMS cluster, but the issue persisted.

If you run the command "vdq -q" from one of the hosts, you will see that your drives are not mounted and are ineligible for use by vSAN. In the UI you will see that your disks are encrypted and locked because the encryption key is not available.
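
To dig a little deeper, vdq can also dump the disk-group mappings per host; both commands are read-only and safe to run:

vdq -q
vdq -iH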

The FIX:

In order to recover from this and similar scenarios, it is necessary to re-create the KMS cluster in the new vCenter with the same exact configuration as before. Although I did establish a relationship with the same KMS cluster, I missed a very important step: the KMS cluster ID.

It is imperative that the KMS cluster ID remains the same in order for the recovery to work. Let's think about the behavior: although the old vCenter is gone, the hosts still have the information and keys from the KMS cluster. If we connect to the same KMS cluster with the same cluster ID, the hosts will be able to retrieve the keys (assuming the keys still exist and were not deleted). The KMS credentials will be re-applied to all hosts so the hosts can connect to the KMS and get the keys.

Remember that the old vCenter was gone, so I couldn't retrieve the KMS cluster ID from the vCenter KMS configuration screen, and this environment was undocumented since it is one of my labs (it is now documented). Now what?

Glad you asked. Let’s take a look at the encryption operation.

In this diagram we can see how the keys are distributed to vCenter, the hosts, etc. The KMS server settings are passed from vCenter to the hosts along with the KEK ID.

In order to obtain the KMIP cluster ID, we need to look for it in the esx.conf file on the hosts. You can use cat, vi, or grep (easier) to look at the file. You want to look for kmipClusterId, name (alias), etc. Make sure the KMS cluster on the new vCenter is configured exactly as it was before.

cat /etc/vmware/esx.conf 

or something easier…

grep "/vsan/kmipServer/" /etc/vmware/esx.conf

After the KMS cluster has been added to the new vCenter exactly as it was configured in the old vCenter, there is no need for reboots. During reconfiguration, the new credentials are sent to all hosts, and the hosts should reload the keys for all disks within a few minutes.
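
Once the keys reload, you can re-run the same query from earlier on each host to confirm the disks are back in business; the state should flip from ineligible back to in-use by vSAN:

vdq -q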