Considerations for using LACP with vSAN

I am a firm believer in spending a good amount of time on the design phase of any deployment. Planning ahead and knowing your limitations will make your life easier; maybe just a little bit, but every bit helps.

If you are planning on using LACP for link aggregation with vSAN, I strongly advise you to get familiar with your options and check the vSAN Network Design Guide at storagehub.vmware.com. There you will learn about NIC teaming options and LACP requirements such as LAGs and a vDS, as well as the pros and cons (below).

Pros and Cons of dynamic Link Aggregation/LACP (from Storagehub)

Pros

  • Improves performance and bandwidth: one vSAN node or VMkernel port can communicate with many other vSAN nodes using a variety of load-balancing options.
  • Network adapter redundancy: if a NIC fails and its link state goes down, the remaining NICs in the team continue to pass traffic.
  • Rebalancing of traffic after failures is fast and automatic.

Cons

  • Physical switch configuration: less flexible, and requires that physical switch ports be configured in a port channel.
  • Complex: introducing full physical redundancy gets very complex when multiple switches are used, and implementations can become quite vendor specific.


In addition, you should also consider the vSphere limitations of using LACP. Some of them include:

  • LACP settings are not available in host profiles
  • Port mirroring is not supported
  • It does not work with the ESXi Dump Collector

For a complete list of limitations, visit VMware Docs here.
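If you want a quick look at LACP state from a host, recent ESXi releases expose it through esxcli. A minimal sketch (output fields vary by version):

esxcli network vswitch dvs vmware lacp config get

esxcli network vswitch dvs vmware lacp status get

The status output shows, per LAG, which uplinks have negotiated successfully with the physical switch, which helps when validating the port-channel configuration on both ends.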

Be cognizant of some “gotchas” as well. For example, restarting the management agents on an ESXi host running vSAN with LACP via the services.sh script may cause issues, because "services.sh restart" also restarts the LACP daemon (/etc/init.d/lacp). Instead, use "/etc/init.d/<module> restart" to restart individual services. See KB1003490.
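For example, to bounce just the host agent and the vCenter agent without touching the LACP daemon:

/etc/init.d/hostd restart

/etc/init.d/vpxa restart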


As with any other deployment, you should weigh the pros, cons, future plans, and environment dependencies, among other factors.

Remember the 7Ps – Proper Prior Planning Prevents Pitiful Poor Performance


Get Inbox drivers for Storage Controllers with vSAN

If you are familiar with vSAN, you already know that the VCG is a “must follow” guide to success. In certain instances, deployments may use ESXi OEM/custom images based on internal policies or even personal preference. However, these types of images contain vendor-specific tools, drivers, etc. One result of using such images is that storage controllers end up with async drivers rather than inbox drivers. For the sake of demonstration, we can focus on the PERC H730.

While checking Configuration Assist, you may see a warning stating that the driver recommended by the VCG is not currently installed. In the picture below you can see that we have uncertified drivers and need to roll them back to the correct version.

Links to download such drivers are typically found in the VCG or on the ESXi “Drivers” tab of the downloads page. On some occasions, the link may not be present for the version you are running. So how do you get the correct drivers? You certainly don’t want to be running drivers that have not been certified for vSAN, do you?! Of course not.

You can get such drivers from the ESXi offline bundle of the version you are currently running. For example, let’s say you are running ESXi 6.5 U1. Go to https://my.vmware.com, log in with your credentials, head to Downloads, and select ESXi 6.5 U1 as the product. Download the offline bundle (zip). Once the download completes, unzip the file and navigate the vib folder until you find the correct driver; in this case we are looking for lsi_mr3 version 6.910.18.00-1vmw.650.0.0.4564106. You can then take that VIB and use your preferred VIB install method to update your drivers.
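As a sketch, assuming you copied the extracted VIB to /tmp on the host (the exact filename inside the bundle may differ slightly from the one below), you can check the current driver and install the inbox version with esxcli. A reboot is typically required afterwards, and if the installed async driver is newer you may first need to remove it with esxcli software vib remove:

esxcli software vib list | grep -i lsi

esxcli software vib install -v /tmp/VMW_bootbank_lsi-mr3_6.910.18.00-1vmw.650.0.0.4564106.vib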


Replacing vCenter with vSAN Encryption Enabled

In my previous post, I talked about vSAN Encryption configuration and key re-generation, among other topics. In that post you can see that there is a trust relationship between vCenter and the KMS server/cluster. But what happens if your vCenter dies, gets corrupted, or you simply want to build a new vCenter and migrate your vSAN nodes to it with encryption enabled?

One day, the lab that hosts my vCenter had a power issue and the VCSA appliance became corrupted. I was able to fix the issue, but then discovered that SSO was not working. I figured it was faster to deploy a new VCSA appliance than to troubleshoot (yes, I’m impatient), so I deleted the old vCenter and deployed a new one.

As I was adding the hosts to the new vCenter, I remembered that vSAN encryption was enabled. Now what? Sure enough, after all the hosts were moved, the drives within the disk groups showed as unmounted. I went ahead and created a new trust relationship with the same KMS cluster, but the issue persisted.

If you run the command "vdq -q" on one of the hosts, you will see that the drives are not mounted and are ineligible for use by vSAN. In the UI you will see that the disks are encrypted and locked because the encryption key is not available.
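From an SSH session on an affected host (the esxcli line assumes vSAN 6.6 or later, where the output includes the encryption state per device):

vdq -q

esxcli vsan storage list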

The FIX:

In order to recover from this and similar scenarios, you need to re-create the KMS cluster entry in the new vCenter with the exact same configuration as before. Although I did establish a relationship with the same KMS cluster, I missed a very important step: the KMS cluster ID.

It is imperative that the KMS cluster ID remains the same in order for this recovery to work. Let’s think about the behavior: although the old vCenter is gone, the hosts still have the information and keys from the KMS cluster. If we connect to the same KMS cluster with the same cluster ID, the hosts will be able to retrieve the keys (assuming they still exist and were not deleted). The KMS credentials will be re-applied to all hosts so that the hosts can connect to the KMS and get the keys.

Remember that the old vCenter was removed, so I couldn’t retrieve the KMS cluster ID from the vCenter KMS config screen, and this environment was undocumented since it is one of my labs (it is now documented). Now what?

Glad you asked. Let’s take a look at the encryption operation.

In this diagram we can see how the keys are distributed to vCenter, the hosts, etc. The KMS server settings, along with the KEK ID, are passed from vCenter to the hosts; the hosts then use the KEK ID to retrieve the key from the KMS.

In order to obtain the KMIP cluster ID, we need to look in the esx.conf file on the hosts. You can use cat, vi, or grep (easier) to inspect the file; look for kmipClusterId, name (alias), etc. Make sure the KMS cluster on the new vCenter is configured exactly as it was before.

cat /etc/vmware/esx.conf 

or something easier…

grep "/vsan/kmipServer/" /etc/vmware/esx.conf

After the KMS cluster has been added to the new vCenter exactly as it was configured in the old vCenter, there is no need for reboots. During reconfiguration the new credentials are sent to all hosts, and the hosts should reload the keys for all disks within a few minutes.
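Once the keys reload, re-running vdq -q on the hosts should show the disks as mounted and eligible for use by vSAN again:

vdq -q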


vSAN Stats Object Out of Date

Several people have asked me this question, so I figured I’d write a quick post about it.

When the default vSAN policy was changed, people started noticing that the Stats Object (Health) would show as “Out of Date”, even though the policy was applied at the end of the wizard.

A few things to keep in mind:

  • The Stats Object is exactly that, an object, just like a VM home folder or a VMDK.
    • That object is associated with a policy, usually the default vSAN policy.
  • If you change a policy, you can apply it immediately through the wizard.
    • However, this applies the policy to the VMs (their objects).
    • The Stats Object is not part of any VM.
  • If you change the policy that the Stats Object is using or sharing with VMs, you will need to manually re-apply that policy to the Stats Object.

Scenario

  1. Policy change (the default policy in this case)
  2. Reapply the policy to VMs now
  3. The Stats Object shows “Out of Date”
  4. Edit the storage policy under Health and Performance and click OK
  5. This brings the Object back into compliance

[Screenshot: pol_apply_now]

[Screenshot: out-of-date]

[Screenshot: stats-compliant]

Quick Video about it

vSAN VCG Checks

One of the most important aspects of any storage solution is utilizing hardware to its advantage. Many storage vendors have taken advantage of faster drives and other technologies to create fast storage solutions, and vSAN is no different. Let’s discuss why it is so important for vSAN to run on compatible, supported hardware, and how to check for this.

One of the main requirements for VMware’s HCI solution is that the hardware be on its Hardware Compatibility List (HCL), also known as the VMware Compatibility Guide (VCG). This compatibility guide allows you to check existing hardware and/or hardware that you plan to purchase for vSAN. You can also check vSAN ready nodes against this guide.

Before you deploy vSAN, all hardware must have passed the compatibility check. This ensures that the best performance will be achieved and reduces possible hardware-related issues. Hardware compatibility with vSAN includes, but is not limited to, hard drives (MD), flash devices, storage controllers, etc. It is not only necessary for the hardware to be on the compatibility list; it must also run the appropriate firmware and driver versions for the specific version of ESXi.

How to check hardware against VCG

You can check hardware, firmware, and driver versions by going to VMware’s VCG website here.

You can also check compatibility of vSAN ready nodes at this site.

Once vSAN has been deployed, it will check your hardware compatibility against the downloaded VCG version. You can also update the local VCG copy from the Web UI: to make sure the HCL DB is up to date on your cluster, go to Cluster>Manage>Settings>Health and Performance and click “Get latest version online”.

[Screenshot: hcl_download]
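On vSAN 6.6 or later you can also spot-check a host’s storage controllers from the command line. This is a quick sketch assuming the esxcli vsan debug namespace is available in your build:

esxcli vsan debug controller list

The output lists each controller with its PCI ID, driver name, and driver version, which you can compare against the corresponding VCG entry.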


If your vCenter does not have access to the internet, you can download and upload the file manually, as follows:

  • Log in to a workstation with internet access
  • Go to https://partnerweb.vmware.com/service/vsan/all.json
  • Save the all.json file
  • From the same workstation, connect to your vCenter, or copy the file to another workstation/server with access to vCenter
  • From Cluster>Manage>Settings>Health and Performance in the Web UI, select “Update from file” and choose the all.json file you downloaded
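If you prefer the command line for the download step, curl from any machine with internet access fetches the same file:

curl -o all.json https://partnerweb.vmware.com/service/vsan/all.json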


If your hardware, firmware, or drivers are not compatible with the VCG, you will get a warning or an error.

[Screenshot: hcl_warning]


There is a “fling” tool that will also accomplish this but, in addition, will provide more information as to why there is a warning or error. The tool is called the “vSAN Hardware Compatibility List Checker”; very clever name, right?! It is an executable that runs from a Windows command prompt and produces a nice HTML report. You can download the tool from here.

Once downloaded, extract it on a Windows system, open a command prompt, and navigate to the folder. Launch hclCheck with the necessary flags (e.g. --hostname, --help, etc.). In my case I ran this against my home lab, and since I am using self-signed certs I had to use the --noSSLVerify flag. Note that this tool will download the latest version of the HCL DB and check against it.
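For example (the vCenter hostname below is just a placeholder; point the tool at your own environment):

hclCheck --hostname vcenter.lab.local --noSSLVerify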

[Screenshot: hclcheck_cli]

After a few seconds, the check completes and a report is created in the current directory.

[Screenshot: hclcheck_window]

Double-click the file to open the report in your default browser. One important piece to notice here is that the report also includes the PCI IDs for each device. So what, you may ask? Well, these can be used to check against the VCG and get the correct firmware and driver info. If the VCG shows multiple instances of the same controller, SSD, etc., check the PCI IDs to pick the correct entry and get the recommended driver and firmware versions.
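You can also pull those PCI IDs straight from a host. vmkchdev prints each PCI device as VID:DID SVID:SSID pairs, and filtering on vmhba narrows the list to storage adapters:

vmkchdev -l | grep vmhba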

In this report, you can see that my home lab hardware is not supported for vSAN… it works, but it is not supported.

[Screenshot: hclcheck_report]


Here is an example of multiple entries on the VCG; notice the different SSIDs (sub-device IDs).

[Screenshot: vcg_ssid]