vSAN Encryption KMS info retrieval

A few years ago I wrote a blog post about “Replacing vCenter with vSAN Encryption Enabled”. For that exercise, one key piece of information that needed to be retrieved was the kmipClusterId.

A couple of things have changed since then in newer versions of vSAN.

Change #1: ESXCLI commands

An easier way to retrieve this information was added via an esxcli command. It lets you obtain a lot of information about the state of vSAN encryption and retrieve the hostKeyId, kekID, etc.

esxcli vsan encryption <option> get/list

 

So, based on this addition, you can now get the kmipClusterId needed for vCenter replacement by using esxcli vsan encryption kms list.
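For example, run it against one of the hosts in the encrypted cluster and look for the kmipClusterId entry in the output:

esxcli vsan encryption kms list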

As you can see, you can still look for this information in the esx.conf file, which is where the hosts store it for this particular version of vSAN (6.7 P01 – Build 15160138). That brings me to the second update…
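For reference, a quick way to pull it straight from the file on a host at this version (assuming the entry is still stored in esx.conf, as it is on this build):

grep -i kmipClusterId /etc/vmware/esx.conf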

 

Change #2: vSAN Persistence

In vSAN 7.0 and beyond, some changes were made to how this configuration gets stored. The encryption information that was previously file based (esx.conf) is now stored in a database. Among other advantages, this provides better concurrency for multiple readers and writers than the file-based esx.conf option.

The good news is that the esxcli vsan encryption command still lets you retrieve the encryption information you need. However, if you look for it in the esx.conf file, you won’t find it there anymore.

Alternatively, you can retrieve the information directly from the config store… probably more info than you need, so I’d just stick to the esxcli commands.
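If you do want to poke at the config store, ESXi 7.0 includes the configstorecli utility, and the general form looks like the line below. I am not listing the vSAN encryption component/group/key names here since I have not verified them, so treat those as something to look up in your build rather than a given:

configstorecli config current get -c <component> -g <group> -k <key>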

VCF Lab Tips: NSX Cluster size

VMware Cloud Foundation (VCF) is quickly becoming the go-to for many companies. The operational efficiency it brings, along with its best-practices-driven architecture, makes it a no-brainer when it comes to value. As with any purchase, many people like to kick the tires on a new product, or just want to get familiar with it via Proofs of Concept, virtual labs, home labs, etc. Testing VCF is a great way to learn it, but because it follows best practices (VMware Validated Designs), some decisions are made for you; one of them is the NSX Cluster.

To keep things simple, I will refer to VCF 4.0+, where NSX-T is used for both Management and VI Workload Domains… no more NSX-V. To deploy VCF, we use a deployment parameter worksheet that can be downloaded from my.vmware.com.

This worksheet will deploy 3 NSX-T Managers and create a cluster behind a Virtual IP (VIP). The NSX-T Managers come in “t-shirt” sizes; the default is Medium appliances, but they can be changed to either Large or Small on the worksheet.

 

As you can see from the worksheet, it requires 3 NSX-T Managers to be deployed, so this is where we can use other avenues to reduce that resource consumption.

TIP 1:

If you wish to deploy all 3 NSX-T Managers, you can change the size to Small on the worksheet prior to VCF bring-up in order to reduce the resource footprint.

 

TIP 2:

This second option sets the size to Small and, at the same time, creates a single-node NSX cluster. This can be done by using a json file during VCF bring-up rather than the worksheet. Within the json file, remove the additional NSX-T Manager entries and leave only one node, along the lines of the snippet below.
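As a rough sketch, the relevant part of the json ends up looking something like this. The field names below are my recollection of the vcf-public-ems template and may differ in your VCF version, and the hostname and IP values are just placeholders, so verify everything against your generated file:

"nsxtSpec": {
  "nsxtManagerSize": "small",
  "nsxtManagers": [
    {
      "hostname": "nsxt-mgr-1",
      "ip": "10.0.0.20"
    }
  ],
  ...
}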

For additional information on how to obtain the json file, you can find the procedure here.

 

TIP 3:

The third option is a post-bring-up procedure. If VCF has already been stood up and you want to minimize resources in the lab, removing nodes from the NSX cluster is a viable solution. This can be done from the CLI of one of the NSX cluster nodes.

It is necessary to SSH into one of the cluster nodes in order to remove nodes from the NSX cluster. If you are unable to SSH, verify that AllowRootAccess is enabled and StrictMode is set to no, then restart the ssh service with the following command:

/etc/init.d/ssh restart

Then ssh into that node using the admin account. Once logged in, there is a list of commands available, including get and detach.

 

Use the get command to retrieve the IDs of the cluster nodes.

get cluster status

 

Use the ID along with the detach command to remove a specific node. Repeat the process for the second node until there is only one left.

detach node <node-id>
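So for a default three-node cluster, the trimming looks roughly like this; the node IDs are whatever get cluster status returned, and I would run the detach commands from the node you plan to keep:

get cluster status
detach node <second-node-id>
detach node <third-node-id>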

 

I want to reiterate that this is a resource-saving workaround for a lab environment. For production environments, please follow the recommendations/best practices already applied when deploying VCF.

VCF: Generate JSON File from Excel Spreadsheet

Performing VCF bring-up includes “feeding” Cloud Builder all the information needed to deploy the components automagically for you, including vCenter, vSAN, NSX, and SDDC Manager, and to configure them all to actually work together. Sounds too good to be true, but it is… it is true indeed.

I personally like to use json files, as they give me an easier way to replace IPs, passwords, etc., as well as change the size and number of components. More on that later…

Once you deploy the Cloud Builder appliance, you can use the completed worksheet or you can use a json file. Cloud Builder provides a way to convert your Excel spreadsheet into a json file via a python script.

There is an official document on this procedure, and probably a couple of blog posts out there; however, there was a recent move to Python3, so the syntax has changed a bit.

Here are the steps to generate the json file with python3:

  • Use a file transfer utility (WinSCP) to copy the Excel file to Cloud Builder.
    • Log in with the Cloud Builder admin account
    • Copy/paste the Excel file from your computer to /home/admin

  • Copy the Excel file from /home/admin to /opt/vmware/sddc-support using the sudo command to gain access to the destination, or switch to root (su)
    • sudo cp <file-name.xlsx> /opt/vmware/sddc-support

 

  • Change directory to /opt/vmware/sddc-support and verify the xlsx file was copied successfully

  • Then run the following command to generate the json file
    • sudo python3 -m cloud_admin_tools.JsonGenerator.JsonGenerator -t cloud_admin_tools/JsonGenerator/template -d vcf-public-ems -i <file-name-path.xlsx> 
    • you can use -h to see other flags available
    • you should see the json file being generated


  • Navigate to your output location, or if you used the default, navigate to /opt/vmware/Resources/<design chosen (vcf-ems, vcf-public-ems)>
  • Verify that you see the json file; you can then read it
  • cat vcf-public-ems.json
  • You may need to switch to the root user to access the output directory.

 

  • You can also assign an output directory with the -o or --output flag
    • In this example, I set the json output directory to /home/admin/

 

  • You can now export the file and use it as the input file during the Cloud Builder bring-up workflow
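Putting the steps above together, the whole flow looks roughly like this (assuming the default template, a copy over scp instead of WinSCP, and an example spreadsheet name; the output file name follows the design you chose):

# from your workstation
scp vcf-deployment-parameters.xlsx admin@<cloudbuilder-fqdn>:/home/admin/

# on Cloud Builder, logged in as admin
sudo cp /home/admin/vcf-deployment-parameters.xlsx /opt/vmware/sddc-support/
cd /opt/vmware/sddc-support
sudo python3 -m cloud_admin_tools.JsonGenerator.JsonGenerator -t cloud_admin_tools/JsonGenerator/template -d vcf-public-ems -i vcf-deployment-parameters.xlsx -o /home/admin/
cat /home/admin/vcf-public-ems.json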


@GreatWhiteTec


VUM Host upgrade fails – Another task is already in progress

I came across this issue a few times already, so I figured it would be good to share the findings, especially since it took me down the rabbit hole.

First off, the error “Another task is already in progress” is kind of obvious and yet does not provide enough detail. I ran into this while trying to upgrade 4 hosts in a cluster via VUM; no tasks were running at the time the upgrade started. This was initiated from VMware Cloud Foundation. Thinking the issue was with VCF, I tried to do a direct upgrade from VUM, skipping VCF… but received the same error. I also tested with/without vSAN, NFS, etc.

The issue appeared to be related to a race condition where both the VUM remediation AND this “other task” were trying to run at the same time on that host. From the task console, we can see that another “install” operation started running just as VUM was trying to remediate. The Initiator gave me a hint: we can see vcHms “cut in line” and performed an install operation.

I started looking into the esxupdate log (/var/log/esxupdate.log), and noticed that the host was trying to install a vib. But this vib was already installed, so what gives?

The name of the initiator and one of the vibs gave me a clue (vr2c-firewall.vib). I was running vSphere Replication on that particular cluster, so I started digging in. To validate my suspicion, I shut down the VR appliance and attempted to upgrade one host. The upgrade worked as expected with no errors, so I was pretty certain something within vSphere Replication was causing this.

First, I needed to ssh into the VR appliance or use the console. You won’t be able to ssh into the appliance before enabling ssh, so log in to the console with the root password and run the script to enable it:

/usr/bin/enable-sshd.sh

Within the appliance, I looked into the config settings for HMS and discovered that there were 2 vibs set to auto-install at host reconnect. So it appears there is a race condition where VUM and VR are both trying to do an install task at the same time (at reconnect). Makes perfect sense now.


The ESXi update logs indicated that the vr2c-firewall.vib was the one trying to install. After re-checking the vibs (esxcli software vib list) on all the hosts, I did see that this vib was already installed, but the task from VR kept trying to install it at reconnect.
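To quickly confirm whether the vib is actually present on a host, something like this works (the grep pattern is just the vib name from the log):

esxcli software vib list | grep -i vr2c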

As a workaround, I decided to disable the auto-install of this particular vib by running the following command from within the /opt/vmware/hms/bin directory and then restarting the hms service:

./hms-configtool -cmd reconfig -property hms-auto-install-vr2c-vib=false

service hms restart
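To confirm the change took, you can check the HMS configuration file for the property; the path below is my assumption of the default location on the VR appliance, so adjust it if your install differs:

grep -i vr2c /opt/vmware/hms/conf/hms-configuration.xml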


This workaround worked like a charm, and I was able to upgrade the rest of the cluster using VUM. I did not find an official KB about this, and this is by no means an official workaround/fix.

Disclaimer: If you plan to implement this fix, be aware that this is not an official VMware blog, and changes to products may or may not cause issues and/or affect support.

Devices unavailable for vSAN

I’ve been experiencing this scenario quite a bit lately, so I figured I’d write something about it; it also helps me refresh my memory.

Lately I’ve been helping out with a lot of VMware Cloud Foundation Proof of Concepts (VCF POCs)… That’s a mouthful!

Upon inspecting the environment I am supposed to work on, I have found hosts that were once part of a vSAN cluster but were not properly cleaned up prior to the ESXi rebuild.

The vSAN clean up process involves the following steps:

  1. Set host in Maintenance Mode
  2. Delete Disk Group(s)
    • This removes vSAN partitions as well
  3. Remove host from vSAN cluster
  4. Clean up network
  5. Remove from vCenter

As previously mentioned, I usually get called in after the vCenter is gone and hosts have been re-imaged. Now what?!?!

The steps to manually clean the hosts involve the following tasks:

Delete Disk Groups via CLI

Since we do not have access to vCenter in this case, we can delete the Disk Group via CLI using esxcli commands.

You can remove the disks by specifying the SSD cache device, which removes that device and all backing capacity devices, or you can do it one device at a time by using the device’s UUID.

To remove the devices one at a time, you can use the vdq -q command to list the devices and then use esxcli vsan storage remove -u <uuid>. I prefer the other method since it is a lot faster…

  • List devices by using vdq -i

  • Run esxcli vsan storage remove -s <cache device>
  • In this case we will run this command for naa.55cd2e404b66fcbb and naa.55cd2e404b4393b9
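As a sketch, the whole sequence on this example host looks like this (the device names are the ones listed above; yours will differ):

vdq -i
esxcli vsan storage remove -s naa.55cd2e404b66fcbb
esxcli vsan storage remove -s naa.55cd2e404b4393b9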

 

Now What?!?!

Now that we have deleted the disk groups, we need to make sure that the host doesn’t think it is still part of a vSAN cluster.

Remove Host from vSAN Cluster via CLI

  • Check vSAN cluster membership
    • esxcli vsan cluster get
  • If it belongs to a cluster, remove it
    • esxcli vsan cluster leave

Clean up Network configuration

Since the hosts were re-imaged, this should be all set to default, but in case this was not cleaned up, or was manually re-created, you can clean it up by resetting the configuration.

The following command resets the root password, overrides all network configuration changes, and reboots the host. Don’t do this unless you are planning to reset the host to defaults… You’ve been warned.

/bin/firmwareConfig.py --reset

 

At this point, you can either join the cleaned host to a new vSAN cluster, commission it in VCF SDDC Manager, or use it as part of the VCF bring-up process for the Management Domain.

 

@GreatWhiteTec