vSAN Encryption at Rest & In Transit: What is the difference?

In the past, I’ve written a few posts about vSAN Data-at-Rest Encryption, which became available with vSAN 6.6. You can find those posts here. vSAN 7.0U1 adds a new encryption option: Data-In-Transit Encryption. So what is the difference? Do I have to choose one, or can I use both? Let’s find out.

vSAN Data at Rest Encryption

Data-at-Rest Encryption (D@RE) was designed to do just that: encrypt all your data once it lands on the disks used by vSAN. This works regardless of the Storage Policy you choose, and all data replicas are encrypted at both the cache layer and the capacity layer. One major advantage of Data-at-Rest Encryption over vSphere VM encryption is that vSAN still lets you encrypt your data and take advantage of space-saving features such as deduplication and compression. When the data lands in cache it is encrypted using the Data Encryption Key (DEK). While the data is being destaged to the capacity layer it is decrypted, and it is here that deduplication and compression take place. Finally, when the data lands on the capacity devices, it is encrypted once again. It is also important to highlight that the DEK is protected by the Key Encryption Key (KEK), which comes from the Key Management Server (KMS)… and this is one of the differences between the two options.
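To make the DEK/KEK relationship a bit more concrete, here is a minimal sketch of the idea of one key wrapping another. It is only an illustration: the keystream_xor helper is a toy cipher I made up for the example (vSAN uses AES, not this), and the random KEK simply stands in for a key that would really come from the KMS.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with a SHA-256 counter keystream.

    Illustration only. This is NOT AES and NOT secure encryption.
    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


# KEK: in vSAN this comes from the KMS. Here it is just random bytes.
kek = secrets.token_bytes(32)
# DEK: the key that actually encrypts the data on disk.
dek = secrets.token_bytes(32)

data = b"block written to a vSAN disk"
encrypted_data = keystream_xor(dek, data)  # data protected by the DEK
wrapped_dek = keystream_xor(kek, dek)      # DEK protected by the KEK

# Reading back: unwrap the DEK with the KEK, then decrypt the data.
recovered_dek = keystream_xor(kek, wrapped_dek)
recovered_data = keystream_xor(recovered_dek, encrypted_data)
```

The point of the sketch is the layering: without the KEK the stored DEK is useless, and without the DEK the data on disk stays unreadable.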

vSAN Data in Transit Encryption

Data-In-Transit Encryption (DIT) completes the end-to-end picture by encrypting data while it is in transit between hosts. Data-at-Rest encryption only encrypts the data when it lands on disk, so if someone takes a disk out of a server, all data on it is encrypted. But what about other attacks, such as man-in-the-middle attacks? This is where Data-In-Transit encryption protects the data. The keys used for DIT encryption are managed internally, so there is no need for a KMS. These keys are also rotated much, much faster than with D@RE. DIT encryption keys are rotated daily by default, but you can change this and rotate keys anywhere from every 6 hours to every 7 days. Just like D@RE, DIT encryption works at the vSAN cluster level, so either all the hosts are protected or none are.
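As a quick sanity check on that rotation range, here is a small sketch that validates a proposed interval against the 6 hour and 7 day bounds mentioned above. The function names and the conversion to minutes are my own choices for the example, not something taken from a vSAN API.

```python
from datetime import timedelta

# Bounds taken from the text above: 6 hours up to 7 days.
MIN_REKEY = timedelta(hours=6)
MAX_REKEY = timedelta(days=7)


def is_valid_rekey_interval(interval: timedelta) -> bool:
    """Return True if the interval falls inside the allowed range."""
    return MIN_REKEY <= interval <= MAX_REKEY


def to_minutes(interval: timedelta) -> int:
    """Express an interval in whole minutes (hypothetical convenience helper)."""
    return int(interval.total_seconds() // 60)


for candidate in (timedelta(hours=5), timedelta(hours=12), timedelta(days=8)):
    print(to_minutes(candidate), is_valid_rekey_interval(candidate))
```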

Here is a quick comparison between the two options

FAQ

Can I enable both at the same time?

Yes. You can enable both Data-at-Rest and Data-In-Transit encryption to get full protection in your vSAN environment. It is recommended to enable vSAN Data-at-Rest encryption in the early stages of the cluster to minimize the time needed for on-disk formatting, as there is less data to move around.

What is the performance impact of turning encryption on?

There are a lot of variables that come into play when we talk about performance. However, both vSAN encryption options take advantage of AES-NI to offload operations and reduce any performance hit. Most modern CPUs have AES-NI, but sometimes the feature is not enabled, so make sure to check this at deployment. Please also be mindful that enabling D@RE when the cluster already holds a lot of data will result in large amounts of data being moved, so plan to do this during off hours if possible.

What vSAN License do I need to enable vSAN Encryption?

To enable Data-at-Rest and/or Data-In-Transit Encryption you will need vSAN Enterprise or vSAN Enterprise Plus licenses. Refer to the licensing guide here.

How do I enable Data-In-Transit Encryption?

Enabling DIT encryption is easy. Within the vCenter UI, select the vSAN cluster > Configure > Services. Data-In-Transit encryption can be enabled with or without Data-at-Rest encryption. This is also where you can change the key rotation schedule for the DIT encryption keys.

@GreatWhiteTec

vSAN Encryption KMS info retrieval

A few years ago I wrote a blog post about “Replacing vCenter with vSAN Encryption Enabled”. For that particular exercise, one key piece of information that needed to be retrieved was the kmipClusterId.

A couple of things have changed since then in newer versions of vSAN.

Change #1: ESXCLI commands

An easier way to retrieve this information was added in the form of an esxcli command. This command lets you obtain a lot of information about the state of vSAN encryption, including the hostKeyId, kekID, etc.

esxcli vsan encryption <option> get/list

 

So, based on this addition, you can now get the kmipClusterId needed for vCenter replacement by using esxcli vsan encryption kms list
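If you capture that output on a host and want to pull the ID out programmatically, a rough sketch could look like the following. It assumes the output is made of simple "Key: Value" lines. The sample text and the field name "KMIP Cluster ID" are placeholders I made up, so check them against what your build actually prints.

```python
def parse_key_value_lines(text: str) -> dict:
    """Turn 'Key: Value' lines into a dict, ignoring lines without a colon."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            result[key.strip()] = value.strip()
    return result


# Hypothetical sample; real field names and layout may differ by build.
SAMPLE_OUTPUT = """
KMS Server
   Name: kms-01
   KMIP Cluster ID: example-kms-cluster
   Address: 192.0.2.10
   Port: 5696
"""

fields = parse_key_value_lines(SAMPLE_OUTPUT)
cluster_id = fields.get("KMIP Cluster ID")  # assumed field name
print(cluster_id)
```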

As you can see, you can still look for this information in the esx.conf file, which is where the hosts store it for this particular version of vSAN (6.7 P01 – Build 15160138). Which brings me to the second update…

 

Change #2: vSAN Persistence

In vSAN 7.0 and beyond, some changes were made to how this configuration gets stored. The encryption information that was previously file based (esx.conf) is now stored in a database. Among other advantages, this provides better concurrency for multiple readers and writers than the file-based esx.conf option.

The good news is that the esxcli vsan encryption command still lets you retrieve the encryption information you need. However, if you attempt to retrieve this information from the esx.conf file, you won’t find it there anymore.

Alternatively, you can retrieve the information directly from the config-store… though that may be more info than you need. So, I’d just stick to the esxcli commands.

VCF Lab Tips: NSX Cluster size

VMware Cloud Foundation (VCF) is quickly becoming the go-to for many companies. The operational efficiency it brings, along with its best-practices-driven architecture, makes it a no-brainer when it comes to value. As with any purchase, many people like to kick the tires on a new product, or just want to get familiar with it via Proofs of Concept, virtual labs, home labs, etc. Testing VCF is a great way to learn it, but because it uses best practices (VMware Validated Designs), some decisions are made for you. One of them is the size of the NSX Cluster.

To keep things simple, I will refer to VCF 4.0+, where NSX-T is used for Management and VI Workload Domains… no more NSX-V. To deploy VCF we use a worksheet that can be downloaded from my.vmware.com

This worksheet will deploy 3 NSX-T Managers and create a cluster under a Virtual IP (VIP). The NSX-T Managers come in “t-shirt” sizes. Medium NSX-T Appliances are deployed by default, but this can be changed to either Large or Small on the worksheet.

 

As you can see from the worksheet, it requires 3 NSX-T Managers to be deployed. So here is where we can use other avenues to reduce that resource consumption.

TIP 1:

If you wish to deploy all 3 NSX-T Managers, you can change the size to small on the worksheet in order to reduce the resource footprint, prior to VCF bring-up.

 

TIP 2:

This second option lets you set the size to small and also create a single-node cluster. This can be done by using a json file during VCF bring-up rather than the worksheet. Within the json file, remove the additional NSX-T Manager entries and leave only one node.

For additional information on how to obtain the json file, you can find the procedure here.
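Editing the json by hand works fine, but here is a small sketch of the same change done in a script. I am assuming the NSX-T section lives under a key called nsxtSpec, with nsxtManagerSize and nsxtManagers inside it, so verify those names against your own generated file. The sample spec is made up and heavily shortened.

```python
import copy
import json


def shrink_nsx_cluster(spec: dict) -> dict:
    """Return a copy of the bring-up spec with a single small NSX-T Manager."""
    trimmed = copy.deepcopy(spec)
    nsxt = trimmed["nsxtSpec"]                       # assumed key name
    nsxt["nsxtManagerSize"] = "small"                # assumed key name
    nsxt["nsxtManagers"] = nsxt["nsxtManagers"][:1]  # keep the first node only
    return trimmed


# Hypothetical, heavily shortened spec used only to exercise the function.
sample_spec = {
    "nsxtSpec": {
        "nsxtManagerSize": "medium",
        "nsxtManagers": [
            {"hostname": "nsx-a", "ip": "192.0.2.11"},
            {"hostname": "nsx-b", "ip": "192.0.2.12"},
            {"hostname": "nsx-c", "ip": "192.0.2.13"},
        ],
    }
}

lab_spec = shrink_nsx_cluster(sample_spec)
print(json.dumps(lab_spec, indent=2))
```

Working on a deep copy keeps the original spec untouched, which is handy if you want to keep the full three-node version around for comparison.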

 

TIP 3:

Another option is a post bring-up procedure. If VCF has already been stood up and you want to minimize resource usage within the lab, removing nodes from the NSX Cluster is a viable solution. This can be done from the CLI within the NSX cluster.

You need to SSH into one of the cluster nodes in order to remove nodes from the NSX cluster. If you are unable to SSH, verify that AllowRootAccess is enabled and StrictMode is set to no. Then restart the ssh service with the following command:

/etc/init.d/ssh restart

Then ssh into that node using the admin account. Once logged in, there is a list of commands available, including get and detach.

 

Use the GET command to get the ID of the cluster nodes.

get cluster status

 

Use the ID along with the detach command to remove a specific node. Repeat the process to remove the second node, so that there is only one node left.

detach node <node-id>

 

I want to reiterate that this is a good resource-saving workaround for a lab environment. For production environments, please follow the established recommendations/best practices for deploying VCF.

VCF: Generate JSON File from Excel Spreadsheet

Performing VCF bring-up includes “feeding” Cloud Builder with all the information needed to deploy all the components automagically for you, including vCenter, vSAN, NSX, SDDC Manager and configuring all to actually work together. Sounds too good to be true, but it is… it is true indeed.

I personally like to use json files, as they give me an easier way to replace IPs, passwords, etc., as well as change the size and number of components. More on that later…

Once you deploy the Cloud Builder appliance, you can use the completed worksheet or you can use a json file. Cloud Builder provides a way to convert your Excel spreadsheet into a json file via a python script.

There is an official document on this procedure, and probably a couple of blog posts out there; however, there was a recent move to Python3, so the syntax has changed a bit.

Here are the steps to generate the json file with python3:

  • Use a file transfer utility (such as WinSCP) to copy the Excel file to Cloud Builder.
    • Log in with the Cloud Builder admin account
    • Copy/paste the Excel file from your computer to /home/admin

  • Copy the Excel file from /home/admin to /opt/vmware/sddc-support, using sudo to gain access to the destination, or switch to root (su)
    • sudo cp <file-name.xlsx> /opt/vmware/sddc-support

 

  • Change directory to /opt/vmware/sddc-support and verify the xlsx file was copied successfully

  • Then run the following command to generate the json file
    • sudo python3 -m cloud_admin_tools.JsonGenerator.JsonGenerator -t cloud_admin_tools/JsonGenerator/template -d vcf-public-ems -i <file-name-path.xlsx>
    • You can use -h to see the other flags available
    • You should see the json file being generated


  • Navigate to your output location or, if you used the default, navigate to /opt/vmware/Resources/<design chosen (vcf-ems, vcf-public-ems)>
  • Verify that you can see the json file and read it
    • cat vcf-public-ems.json
  • You may need to switch to the root user to access the output directory.

 

  • You can also assign an output directory with the flag -o or --output
    • In this example I set the json file directory to be /home/admin/

 

  • You can now export the file and use it as the input file during the Cloud Builder bring-up workflow
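If you generate these files often, the command from the steps above can be assembled by a small helper so you only swap the spreadsheet path and output directory. This is just a sketch: it builds the argument list using the flags shown in this post (-t, -d, -i, -o) and prints it rather than running anything. The function name and the lab.xlsx path are placeholders of mine.

```python
import shlex


def build_json_generator_command(xlsx_path, design="vcf-public-ems", output_dir=None):
    """Assemble the JsonGenerator command line as a list of arguments."""
    cmd = [
        "sudo", "python3", "-m",
        "cloud_admin_tools.JsonGenerator.JsonGenerator",
        "-t", "cloud_admin_tools/JsonGenerator/template",
        "-d", design,
        "-i", xlsx_path,
    ]
    if output_dir is not None:
        cmd += ["-o", output_dir]  # optional output directory
    return cmd


# Placeholder spreadsheet name; substitute your own file.
command = build_json_generator_command(
    "/opt/vmware/sddc-support/lab.xlsx", output_dir="/home/admin/"
)
print(shlex.join(command))
```

The printed line can be pasted into the Cloud Builder shell from /opt/vmware/sddc-support, the same directory used in the steps above.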


@GreatWhiteTec

 

 

VUM Host upgrade fails – Another task is already in progress

I have come across this issue a few times already, so I figured it would be good to share the findings, especially since it took me down the rabbit hole.

First off, the error “Another task is already in progress” is kind of obvious, yet it does not provide enough detail. I ran into this while trying to upgrade 4 hosts in a cluster via VUM, with no tasks running when the upgrade started. The upgrade was initiated from VMware Cloud Foundation. Thinking the issue was with VCF, I tried a direct upgrade from VUM, skipping VCF… but received the same error. I also tested with and without vSAN, NFS, etc.

The issue appeared to be related to a race condition where both the VUM remediation AND this “other task” were trying to run at the same time on that host. From the task console we can see that another “install” operation started running just as VUM was trying to remediate. The Initiator gave me a hint: we can see that vcHms “cut in line” and did an install operation.

I started looking into the esxupdate log (/var/log/esxupdate.log), and noticed that the host was trying to install a vib. But this vib was already installed, so what gives?

The name of the initiator and one of the vibs (vr2c-firewall.vib) gave me a clue. I was running vSphere Replication on that particular cluster, so I started digging in. To validate my suspicion, I shut down the VR appliance and attempted to upgrade one host. The upgrade worked as expected with no errors, so I was pretty certain something within vSphere Replication was causing this.

First I needed to ssh into the VR appliance or use the console. You won’t be able to ssh into the appliance before enabling ssh. Log in to the console with the root password and run the script to enable ssh.

/usr/bin/enable-sshd.sh

Within the appliance I looked into the config settings for HMS and discovered that there were 2 vibs set to auto install at host reconnect. So it appears there is a race condition when VUM and VR are both trying to run an install task at the same time (at reconnect). Makes perfect sense now.


The ESXi update logs indicated that vr2c-firewall.vib was the one trying to install. After re-checking the vibs (esxcli software vib list) on all the hosts, I did see that this vib was already installed, but the task from VR kept trying to install it at reconnect.

As a workaround, I decided to disable the auto install of this particular vib by running the following command from within the /opt/vmware/hms/bin directory and then restarting the hms service:
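Checking every host by eye gets tedious, so here is a rough sketch of how the output of esxcli software vib list could be checked for a given vib once you have saved it as text. The sample rows are made up, and I am assuming the vib shows up in the Name column as vr2c-firewall (without the .vib extension).

```python
def installed_vib_names(vib_list_output: str) -> set:
    """Collect the first column (Name) from 'esxcli software vib list' text."""
    names = set()
    for line in vib_list_output.splitlines():
        columns = line.split()
        if not columns:
            continue  # blank line
        if columns[:2] == ["Name", "Version"]:
            continue  # header row
        if set(columns[0]) == {"-"}:
            continue  # dashed separator row
        names.add(columns[0])
    return names


# Hypothetical sample rows; versions and dates are invented for the example.
SAMPLE = """\
Name           Version        Vendor  Acceptance Level  Install Date
-------------  -------------  ------  ----------------  ------------
vr2c-firewall  1.0.0-example  VMware  VMwareCertified   2020-01-01
esx-base       6.7.0-example  VMware  VMwareCertified   2020-01-01
"""

vibs = installed_vib_names(SAMPLE)
print("vr2c-firewall" in vibs)
```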

./hms-configtool -cmd reconfig -property hms-auto-install-vr2c-vib=false

service hms restart


This workaround worked like a charm and I was able to upgrade the rest of the cluster using VUM. I did not find an official KB about this, and this is by no means an official workaround/fix.

Disclaimer: If you plan to implement this fix, be aware that this is not an official VMware blog, and changes to products may or may not cause issues and/or affect support.