vSAN VCG Checks

One of the most important aspects of any storage solution is using hardware to its advantage. Many storage vendors have taken advantage of faster drives and other technologies to build fast storage solutions, and vSAN is no different. We will discuss why it is so important for vSAN to run on compatible/supported hardware and how to check this.

One of the main requirements for VMware’s HCI solution is that the hardware be on its Hardware Compatibility List (HCL), also known as the VMware Compatibility Guide (VCG). This compatibility guide allows you to check existing hardware and/or hardware that you plan to purchase for vSAN. You can also check vSAN Ready Nodes against this guide.

Before you deploy vSAN, all hardware should have passed this compatibility check. This ensures the best possible performance and reduces the likelihood of hardware-related issues. Hardware compatibility with vSAN includes, but is not limited to, magnetic disks (MD), flash devices, storage controllers, etc. It is not enough for the hardware to be on the compatibility list; it must also run the appropriate firmware and driver versions for the specific version of ESXi.

How to check hardware against the VCG

You can check hardware, firmware, and driver versions on VMware’s VCG website here.

You can also check the compatibility of vSAN Ready Nodes on the same site.

Once vSAN has been deployed, it will check your hardware compatibility against the downloaded VCG version. You can also update the local VCG copy from the Web UI. To make sure the HCL DB is up to date on your cluster, go to Cluster>Manage>Settings>Health and Performance in the Web UI. You can update the list by clicking “Get latest version online”.

hcl_download
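
If you prefer to script the online update, newer PowerCLI releases (6.5.1 and later, if I remember correctly) include an Update-VsanHclDatabase cmdlet; here is a minimal sketch, assuming your vCenter has internet access and using a made-up vCenter name:

# Assumption: recent PowerCLI with the vSAN cmdlets; pulls the latest HCL DB from VMware online
Connect-VIServer -Server vcenter.lab.local
Update-VsanHclDatabase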

 

If your vCenter does not have access to the internet, you can download/upload the file manually, as follows (a minimal PowerShell sketch for the download step follows the list):

  • Log in to a workstation with access to the internet
  • Go to https://partnerweb.vmware.com/service/vsan/all.json
  • Save the all.json file
  • From the same workstation, connect to your vCenter, or copy the file to another workstation/server with access to vCenter
  • From Cluster>Manage>Settings>Health and Performance in the Web UI, select “Update from file” and choose the all.json file you downloaded
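
A minimal PowerShell sketch for the download step, assuming a workstation with PowerShell 3.0 or later (the local file path is just an example). Newer PowerCLI releases also include an Update-VsanHclDatabase cmdlet that can take the file path (check Get-Help Update-VsanHclDatabase in your version), or you can simply upload the file through the Web UI as described above:

# Download the offline copy of the vSAN HCL database
Invoke-WebRequest -Uri "https://partnerweb.vmware.com/service/vsan/all.json" -OutFile "C:\Temp\all.json"

# Optional, if you have a recent PowerCLI session connected to vCenter:
# Update-VsanHclDatabase -FilePath "C:\Temp\all.json"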

 

If your hardware, firmware, or drivers are not compatible according to the VCG, you will get a warning/error.

hcl_warning
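
You can also trigger the health check (which includes these HCL checks) from PowerCLI, if your version ships the vSAN health cmdlets; a minimal sketch with a made-up cluster name:

# Assumption: PowerCLI with the vSAN health cmdlets (6.5.1 and later); runs the cluster health test
Test-VsanClusterHealth -Cluster (Get-Cluster "vSAN-Cluster")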

 

There is a “fling” tool that will also accomplish this, but in addition, it will give you more information as to why there is a warning or error. The tool is called the “vSAN Hardware Compatibility List Checker”, very clever name, right?! It is an executable that runs from a Windows command prompt and produces a nice HTML report. You can download the tool from here.

Once downloaded, extract it on a Windows system, open a command prompt, and navigate to the extracted folder. Launch hclCheck with the necessary flags (e.g. --hostname, --help, etc.). In my case, I ran this against my home lab; since I am using self-signed certs, I had to add the --noSSLVerify flag. Note that this tool downloads the latest version of the HCL DB and checks against it.
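
For reference, an invocation looks roughly like this (the hostname is made up, and the exact executable name and flag spelling should be confirmed with the tool's --help output):

# Check a host/vCenter against the latest HCL DB, skipping SSL verification (self-signed certs)
.\hclCheck.exe --hostname vcenter.lab.local --noSSLVerify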

hclcheck_cli

After a few seconds, the check completes and a report is created in the current directory.

hclcheck_window

Double-click the file to open the report in your default browser. One important thing to notice here is that the report also includes the PCI IDs for each device. So what? you may ask. Well, these can be used to check against the VCG and get the correct firmware and driver info. If the VCG shows multiple instances of the same controller, SSD, etc., check the PCI IDs to pick the correct entry and get the recommended driver and firmware versions.
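
If you would rather pull those PCI IDs straight from a host, PowerCLI can list them too; a minimal sketch, assuming a PowerCLI version that includes Get-VMHostPciDevice (the host name is made up, and keep in mind the IDs come back in decimal while the VCG lists them in hex):

# List mass storage controllers with their PCI IDs (vendor/device/sub-vendor/sub-device)
Get-VMHost "esxi-01.lab.local" | Get-VMHostPciDevice -DeviceClass MassStorageController |
    Select-Object Name, VendorId, DeviceId, SubVendorId, SubDeviceId

# Convert a decimal ID to the hex form the VCG shows, e.g. 4564 -> 11d4
'{0:x4}' -f 4564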

In this report, you can see that my home lab hardware is not supported for vSAN… it works, but it is not supported.

hclcheck_report

 

Example of multiple entries on the VCG. Notice the different SSIDs (sub-device IDs).

vcg_ssid


HCIBench null Results? But it worked last week?!

hcibench_logo

Last week a new version of HCIBench was released (version 1.5.0.5). If you are not familiar with HCIBench, it is a VMware Fling that gives you a nice web UI for conducting performance testing in vSAN environments. It leverages vdbench to create VMs and stress-test vSAN. The reports it generates, along with real-time views from vSAN Observer, give you a great look at what your vSAN cluster can do.

If you are running version 1.5.0.4, you may now encounter an issue where the test runs, but it doesn’t run as long as you told it to, and it displays zeros for results. You probably ran it prior to Nov. 1st and it was fine, so what gives?

The issue is with the vm-template. HCIBench spins up vdbench VMs during the test; the problem is that the password on those VMs has expired for version 1.5.0.4.

Symptoms:

  • In host-ESXi_IP-vm-deploy.log or vc-VC_IP-vm-deploy.log you will see the error message: “no such mark "~pvscsi"”
  • Or in all-in-one-testing.log, you will see the error message: “Too many authentication failures”
  • Or in the io-test log, you will see the error message: “Net::SCP::Error: SCP did not finish successfully (1)”
  • Your test will finish with all-zero results.

What to do?

You have two choices: you can either upgrade HCIBench to version 1.5.0.5 or replace the vm-template file within HCIBench. Ideally you will upgrade, since there are more fixes in this new release.

Upgrade Path – Download the HCIBench OVA from https://labs.vmware.com/flings/hcibench#summary and deploy it.

Workaround – If you are not willing to upgrade, the vm-template file is provided in the download as vm-template.tar.gz. Download this file, upload it to HCIBench:/root/, and from the HCIBench command line run:

tar -zxvf /root/vm-template.tar.gz ; mv -f vm-template/* /opt/output/vm-template/

hcib_download

Or you can resolve the issue yourself:

1. Deploy the perf-photon-vdbench VM from http://HCIBENCH_IP/vm-template/perf-photon-vdbench.ovf (KEEP THE VM NAME AS perf-photon-vdbench)
2. Log into the perf-photon-vdbench VM using root/vdbench and run "chage -I -1 -m 0 -M 99999 -E -1 root"
3. Shut down the perf-photon-vdbench VM
4. Log into HCIBench and run "rvc 'VC_USERNAME'@VCENTER_IP"
5. In RVC, go to /VC/DATACENTER/ and run "ovf.download /opt/output/vm-template vms/perf-photon-vdbench"; after the download, exit RVC by typing "exit"
6. Run "mv /opt/output/vm-template/perf-photon-vdbench/* /opt/output/vm-template" and "chmod 755 /opt/output/vm-template/*"

 

List of Fixes in version 1.5.0.5

  • Increased the timeout value of client VM disks from 30 seconds to 180 seconds.
  • Disabled client VM password expiration.
  • Disabled client VM OS disk fsck.
  • Set the Observer interval to 60 seconds to shrink the size of the observer data.
  • Fixed the PCPU calculation.
  • Created a link to the /opt/automation/logs directory, so users can review the testing logs at http://HCIBENCH/hcibench_logs/
  • Increased the RAM of HCIBench from 4GB to 8GB to avoid out-of-resource issues.

ESXTOP not displaying properly?

I’ve seen quite a few posts lately about esxtop not displaying properly. Long story short, esxtop does not display the interactive UI and shows CSV output instead.

If your esxtop looks like this, you need to change the terminal declaration to something like xterm. Notice here (red rectangle) how the terminal is set to xterm-256color.

xterm-256color

 

You can change the terminal declaration from the CLI, but this is not persistent across sessions.

To do this, simply type "TERM=xterm".

To display the current terminal declaration, type "echo $TERM".

termxterm

 

 

This will display the esxtop interface properly.

esxtop

If you want this change to persist, just change your favorite terminal client's setting to xterm from its current value. For example, I use my Mac's Terminal to SSH into my lab; the terminal type is set to xterm-256color, which causes the display issue. So I just opened the Terminal preferences and changed the declaration to xterm. By default, PuTTY identifies itself as xterm, so there is no need to change that. If PuTTY is set to something else, you can change the terminal-type string in the Connection>Data section.

term_declaration

 

VSAN 6.2 Performance Degradation (Hybrid)

In vSAN (not misspelled) 6.2, dedup and compression were introduced. These features, however, only apply to all-flash configurations and must not be enabled on hybrid environments.

Some customers have experienced performance degradation on 6.2 hybrid environments when compared to 6.0 or 6.1 performance. Read-caching performance degradation can be observed for hybrid disk groups on the SSD cache tier, due to low-level scanning for unique blocks (dedup). Although this scanning is normal for all-flash environments, it is important to check the hosts participating in a hybrid cluster to make sure it is turned OFF.

To check/change this option, you can use the ESXi Shell or PowerCLI.

The setting will show “2” if it is turned ON, and “0” if it is turned OFF. It should be set to “0” on EACH hybrid host.

Check Setting

ESXi Shell – esxcfg-advcfg -g /LSOM/lsomComponentDedupScanType 

lsom_shell_check


PowerCLI – Get-VMHost <HostName> | Get-AdvancedSetting -Name LSOM.lsomComponentDedupScanType

lsom_pcli_check


Change Setting

ESXi Shell – esxcfg-advcfg -s 0 /LSOM/lsomComponentDedupScanType 

lsom_shell_change


PowerCLI – Get-VMHost <HostName> | Get-AdvancedSetting -Name LSOM.lsomComponentDedupScanType | Set-AdvancedSetting -Value "0"

lsom_pcli_change

 

Using PowerCLI is my preference, since you won’t have to enable SSH on the hosts, and you can use wildcards to check/change all the hosts with little effort.
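
Here is a minimal PowerCLI sketch along those lines, assuming a connected session and a made-up cluster name; it reports the current value for every host in the cluster and then turns the scan off wherever it is still enabled:

# Report the dedup scan setting for every host in the cluster
Get-Cluster "Hybrid-Cluster" | Get-VMHost |
    Get-AdvancedSetting -Name "LSOM.lsomComponentDedupScanType" |
    Select-Object Entity, Name, Value

# Turn it OFF (0) on any host where it is not already 0
Get-Cluster "Hybrid-Cluster" | Get-VMHost |
    Get-AdvancedSetting -Name "LSOM.lsomComponentDedupScanType" |
    Where-Object { $_.Value -ne "0" } |
    Set-AdvancedSetting -Value "0" -Confirm:$false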

VSAN Proactive Rebalance

balance1

There have been a lot of questions about what happens when a rebalance task is triggered in vSAN. By default, vSAN will try to do a proactive rebalance of objects as disks start hitting certain thresholds (80%). There are also instances, during failures/rebuilds, or when organic imbalance is discovered, where administrators may trigger a proactive rebalance task manually.

What happens

Once you click the “balance disks” button, you are opening a 24-hour window during which the rebalance will take place. This means the rebalance operation may take up to 24 hours, so be patient. Many people have voiced frustration because the UI shows 5% progress (or lack thereof) for a very long time, almost appearing as if it is stuck. The rebalance is taking place in the background.

You may also not see any progress at all for the first 30 minutes. This is because vSAN waits to make sure that the imbalance persists before it attempts to move any objects around. After all, the rebalance task moves objects between disks/nodes, and copying data over the network takes resources, bandwidth, and time; plan accordingly if you must rebalance.

Background Tasks:

  • The task is at 1 percent when created.
  • The task moves to 5 percent when the rebalance command is triggered.
  • It then waits for the rebalance to complete before setting the percent done to 100.
    • During the waiting period, it checks whether the rebalance is done (clom-tool command).
    • If not done, it sleeps for 100 seconds and checks again.

By default, when triggered from the VC UI, the task will run for 24 hours or until the rebalance effort is done, whichever comes first.

Notice that if your disks are balanced, the button is greyed out to avoid unnecessary object “shuffling”.

rebalance