vSAN Stats Object Out of Date

Several people have asked me this question, so I figured I’d write a quick post about it.

When the default vSAN policy is changed, people notice that the Stats Object (Health) shows as “Out of Date”, even though the policy was applied at the end of the wizard.

A few things to keep in mind:

  • The Stats Object is exactly that, an object, just like a VM home folder or a VMDK.
    • That object is associated with a policy, usually the default vSAN policy.
  • If you change a policy, you can apply it immediately through the wizard.
    • However, this applies the policy only to the VMs (objects).
    • The Stats Object is not part of any VM.
  • If you change the policy that the Stats Object is using or sharing with VMs, you will need to manually re-apply that policy to the Stats Object.

Scenario

  1. Policy change (Default in this case)
  2. Reapply Policy to VMs now
  3. Stats Object shows “Out of Date”
  4. Edit the Storage Policy under Health and Performance and click OK
  5. This will bring the Object back into compliance

pol_apply_now

out-of-date

stats-compliant

Quick Video about it

vSAN VCG Checks

One of the most important aspects of any storage solution is utilizing hardware to its advantage. Many storage vendors have taken advantage of faster drives and other technologies to create fast storage solutions, and vSAN is no different. We will discuss why it is so important for vSAN to have compatible/supported hardware, and how to check this.

One of the main requirements for VMware’s HCI solution is for hardware to be on its Hardware Compatibility List (HCL), also known as VMware Compatibility Guide (VCG). This compatibility guide will allow you to check existing hardware and/or hardware that you plan to purchase for vSAN. You can also check vSAN ready nodes against this guide.

Before you deploy vSAN, all hardware must have passed the compatibility test. This ensures the best possible performance and reduces potential hardware-related issues. Hardware compatibility with vSAN includes, but is not limited to, hard drives (MD), flash devices, storage controllers, etc. It is not only necessary for the hardware to be on the compatibility list; it must also have the appropriate firmware and driver versions for the specific version of ESXi.

How to check hardware against VCG

You can check hardware, firmware, and driver versions by going to VMware’s VCG website here

You can also check compatibility of vSAN ready nodes at this site.

Once vSAN has been deployed, it will check your hardware compatibility against the downloaded VCG version. You can also update the local VCG copy from the Web UI. To make sure the HCL DB is up to date on your cluster, go to Cluster>Manage>Settings>Health and Performance in the Web UI. You can update the list by clicking “Get latest version online”.

hcl_download

If your vCenter does not have access to the internet, you can download/upload the file manually, as follows (see the sketch after the list):

  • Log in to a workstation with access to the internet
  • Go to https://partnerweb.vmware.com/service/vsan/all.json
  • Save the all.json file
  • From the same workstation connect to your vCenter, or you can copy the file to another workstation/server with access to vCenter
  • From Cluster>Manage>Settings>Health and Performance on the Web UI, select “update from file” and select the all.json file you downloaded
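
If you want to script the download step, here is a minimal sketch, assuming a workstation with curl and Python installed (the timestamp field is an assumption about the all.json layout; adjust if your copy differs):

# Grab the latest vSAN HCL DB (same URL as in the steps above)
curl -sL -o all.json https://partnerweb.vmware.com/service/vsan/all.json
# Sanity-check that the file parses as JSON before uploading it to vCenter
python -c "import json; print(json.load(open('all.json')).get('timestamp'))"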

If your hardware/firmware/drivers are not compatible with the VCG, you will get a warning/error.

hcl_warning

There is a “fling” tool that will also accomplish this but, in addition, it will provide more information as to why there is a warning or error. The tool is called “vsan hardware compatibility list checker”, very clever name, right?! It is an executable that runs from a Windows command prompt and produces a nice HTML report. You can download the tool from here

Once downloaded, extract it on a Windows system, open a command prompt, and navigate to the folder location. Launch hclCheck with the necessary flags (e.g. --hostname, --help, etc.). In my case, I did this on my home lab; since I am using self-signed certs, I had to use the --noSSLVerify flag. Notice that this tool will download the latest version of the HCL DB and check against it.
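
For reference, here is what a run could look like; the hostname is made up, so substitute your own host or vCenter:

hclCheck --hostname 192.168.1.50 --noSSLVerify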

hclcheck_cli

After a few seconds, the check completes and a report is created in the current directory.

hclcheck_window

Double-click on the file to open the report in your default browser. One important piece to notice here is that the report also includes the PCI IDs for each device. So what? you may ask. Well, these can be used to check against the VCG and get the correct firmware and driver info. If the VCG shows multiple instances of the same controller, SSD, etc., check the PCI IDs to pick the correct one and get the recommended driver and firmware version.
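
If you’d rather pull the PCI IDs straight from a host, you can list them from the ESXi shell. A quick sketch (the grep is just to narrow the output to storage adapters):

# List PCI devices; the columns include VID:DID and SVID:SSID pairs
vmkchdev -l | grep vmhba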

In this report, you can see that my home lab hardware is not supported for vSAN… it works, but it’s not supported.

hclcheck_report

Example of multiple entries on the VCG. Notice the different SSIDs (Sub-device IDs).

vcg_ssid

HCIBench null Results? But it worked last week?!

Last week a new version of HCIBench was released (version 1.5.0.5). If you are not familiar with HCIBench, this is a VMware Fling that gives you a nice web UI to conduct performance testing for vSAN environments. It leverages vdbench to create VMs and stress test vSAN. The reports it generates, in addition to real-time views from vSAN Observer, can give you a great look at what your vSAN cluster can do.

If you are running version 1.5.0.4, you may now encounter an issue where the test runs, but it doesn’t run as long as you told it to, and it displays zeros for results. You probably ran it prior to Nov. 1st and it was fine, so what gives?

The issue is with the vm-template. HCIBench spins up vdbench VMs during the test. The problem is, the root password on those VMs has expired for version 1.5.0.4.

Symptoms:
==========================================
In host-ESXi_IP-vm-deploy.log or vc-VC_IP-vm-deploy.log you will see the error message: “no such mark “~pvscsi””
Or in all-in-one-testing.log, you will see the error message: “Too many authentication failures”
Or in the io-test log you will see the error message: “Net::SCP::Error: SCP did not finish successfully (1)”
Your test will finish with all-zero results.
==========================================
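
A quick way to check for the authentication symptom, assuming you are logged into the HCIBench appliance and the logs live under /opt/automation/logs (the linked log directory mentioned in the fix list below):

# Look for the SSH failure in the test log
grep -i "authentication failures" /opt/automation/logs/all-in-one-testing.log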

What to do?

You have 2 choices: you can either upgrade HCIBench to version 1.5.0.5 or replace the vm-template file within HCIBench. Ideally, you will upgrade, since there are more fixes in the new release.

Upgrade Path – Download the HCIBench ova from https://labs.vmware.com/flings/hcibench#summary and deploy it.

Workaround – If you are not willing to upgrade, the vm-template file is provided in the download as vm-template.tar.gz. Download this file, upload it to HCIBench:/root/, and from the HCIBench command line run:

tar -zxvf /root/vm-template.tar.gz ; mv -f vm-template/* /opt/output/vm-template/

hcib_download

OR you can resolve the issue yourself:

1. Deploy the perf-photon-vdbench VM from http://HCIBENCH_IP/vm-template/perf-photon-vdbench.ovf (KEEP THE VM NAME AS perf-photon-vdbench)
2. Log into the perf-photon-vdbench VM using root/vdbench and run "chage -I -1 -m 0 -M 99999 -E -1 root"
3. Shut down the perf-photon-vdbench VM
4. Log into HCIBench and run "rvc 'VC_USERNAME'@VCENTER_IP"
5. In RVC, go to /VC/DATACENTER/ and run "ovf.download /opt/output/vm-template vms/perf-photon-vdbench"; after downloading, exit RVC by typing "exit"
6. Run "mv /opt/output/vm-template/perf-photon-vdbench/* /opt/output/vm-template" and "chmod 755 /opt/output/vm-template/*"
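
Before shutting the template VM down in step 3, you can verify that the aging change from step 2 took effect; this is a standard shadow-utils check, nothing HCIBench-specific:

# List root's aging policy; "Account expires" and "Password inactive"
# should both read "never" after the chage command above
chage -l root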

List of Fixes on version 1.5.0.5

  • Increased Timeout value of client VM disk from 30 seconds to 180 seconds.
  • Disabled client VM password expiration.
  • Disabled client VM OS disk fsck.
  • Set Observer interval to 60 seconds to shrink the size of observer data.
  • Fixed PCPU calculation.
  • Created a link directory for /opt/automation/logs; users will be able to review the testing logs at http://HCIBENCH/hcibench_logs/
  • Increased the RAM of HCIBench from 4GB to 8GB to avoid out-of-resource issues.

VSAN Proactive Rebalance

There have been a lot of questions as to what happens when a rebalance task is triggered in VSAN. By default, VSAN will try to do a proactive rebalance of objects as disks start hitting certain thresholds (80%). There are instances, during failures/rebuilds, or even when organic imbalance is discovered, where administrators may trigger a proactive rebalance task themselves.

What happens

Once you click the “balance disks” button, you open a 24-hour window during which the rebalance takes place. This means the rebalance operation may take up to 24 hours, so be patient. Many people have voiced frustration because the UI shows 5% progress (or lack thereof) for a very long time, almost appearing as if it is stuck. The rebalance is taking place in the background.

You may also not see any progress at all for the first 30 minutes. This is because VSAN waits to make sure that the imbalance persists before it attempts to move any objects around. After all, the rebalance task moves objects between disks/nodes, and copying data over the network takes resources, bandwidth, and time; plan accordingly if you must rebalance.

Background Tasks:

  • Task is at 1 percent when created.
  • Task is at 5 percent when the rebalance command is triggered.
  • It then waits for the rebalance to complete before setting the percent done to 100.
    • During the waiting period, it checks whether the rebalance is done (clom-tool command).
    • If not done, it sleeps for 100 seconds and checks again.

By default, when triggered from the VC UI, the task will run for 24 hours or until the rebalance effort is done, whichever comes first.
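
You can also trigger and watch the same operation from RVC. A minimal sketch, with a made-up cluster path; run these inside an RVC session:

# Start a proactive rebalance (runs within a time window, 24 hours by default)
vsan.proactive_rebalance --start /localhost/DC/computers/vsan-cluster
# Check whether a rebalance is currently in progress
vsan.proactive_rebalance_info /localhost/DC/computers/vsan-cluster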

Notice that if your disks are balanced, the button is greyed out to avoid unnecessary object “shuffling”.

rebalance

VSAN 6.2 Disk Format Upgrade Fails

I’ve been doing quite a bit of VSAN deployments and upgrades lately. When upgrading up to version 6.1, I did not encounter any issues, luckily. Upgrading a VSAN cluster (vSphere) to 6.2 was also very smooth; however, while upgrading the disk format from version 2 or 2.5 to version 3, I encountered a few errors. Here are some of the errors I came across.

The first issue was related to inaccessible objects in VSAN.

Cannot upgrade the cluster. Object(s) xxxxx are inaccessible in Virtual SAN.

VSAN_Inaccessible_Obj

This is actually not a new issue. These inaccessible objects are stranded vswap files that need to be removed. In order to correct this issue, you will need to connect to your vCenter using RVC tools. The RVC command to run is: vsan.purge_inaccessible_vswp_objects
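
For reference, a minimal invocation inside an RVC session, with a made-up cluster path:

# Purge the stranded vswap objects for the whole cluster
vsan.purge_inaccessible_vswp_objects /localhost/DC/computers/vsan-cluster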

VSAN_purge_inaccessible

The second issue I ran into was related to failed object realignment. Error:

Failed to realign following Virtual SAN objects…. due to being locked or lack of vmdk descriptor file, which requires manual fix.

VMware has acknowledged the issue and has created a python script to correct it. The script and instructions can be found in KB2144881

The script needs to be run from the ESXi host shell with the command below, after you have copied the script to a datastore that the host has access to. The script name is VsanRealign.py; if you rename the file, you will obviously need to use the correct name instead. NOTE: The script takes quite a while to run, so just let it go until it finishes.

python VsanRealign.py precheck

VSAN_realign

Here the script takes care of the descriptor file issue once you answer yes. In this case, the object is not a disk and is missing a descriptor file, so it is removed permanently, since it is a vswap file. If the vswap file is actually associated with a VM, the VM will keep working normally (unless you are swapping, in which case you have bigger problems). The vswap file will be recreated once you reboot the VM.

Ok, so time to move on. Ready to upgrade… Maybe not. I ran into another issue after running the same script with the precheck option. In this case, the issue was related to disks stuck with CBT (Change Block Tracking) objects. To fix this, simply run the same script with the fixcbt option instead of the precheck option.

python VsanRealign.py fixcbt


VSAN_fixcbt

VSAN_fixcbt2

So at this point, everything looked healthy and ready to go. However, when I tried to do the disk format upgrade yet again, it gave me another error. This was the fourth error during the upgrade process; luckily, it was an easy fix and may not apply to all VSAN environments.

I ran into this with 2 small environments of 3 hosts each. The error stated that I could not upgrade the disk format because there were not enough resources to do so.

A general system error occurred: Failed to evacuate data for disk uuid <XXXX> with error: Out of resources to complete the operation 

To be able to upgrade the disk format to V3, you will need to run the upgrade command from RVC using the option to allow reduced redundancy.

Log in to RVC and run the following command: vsan.ondisk_upgrade --allow-reduced-redundancy
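
The command takes the cluster as an argument; a sketch with a made-up path:

# Upgrade the on-disk format, letting VSAN rebuild with reduced redundancy
# while each disk group is evacuated and re-created
vsan.ondisk_upgrade --allow-reduced-redundancy /localhost/DC/computers/vsan-cluster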

VSAN_Ondisk_Upgrade

The upgrade removes the VSAN disk group(s) from each host and re-adds them in the new format. DO NOT try to do this manually, as you will end up with mismatches that prevent VSAN from functioning properly. Please follow the recommended procedures.

These steps allowed me to upgrade the VSAN disk format to V3. It did take quite a while (12 hours), but that was due to me testing all these steps in my lab prior to doing it in production. Yes, the lab had some of the same issues.

After the upgrade was done, I checked the health of the VSAN cluster and noticed a new warning indicating the need for a rebalance. Manually running a rebalance job solved the issue.

All good after that…

SIDE NOTE:

I did the proper troubleshooting to find the root cause. The main issue was a firmware bug that caused the servers to stop recognizing the SD card where vSphere was installed, and eventually crash. The many crashes across all hosts caused all these object issues within VSAN.

The issue was related to the iLO version on HP DL380 G9 servers, which were running iLO version 2.20 at the time. The fix was to upgrade the iLO to version 2.40 (December 2015), which was the latest version at the time.