I’ve been doing quite a bit of VSAN deployments and upgrades lately. When upgrading up to version 6.1, I did not encounter any issues, luckily. Upgrading VSAN cluster (vSphere) to 6.2 also very smooth; however, upgrading the disk format from version 2 or 2.5 to version 3 has been a daunting task so far. I have an 80% failure rate on this type of upgrade. Here are some of the errors I came across.
The first issue was related to inaccessible objects in VSAN.
Cannot upgrade the cluster. Object(s) xxxxx are inaccessible in Virtual SAN.
This is actually not a new issue. These inaccessible objects are stranded vswap files that need to be removed. In order to correct this issue, you will need to connect to your vCenter using RVC tools. The RVC command to run is: vsan.purge_inaccessible_vswp_objects
The second issue I ran into was related to failed object realignment. Error:
Failed to realign following Virtual SAN objects…. due to being locked or lack of vmdk descriptor file, which requires manual fix.
VMware has acknowledged the issue and has created a python script to correct this issue. The script and instructions can be found on KB2144881
The script needs to be run from the ESXi host shell with the command below, after you have copied the script to a datastore that the host has access to. The script name is VsanRealign.py, but if you rename the file, you will obviously need to use the correct name instead. NOTE: The script takes quite a while to run, so just let it go until it finishes.
python VsanRealign.py precheck
Here the script takes care of the descriptor file issue once you answer yes. In this case, the object is not a disk and is missing a descriptor file is removed permanently, since it is a vswap file. If the vswap file is actually associated to a vm, the vm will keep working normally (unless you are swapping, which then you have bigger problems). The vswap file will be recreated once you reboot the vm.
Ok, so time to move. Ready to upgrade…. Maybe not. Ran into another issue after running the same script with precheck option. In this case, the issue was related to disks stuck with CBT (Change Block Tracking) objects. To fix this, simply run the same script but use the fixcbt option instead of the precheck option.
python VsanRealign.py fixcbt
So at this point, everything looked healthy and ready to go. However; when I tried to do the disk format upgrade yet again, it gave me another error. So this was the fourth error during the upgrade process, luckily this was an easy fix and may not apply to all VSAN environments.
I ran into this with 2 small environments of 3 hosts each. The error stated that I could not upgrade the disk format because there were not enough resources to do so.
A general system error occurred: Failed to evacuate data for disk uuid <XXXX> with error: Out of resources to complete the operation
To be able to upgrade the disk format to V3, you will need to run the upgrade command from RVC using the option to allow reduced redundancy.
Log in to RVC and run the following command: vsan.ondisk_upgrade –allow-reduced-redundancy
Each host removes the VSAN disk group(s) from each host and re-adds them on the new format. DO NOT try to do this manually as you will have mismatches that VSAN can’t function properly under. Please follow the recommended procedures.
These steps allowed to upgrade VSAN disk format to V3. It did take quite a while to do this (12 hours), but this was due to me testing all these steps on my lab prior to doing it in production. Yes, the lab had some of the same issues.
After the upgrade was done, I checked the health of the VSAN cluster and noticed a new warning. This warning indicated the need to do a rebalance. So manually running a rebalance job solves the issue.
All good after that…
I did the proper troubleshooting to find out the root cause. The main issue was related to a firmware bug that was causing the servers to not recognize the SD card where vSphere was installed on, and eventually crash. The many crashes across all hosts caused all these object issues within VSAN.
Such issue was related to the iLO version of HP380 G9 servers, running iLO version 2.20 at the time. The fix was to upgrade the iLO to version 2.40 (December 2015) which was the latest version.