VSAN Proactive Rebalance

There have been a lot of questions about what happens when a rebalance task is triggered in VSAN. By default, VSAN will try to proactively rebalance objects as disks start hitting a certain threshold (80% full). There are also instances, during failures/rebuilds or when organic imbalance is discovered, where administrators may trigger a proactive rebalance task manually.

What happens

Once you click the “balance disks” button, you open a 24-hour window in which the rebalance will take place. This means the rebalance operation may take up to 24 hours, so be patient. Many people have voiced frustration because the UI shows 5% progress (or lack thereof) for a very long time, almost appearing as if it is stuck. The rebalance is actually taking place in the background.

You may also not see any progress at all for the first 30 minutes. This is because VSAN waits to make sure the imbalance persists before it attempts to move any objects around. After all, the rebalance task moves objects between disks/nodes, and copying data over the network takes resources, bandwidth, and time; plan accordingly if you must rebalance.

Background Tasks:

  • Task at 1 percent when created.
  • Task at 5 percent when rebalance command is triggered.
  • Then waits for the rebalance to complete before setting the percent done to 100.
    • During the waiting period, it will check to see if rebalance is done (clom-tool command).
    • If not done, it will sleep for 100 seconds and check again if rebalance is done.

By default, when triggered from the vCenter UI, the task will run for 24 hours or until the rebalance effort is done, whichever comes first.
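If you prefer the command line, RVC offers the same controls (VSAN 6.0 and later). A minimal sketch, assuming your cluster lives at the example inventory path /localhost/DC/computers/Cluster (adjust for your environment):

/localhost/DC/computers> vsan.proactive_rebalance --start Cluster
/localhost/DC/computers> vsan.proactive_rebalance_info Cluster

The first command opens the same 24-hour rebalance window as the UI button; the second reports whether a rebalance is in progress and how much data remains to be moved.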

Notice that if your disks are balanced, the button is greyed out to avoid unnecessary object “shuffling”.

[Image: rebalance]


VSAN 6.2 Disk Format Upgrade Fails

I’ve been doing quite a bit of VSAN deployments and upgrades lately. When upgrading up to version 6.1, I luckily did not encounter any issues. Upgrading the VSAN cluster (vSphere) to 6.2 was also very smooth; however, while upgrading the disk format from version 2 or 2.5 to version 3, I encountered a few errors. Here are some of the errors I came across.

The first issue was related to inaccessible objects in VSAN.

Cannot upgrade the cluster. Object(s) xxxxx are inaccessible in Virtual SAN.

[Image: VSAN_Inaccessible_Obj]

This is actually not a new issue. These inaccessible objects are stranded vswap files that need to be removed. In order to correct this issue, you will need to connect to your vCenter using RVC (Ruby vSphere Console). The RVC command to run is: vsan.purge_inaccessible_vswp_objects
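A minimal sketch of the workflow; the vCenter address, credentials, and cluster path below are placeholders for your own:

rvc administrator@vsphere.local@vcenter.example.com
/localhost/DC/computers> vsan.purge_inaccessible_vswp_objects Cluster

The command searches the cluster for inaccessible vswap objects and asks for confirmation before purging them.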

[Image: VSAN_purge_inaccessible]


The second issue I ran into was a failed object realignment. The error:

Failed to realign following Virtual SAN objects…. due to being locked or lack of vmdk descriptor file, which requires manual fix.

VMware has acknowledged the issue and created a Python script to correct it. The script and instructions can be found in KB 2144881.

The script needs to be run from the ESXi host shell with the command below, after you have copied it to a datastore the host has access to. The script name is VsanRealign.py; if you rename the file, you will obviously need to use the correct name instead. NOTE: The script takes quite a while to run, so just let it go until it finishes.

python VsanRealign.py precheck

[Image: VSAN_realign]

Here the script takes care of the descriptor file issue once you answer yes. In this case, the object is not a disk; it is a vswap file missing its descriptor, so it is removed permanently. If the vswap file is actually associated with a VM, the VM will keep working normally (unless you are swapping, in which case you have bigger problems). The vswap file will be recreated once you reboot the VM.

OK, so time to move on. Ready to upgrade… Maybe not. I ran into another issue after running the same script with the precheck option. This time, the issue was related to disks stuck with CBT (Changed Block Tracking) objects. To fix this, simply run the same script with the fixcbt option instead of the precheck option.

python VsanRealign.py fixcbt


[Image: VSAN_fixcbt]

[Image: VSAN_fixcbt2]


So at this point, everything looked healthy and ready to go. However, when I tried the disk format upgrade yet again, it gave me another error, the fourth error during the upgrade process. Luckily, this one was an easy fix and may not apply to all VSAN environments.

I ran into this with two small environments of 3 hosts each. The error stated that I could not upgrade the disk format because there were not enough resources to do so.

A general system error occurred: Failed to evacuate data for disk uuid <XXXX> with error: Out of resources to complete the operation 

To be able to upgrade the disk format to V3, you will need to run the upgrade command from RVC using the option to allow reduced redundancy.

Log in to RVC and run the following command: vsan.ondisk_upgrade --allow-reduced-redundancy
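For example, a sketch assuming the same placeholder cluster path as before:

/localhost/DC/computers> vsan.ondisk_upgrade --allow-reduced-redundancy Cluster

With this option, VSAN converts each disk group without keeping every object at full redundancy during the process, which is exactly what makes the upgrade possible on a small 3-host cluster with nowhere to evacuate a full copy of the data.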

[Image: VSAN_Ondisk_Upgrade]


The upgrade removes the VSAN disk group(s) on each host and re-adds them with the new on-disk format. DO NOT try to do this manually, as you will end up with format mismatches that VSAN cannot function properly under. Please follow the recommended procedures.

These steps allowed me to upgrade the VSAN disk format to V3. It did take quite a while (about 12 hours), but that was partly because I tested all these steps in my lab prior to doing it in production. Yes, the lab had some of the same issues.

After the upgrade was done, I checked the health of the VSAN cluster and noticed a new warning indicating the need for a rebalance. Manually running a rebalance job solved the issue.
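If you want to see the imbalance (and confirm the fix) from RVC, vsan.disks_stats shows per-disk capacity and usage; the cluster path is again a placeholder:

/localhost/DC/computers> vsan.disks_stats Cluster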

All good after that…


SIDE NOTE:

I did the proper troubleshooting to find the root cause. The main issue was a firmware bug that caused the servers to stop recognizing the SD card that vSphere was installed on, and eventually crash. The many crashes across all hosts caused all these object issues within VSAN.

The issue was related to the iLO firmware on the HP DL380 Gen9 servers, which were running iLO version 2.20 at the time. The fix was to upgrade iLO to version 2.40 (December 2015), which was the latest version at the time.

 

VSAN – Part 3

So, I am anxiously waiting for hardware to arrive in order to make my hosts VSAN compatible. I ordered 32GB SD cards to install vSphere on, 1 SSD drive per host, and one 4-port CNA per host for the 10GbE interfaces for VSAN traffic. While I wait for the hardware, I decided to make time now to review my design rather than making time later to do things over again. In search of further knowledge, I tuned in to listen to my compadre Rawlinson Rivera (VMware) speak about VSAN best practices and use cases. I was very happy to learn that I am on the right path; however, I found a couple of areas where I can tweak my design. In Part Deux of the VSAN topic, I spoke about using 6 magnetic disk drives and one SSD drive per host. There is nothing wrong with that design, BUT Rawlinson brought up an excellent point: when thinking about VSAN, think wide for scalability and keep failure domains in mind.

One thing I did not talk about is disk groups. Each host can have up to 5 disk groups, each with up to 7 magnetic drives + 1 SSD. There is no way I can put 40 drives in a 1U HP DL360p, but I can, however, create several disk groups; at least 2 of them. So, I am modifying my plan to have 2 disk groups of 3 magnetic drives + 1 SSD per group, per host. This will allow me to sustain an SSD failure within a host without affecting too many VMs on that host, since the other disk group, with its own SSD, keeps working. If you think about it, with a single disk group, the failure of its one SSD takes all of the group’s drives offline. With 2 disk groups, a failed SSD only affects the drives in its group. So, by breaking my 6 magnetic drives into 2 disk groups, I’m cutting the impact of an SSD failure in half. The only caveat is that I will need to buy an extra SSD drive per host.
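For reference, once the drives are in place, each disk group can be claimed from the ESXi shell with one SSD plus its magnetic disks. A minimal sketch; the naa.* identifiers are placeholders for your actual device names:

esxcli vsan storage add -s naa.5000000000000001 -d naa.5000000000000002 -d naa.5000000000000003 -d naa.5000000000000004

Run it once per disk group, or simply let the vSphere Web Client claim the disks if you prefer the UI.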

On the network side, besides having 10GbE connections, I’ll need to enable multicast on the switches in order for VSAN heartbeats and other VSAN-related communication to work correctly. My plan is to use SFP+ ports on the Cisco 3750-X switches and create a separate VLAN for those ports for VSAN traffic purposes, then disable IGMP snooping on that VLAN to allow multicast packets on all ports in that VLAN. Actually, now that I think about it, I will use Twinax cables instead of SFP+ optics and fiber; the optics are too expensive and I’m trying to keep costs down. Here is an example of disabling IGMP snooping on a specific VLAN (5) rather than globally.

switch# config t
switch(config)# no ip igmp snooping vlan 5

(On the Catalyst 3750-X, per-VLAN IGMP snooping is controlled from global configuration mode, not from VLAN configuration mode.)
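To verify, a quick check with a standard IOS show command:

switch# show ip igmp snooping vlan 5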

It is also recommended to use Network I/O Control (NIOC). Some of you may be saying: well, that requires a VDS and I don’t have that version of vSphere. And I would say: well, luckily for you, VSAN includes the VDS for you to use regardless of the vSphere flavor you are running. Isn’t VSAN awesome?!

Other things to consider: there is no support for IPv6; bummer… not really. NIC teaming is also encouraged, but you should do more research on the different scenarios based on the number of uplinks you will be using. Remember the 7Ps.

I hope by now you are starting to see how powerful VSAN is. This is not VSA on steroids; this is a whole new way of using commodity hardware to achieve performance and redundancy at a relatively low cost.

VSAN – Part Deux

So now it is time to get our hands dirty. When I do a project, I usually live by the 7Ps. What’s that? The 7Ps stand for Proper Prior Planning Prevents “Pitiful” Poor Performance… yes, I cleaned it up a little. But anyway, I believe attention to detail is important during the planning phase of any project. You don’t want to buy servers for VSAN just to find out during installation that the hardware is not a supported configuration.

Since I’ve already started down the hardware path, let’s dig a little deeper. First off, you must be familiar with the requirements for VSAN.

  • At least one flash device (SSD drive) per host
  • At least one hard disk per host
  • A boot device – could be USB, SD card, or hard disk. However, if using a hard disk, you cannot assign that device to the VSAN datastore. I personally prefer a USB or SD card.
    • If using a USB or SD flash card, it needs to be at least 8GB. This installation method is not supported if the host has more than 512GB of memory. More info on USB installation here.
  • Host storage controller must be capable of pass-through mode or JBOD. This means being able to present your drives to ESXi without RAID.
  • At least one dedicated 1GbE NIC for VSAN per host; 10GbE is recommended (see the sketch after this list).
  • At least 3 hosts for the VSAN cluster. Yep, that’s correct, for redundancy purposes.
  • At least 6GB of memory per host.
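As referenced in the NIC requirement above, here is a minimal sketch of tagging a VMkernel interface for VSAN traffic from the host shell; vmk1 is a placeholder for the VMkernel port you created for VSAN:

esxcli vsan network ipv4 add -i vmk1
esxcli vsan network list

The second command confirms the interface is now carrying VSAN traffic.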

Apart from this list, make sure you check the VSAN HCL here.

Software requirements are just as important as hardware requirements for VSAN.

  • vCenter server is required
    • Can be Windows-based or Appliance (VCSA)
    • vCenter Server version should be 5.5 U1 or later
  • ESXi 5.5 U1 is required at a minimum
  • Obviously, you’ll need the proper licenses and support.

Side note: There are many VSAN ready solutions from different vendors, check this list. This will save you some time rather than building your own solution. In my case, I want to re-use the existing hardware where my VSA solution is running on, so I’ll be upgrading/adding some hardware.

Back to the upgrade.

I am running VSA on 2 HP DL360p G8s with 6 hard drives each, so I bought a third server identical to the other two. I’m basically running a VSAN ready solution that I put together based on the VMware compatibility guide. During my planning phase, I realized that I needed to obtain SSD drives. So I just went to the local store and grabbed a few… Just joking. Of course I checked the HCL!!!

After checking the HCL and compatibility matrix, I found a few options, not necessarily the cheapest; in order to keep costs down, I purchased 3 SSD drives (1 per host). I also found out that the USB drives VSA is installed on are smaller than 8GB, so I’m buying an SD flash card for each host to install ESXi on.

So that is pretty much it for my hardware planning. I checked the storage adapter, and the DL360p G8s I have came with the Smart Array P420i controller, so I’m good there. Oh wait, I only have 1GbE NICs and those will be used for production traffic. Looks like I’ll be buying NICs as well; luckily those are cheap. That should be it for hardware, and I have all the licenses and support needed.

I’m going to set all the hardware up and will be right back with Part 3.

VSA – EOL (VSAN Part 1)

Goodbye, VSA

As some of you may know, VMware VSA went end of life in April 2014. This applies to all flavors of VMware’s vSphere Storage Appliance. This does not mean that you have to stop using VSA… at least not immediately.

As long as you have an active support contract for VMware’s VSA, you are still entitled to contact support regarding any issues. However, you can no longer purchase VSA. So what exactly does this mean for those of us who have VSA in our environments? Now what?

While there is no replacement per se, there is, however, a better alternative. Yes, I said better. VMware engineers took the concept of VSA and created a much more robust solution. The new and improved solution is called VSAN, or Virtual SAN. I am not saying by any means that VSAN is built on VSA code; I am saying that they share common use cases and solutions. Last I heard, VMware was coming up with a SKU for a VSA-to-VSAN upgrade (I’m checking my sources).

VSAN is VMware’s software-defined storage solution that allows for the use of local storage and leverages it not only for capacity, but also for performance gains. VSAN caches reads and writes by utilizing server-side flash. There are tons of documents about VSAN and its use cases. I have included a few links from other blogs about VSAN.

So, while there is no direct/in-place upgrade available, there is a way to take your existing hardware, as long as it meets the VSAN hardware requirements, and transform it into your new VSAN solution. I’m in the process of doing this myself right now, so I will post the steps as I go through the process.


In the meantime, you may want to check these links out:


To be continued…