VSAN 6.2 Disk Format Upgrade Fails

I’ve been doing quite a few VSAN deployments and upgrades lately. Upgrades up to version 6.1 went through without any issues, luckily. Upgrading the VSAN cluster (vSphere) to 6.2 was also very smooth; however, while upgrading the disk format from version 2 or 2.5 to version 3, I encountered a few errors. Here are some of the errors I came across.

The first issue was related to inaccessible objects in VSAN.

Cannot upgrade the cluster. Object(s) xxxxx are inaccessible in Virtual SAN.

This is actually not a new issue. These inaccessible objects are stranded vswap files that need to be removed. To correct this, connect to your vCenter using RVC and run the following command: vsan.purge_inaccessible_vswp_objects

The second issue I ran into was related to failed object realignment. Error:

Failed to realign following Virtual SAN objects…. due to being locked or lack of vmdk descriptor file, which requires manual fix.

VMware has acknowledged the issue and has created a Python script to correct it. The script and instructions can be found in KB2144881.

The script needs to be run from the ESXi host shell with the command below, after you have copied the script to a datastore that the host has access to. The script name is VsanRealign.py, but if you rename the file, you will obviously need to use the correct name instead. NOTE: The script takes quite a while to run, so just let it go until it finishes.

python VsanRealign.py precheck

Here the script takes care of the descriptor file issue once you answer yes. In this case, the object that is not a disk and is missing a descriptor file is removed permanently, since it is a vswap file. If the vswap file is actually associated with a VM, the VM will keep working normally (unless you are swapping, in which case you have bigger problems). The vswap file will be recreated once you reboot the VM.

Ok, so time to move on. Ready to upgrade… Maybe not. I ran into another issue after running the same script with the precheck option. In this case, the issue was related to disks stuck with CBT (Changed Block Tracking) objects. To fix this, simply run the same script but use the fixcbt option instead of the precheck option.

python VsanRealign.py fixcbt
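
If you would rather script the two passes than type them, a minimal Python sketch could look like the one below. It assumes the script is still named VsanRealign.py and sits in the current directory. The helper names realign_command and run_realign are my own and are not part of the KB script, and only the two options mentioned in this post are covered.

```python
import subprocess

# Options described in this post; the KB script may accept others.
KNOWN_MODES = ("precheck", "fixcbt")


def realign_command(mode, script="VsanRealign.py"):
    """Build the argument list for one run of the realign script."""
    if mode not in KNOWN_MODES:
        raise ValueError("unsupported mode: %s" % mode)
    return ["python", script, mode]


def run_realign(mode, script="VsanRealign.py"):
    """Run the script in the foreground so you can answer its prompts."""
    return subprocess.call(realign_command(mode, script))


if __name__ == "__main__":
    # Only print the commands here; call run_realign() instead on a host
    # where the script is actually present.
    for mode in KNOWN_MODES:
        print(" ".join(realign_command(mode)))
```

Passing a different file name as the second argument covers the case where you renamed the script.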


So at this point, everything looked healthy and ready to go. However, when I tried the disk format upgrade yet again, it gave me another error. This was the fourth error during the upgrade process. Luckily, it was an easy fix, and it may not apply to all VSAN environments.

I ran into this with 2 small environments of 3 hosts each. The error stated that I could not upgrade the disk format because there were not enough resources to do so.

A general system error occurred: Failed to evacuate data for disk uuid <XXXX> with error: Out of resources to complete the operation 

To be able to upgrade the disk format to V3, you will need to run the upgrade command from RVC using the option to allow reduced redundancy.

Log in to RVC and run the following command: vsan.ondisk_upgrade --allow-reduced-redundancy

The upgrade removes the VSAN disk group(s) from each host and re-adds them in the new format. DO NOT try to do this manually, as you will end up with mismatches that VSAN can’t function properly under. Please follow the recommended procedures.

These steps allowed me to upgrade the VSAN disk format to V3. It did take quite a while (12 hours), but this was due to me testing all these steps in my lab prior to doing it in production. Yes, the lab had some of the same issues.
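
To recap the errors and fixes above in one place, here is a small Python sketch that maps a fragment of each error message quoted in this post to the fix that worked for me. The function name suggest_fix and the table layout are my own invention, not anything VMware ships. The CBT issue is left out because I did not quote its exact error text.

```python
# Each entry pairs a fragment of an error message quoted in this post
# with the fix that worked in my environments.
KNOWN_FIXES = [
    ("are inaccessible in Virtual SAN",
     "RVC: vsan.purge_inaccessible_vswp_objects"),
    ("Failed to realign following Virtual SAN objects",
     "ESXi shell: python VsanRealign.py precheck"),
    ("Out of resources to complete the operation",
     "RVC: vsan.ondisk_upgrade --allow-reduced-redundancy"),
]


def suggest_fix(error_message):
    """Return the fix for a known error, or None if it is not listed."""
    lowered = error_message.lower()
    for fragment, fix in KNOWN_FIXES:
        if fragment.lower() in lowered:
            return fix
    return None


if __name__ == "__main__":
    sample = ("Cannot upgrade the cluster. Object(s) xxxxx are "
              "inaccessible in Virtual SAN.")
    print(suggest_fix(sample))
```

Treat the result as a pointer back to the relevant section, not as a replacement for reading the KB article.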

After the upgrade was done, I checked the health of the VSAN cluster and noticed a new warning indicating the need for a rebalance. Manually running a rebalance job resolved it.

All good after that…

 

SIDE NOTE:

I did the proper troubleshooting to find the root cause. The main issue was a firmware bug that caused the servers to stop recognizing the SD card on which vSphere was installed, and eventually crash. The many crashes across all hosts caused all these object issues within VSAN.

The bug was related to the iLO version on the HP380 G9 servers, which were running iLO version 2.20 at the time. The fix was to upgrade the iLO to version 2.40 (December 2015), which was the latest version.
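
If you have a list of iLO versions for your hosts, a quick Python sketch like this can flag the ones still below 2.40. It assumes plain dotted version strings such as "2.20" and treats anything older than 2.40 as needing the upgrade, which is my assumption since I only saw the bug on 2.20. The function names are made up for this example.

```python
def parse_version(text):
    """Turn a dotted version such as '2.20' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_ilo_upgrade(current, fixed="2.40"):
    """True if the installed firmware is older than the fixed release."""
    return parse_version(current) < parse_version(fixed)


if __name__ == "__main__":
    # Hypothetical host names and versions, for illustration only.
    hosts = {"esx01": "2.20", "esx02": "2.40"}
    for name, version in sorted(hosts.items()):
        print(name, version, "upgrade" if needs_ilo_upgrade(version) else "ok")
```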

 

VMWORLD 2015 PREVIEW

It’s hard to believe it is conference season again. VMworld 2015 is only a month away (or less), and while VMware folks are gearing up to deliver the goods, you should also be getting ready in order to maximize your investment. Conferences are not just about the monetary investment, but also the personal time that we give up to travel across the country to acquire new knowledge.

This year will be my 6th or 7th VMworld (I lost count), so if VMworld 2015 will be your first one, you are in for a treat. I personally like the venue in San Francisco. Not only is it familiar for returning attendees, it also lets you explore a great city. Did I mention they have really good breweries close by?!?!

In the United States, the venue will be the same as before: the Moscone Center in San Francisco. If you are registered, make sure you also register for the individual sessions. If you do not, you will have to wait outside and see if there is room available after the registered attendees have been seated. Partner and TAM day are held on Sunday, so there is already a good number of people at the conference center during that weekend. There is also a 5K fun run for charity on Sunday morning. I participated last year and it was quite fun. More info here.

VMworld 5k Fun Run

Last year vBrownBag hosted “Opening Acts”, which were panel-based discussions on many different topics. It is definitely worth your time. Follow vBrownBag on Twitter and visit Opening Acts 2015.

During the conference, there is a lot going on. General sessions, deep dives, and vendor-based sessions are just some of the many ways to acquire knowledge and professional connections. Make sure you network with fellow geeks. You never know who you are going to meet, and oftentimes it benefits you in different ways. Talk to the vendors. Yes, don’t just collect swag, go talk to them. Many times I’ve been looking for a solution to a business requirement and my Google search did not help, and then I found out that one of the vendors had developed exactly what I was looking for.

I hope to see many of you in San Francisco very soon…

 

 

Recover your VMs in a Snap with Veeam, NetApp & VMware

When it comes to business continuity and disaster recovery, there are a plethora of options out there, from built-in vendor tools and third-party tools to your typical backup/recovery strategy. Picking a tool does not always get you the best solution, since no one person knows all the tools available and no one has time to research them all. A lot of times tools are chosen based on the vendor’s size, reputation and word of mouth and, of course, a lot of marketing, which I am not a fan of. Every vendor will say that their solution is the best out there, so don’t rely on marketing material; instead, talk to colleagues and other IT individuals through social media, conferences, etc.

Recently I was tasked with providing a BC solution for a specific application. The business was overwhelmed when I presented six different approaches, since they really didn’t know what they wanted. After some meetings, I was able to extract what they wanted to achieve, which was a cheap/free solution utilizing current infrastructure that provides granular, fast recovery at both the VM and file level. This sounded like a challenge, but luckily I had just attended a Veeam session at Cisco Live 2015.

The solution selected was Veeam Backup and Replication. This tool allowed for full visibility of existing NetApp volume snapshots of the VMware environment without having to run new backups/snapshots or any additional jobs. Veeam B&R includes an Explorer tool that connects to both the VMware vCenter and the NetApp arrays. It is then capable of looking inside each volume snapshot and presenting the actual VM instead of all the individual files. At this point, you have the option to restore the entire VM or even individual files within that object. The great part about this is that if you have a snapshot of an Exchange server, you are actually able to restore files within that Exchange snap.

Veeam also supports other storage vendors such as HP LeftHand and 3PAR, and new storage vendors will soon be added to this product.

Give it a try. At some point this may be the right solution to a problem or business requirement.

NetApp EVO:RAIL

For those not familiar with EVO:RAIL, this is a great solution from VMware that offers a hyperconverged infrastructure with easy management by leveraging software such as vSphere, vCenter, VSAN and Log Insight. It seems to have the attention of many customers, as deployment and administration are greatly simplified and the environment does not require a high-level engineer to maintain.

Although the announcement that NetApp would be launching a hyperconverged EVO:RAIL solution was made late last year, the product has not yet been released to the public (as far as I know). So there are a lot of questions out there. Is it FAS? Is it EVO:RAIL? Or a combination? Well, the answer is both and more.

This is a NetApp-integrated EVO:RAIL solution that includes both EVO and the NetApp C-DOT we currently know. This offering allows lower-level admins to administer VMware and NetApp from the same console via VSC. More on VSC, VASA and VAAI here. So in essence, when you get the NetApp 4RU appliance, you use a simple GUI wizard that automatically configures NetApp C-DOT and presents the storage to VMware. So this solution not only virtualizes the compute side but also the storage side.

Because NetApp is integrated into this solution, you are still able to use different protocols as well as SAN and NAS offerings, just like we do now with other FAS systems. This solution also includes automated backup and recovery features, QoS, and Cloud integration by leveraging NetApp Data Fabric.

 

NetApp EVO:RAIL

So, why did NetApp decide to jump on the EVO:RAIL bandwagon?

Well, I believe that NetApp recognizes the competition from newer storage vendors such as SolidFire, Tintri, and SimpliVity, among others, that offer all-flash, high-performance, easy-to-use hyperconverged solutions. NetApp also recognizes the need for a solution for small and medium-sized businesses that do not have the luxury of hiring several IT staff to manage different areas of IT. Lastly, I believe NetApp is recognizing that in order to survive, they need to diversify, as the days of shared enterprise storage may be coming to an end with the introduction of new technologies that drive costs down and simplify administration while reducing overhead.

 

Use Cases:

NetApp seems to be targeting departments and business areas with this specific solution. In my opinion, EVO:RAIL (not just NetApp’s) has many other use cases such as VDI deployments, production loads for remote offices in different geo-locations, test/dev, as well as DR source/target when combined with a Cloud offering.

I’m curious to see what the final product will be like, and how it will stand against other EVO:RAIL offerings.

 

vSphere 6 Availability Enhancements

With the introduction of vSphere 6, many new enhancements have been introduced. Given that IT is primarily delivered as a service within a business, the availability of our environment is often high priority. This new version of vSphere introduces the following enhancements:

  • Better vMotion Capabilities
  • Multi-Processor Fault Tolerance (FT) (up to 4 vCPUs)
  • App HA now supports more applications
  • vSphere Replication has better RPO (15 minutes) and scalability (2000 VMs)

There are other availability enhancements in vSphere 6, but the list above really caught my attention, specifically the vMotion capabilities. In previous versions, moving VMs between vCenters was a little cumbersome and required a lot of manual intervention such as scripts or even downtime. This is now possible in vSphere 6, where VMs can be moved not only across vCenters and datacenters, but also across long distances (up to 150 ms round-trip time). It is also now possible to perform vMotion across virtual switches. However, it is important to understand that the vCenters have to be part of the same SSO domain for this to work.

What does all this mean to me? Well, in my opinion, these enhancements can be extremely handy for disaster prevention exercises. Take a scenario where there is advance notice of a hurricane or flood. Let’s assume that a stretched VLAN or VXLAN has been configured across 2 data centers with a reasonable RTT (about 100 ms or less). In this case, the option exists to move some powered-on VMs to another vCenter within the same subnet in order to prevent downtime for the business. Of course, this can also be accomplished with SRM if it is already implemented.
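
The two conditions in that scenario can be written down as a quick pre-check. This is only a sketch in Python: the function name is hypothetical, the 100 ms default simply mirrors the RTT I used in the scenario above, and a real migration has other requirements that are not modeled here.

```python
def can_cross_vcenter_vmotion(rtt_ms, same_sso_domain, max_rtt_ms=100.0):
    """Rough go/no-go check for moving a powered-on VM between vCenters.

    rtt_ms          -- measured round-trip time between the two sites
    same_sso_domain -- whether both vCenters belong to the same SSO domain
    max_rtt_ms      -- RTT budget; 100 ms is this post's scenario value
    """
    return bool(same_sso_domain) and rtt_ms <= max_rtt_ms


if __name__ == "__main__":
    print(can_cross_vcenter_vmotion(80, True))    # within budget, same domain
    print(can_cross_vcenter_vmotion(80, False))   # vCenters in different domains
    print(can_cross_vcenter_vmotion(120, True))   # RTT over the budget
```

Raise max_rtt_ms if your environment is validated for a larger round-trip time.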

These enhancements, as well as the ones in the network, management, and storage realms, make vSphere 6 impossible to ignore and set VMware apart from its competitors.