Troubleshooting vSphere PSOD

The Screen of Death, as most of us know it, is the result of a system crash. Windows has its famous Blue Screen of Death (BSOD), and VMware has a Purple Screen of Death (PSOD). There is also a Black Screen of Death, which usually appears when a Windows system is missing a boot file or one or more of those files have become corrupted. Whatever the color, the questions for many are the same: How do I fix this? How do I know what caused it?

Many admins start with the obvious and simply reboot the machine, hoping it was a hiccup, but chances are there is a bigger problem that needs to be addressed. In VMware, just like in other systems, a core dump file is created when the stop error is generated. This is where you start digging…

Where is my DUMP…file?!?

During the purple screen, the host writes the dump file to a previously created partition called VMKcore. There is a chance that the core dump file won’t be written due to internal problems, so it is always a good idea to take a screenshot of the PSOD. The core dump file can be exported via the CLI, manually from the vCenter path (on both Windows and the appliance), or from the vSphere Client and Web Client, which is the method most admins prefer because it is so simple.

To export the logs from vSphere Web Client, use the following steps:

  • Open vSphere Web Client > Hosts & Clusters > Right-click on vCenter > Export System Logs…

  • Choose the host that had the PSOD > Next

  • Make sure you select CrashDumps; all others are optional

Once you have the dump file (vmkernel-zdump….), it’s time to look for the needle in the haystack. There are a lot of entries, and this file can be overwhelming, but don’t stress: the crash is quite simple to find. The first logical step is to find the crash entry point. You can use the time when you noticed the PSOD, or you can simply search within the log file for “@bluescreen”.
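If you would rather not scroll through the file by hand, the search can be scripted. The following is a minimal sketch, assuming the log contents of the dump are already available as a plain-text file on your workstation; the function name and the number of context lines are illustrative choices, not anything VMware ships. The match ignores case, since the capitalization of the marker may differ from how it is typed here.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Case-insensitive marker for the crash entry point.
MARKER = re.compile(r"@bluescreen", re.IGNORECASE)


def find_crash_entries(lines: Iterable[str], context: int = 3) -> List[Tuple[int, List[str]]]:
    """Return (1-based line number, matching line plus `context` following lines)
    for every line that contains the marker."""
    buffered = [line.rstrip("\n") for line in lines]
    hits = []
    for index, line in enumerate(buffered):
        if MARKER.search(line):
            hits.append((index + 1, buffered[index:index + context + 1]))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path is supplied by the caller; no particular file name is assumed.
    with open(sys.argv[1], errors="replace") as handle:
        for number, block in find_crash_entries(handle):
            print(f"--- match at line {number} ---")
            print("\n".join(block))
```

Run it with the path to the text file as the only argument. Each match is printed with the few lines that follow it, which is usually where the interesting part of the crash begins.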

Once you find this entry, you will see the cause of the PSOD. In the screenshot below, the error is related to E1000. That should immediately make you think of vNICs and drivers, and prompt a search for VMware KB articles on the errors generated. In this case, it is a known issue affecting several versions of vSphere that has already been patched, so keeping up to date on patches is very important.

[Screenshot: E1000 PSOD]

The issue that triggered the PSOD in this environment came down to a fix that had not been applied. The workaround was to use a VMXNET3 NIC on the VM rather than E1000e. Also, you HAVE to install VMTools on your VMs; they include drivers the VM needs to work properly. In this particular instance, VMTools were not installed on the VM. Once the tools were installed and the vNIC was switched to VMXNET3, the issue was resolved.
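To check whether other VMs are still on an E1000-family adapter, one option is to look at each VM's .vmx file, where the adapter type is normally recorded in an `ethernetN.virtualDev` entry. The sketch below is only an illustration, assuming you have the .vmx contents as text; the function names are hypothetical. An adapter with no `virtualDev` line falls back to a default that depends on the guest OS, and this sketch does not report those.

```python
import re
from typing import Dict, List

# Matches lines such as: ethernet0.virtualDev = "e1000"
NIC_LINE = re.compile(
    r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"',
    re.IGNORECASE | re.MULTILINE,
)


def nic_types(vmx_text: str) -> Dict[str, str]:
    """Map each adapter name (for example 'ethernet0') to its declared type."""
    return {m.group(1).lower(): m.group(2).lower() for m in NIC_LINE.finditer(vmx_text)}


def e1000_adapters(vmx_text: str) -> List[str]:
    """Return the adapters explicitly declared as e1000 or e1000e."""
    return sorted(
        name for name, kind in nic_types(vmx_text).items()
        if kind in {"e1000", "e1000e"}
    )
```

Any adapter this returns is a candidate for switching to VMXNET3, once VMTools are installed in the guest.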

 

Refer to VMware’s KB2059053 for more info.
