Today I got another question about several Windows VMs breaking down and generating lots of errors after a SAN switch failed and the active paths moved over through a path failover. I have seen this behavior before at several customers that use SAN storage with Windows as the guest operating system in the VM. I had to dig around in my memory to recall which setting to change to 60 seconds, so I thought I would write this blog post for future reference.

Path failover refers to situations when the active path to a LUN is changed from one path to another, usually because of some SAN component failure along the current path. A server usually has one or two HBAs and each HBA is connected to one or two storage processors on a given SAN array. You can determine the active path, the path currently used by the server, by looking at the LUN properties.

When an FC cable is pulled, I/O might pause for 30-60 seconds until the FC driver determines that the link is unavailable and failover has occurred. As a result, the virtual machines, with their virtual disks installed on SAN storage, can appear unresponsive. If you attempt to display the host, its storage devices, or its adapter, the operation might appear to stall. After failover is complete, I/O resumes normally.

In case of disastrous events that include multiple breakages, all connections to SAN storage devices might be lost. If none of the connections to the storage device is working, some virtual machines might encounter I/O errors on their virtual SCSI disks.

To avoid this behavior, increase the standard disk timeout value in the Windows guest operating system. This helps prevent disruptions during a path failover, a heavy backup, or a VSS snapshot on a specific LUN or VMDK residing on a SAN.

The procedure below is taken from the vSphere 5 Documentation Center, which is also a very handy online resource for lots of vSphere 5 related material.

Set Timeout on Windows Guest OS

This procedure explains how to change the timeout value by using the Windows registry. Increase the standard disk timeout value on a Windows guest operating system to avoid disruptions during a path failover.

Prerequisites

Back up the Windows registry before you change the value mentioned below.

Procedure

  1. Select Start > Run.
  2. Type regedit.exe, and click OK.
  3. In the left-panel hierarchy view, double-click HKEY_LOCAL_MACHINE > System > CurrentControlSet > Services > Disk.
  4. Double-click TimeOutValue.
  5. Set the value data to 0x3c (hexadecimal) or 60 (decimal) and click OK.
  6. Reboot the guest OS for the change to take effect.
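
For those who prefer a script over clicking through regedit, the registry steps above can be sketched in Python. This is a minimal sketch, not part of the VMware procedure: the function names `reg_add_command` and `set_disk_timeout` are my own, and it assumes Python is installed in the guest and the script runs with administrator rights. It uses the standard-library `winreg` module, which exists only on Windows, so the import is kept inside the function. A reboot is still needed afterwards.

```python
# Registry location and value taken from the procedure above.
DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
VALUE_NAME = "TimeOutValue"
TIMEOUT_SECONDS = 60  # 0x3c in hexadecimal


def reg_add_command(seconds=TIMEOUT_SECONDS):
    """Return the equivalent reg.exe argument list (hypothetical helper)."""
    return [
        "reg", "add", "HKLM\\" + DISK_KEY,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", str(seconds),
        "/f",  # overwrite an existing value without prompting
    ]


def set_disk_timeout(seconds=TIMEOUT_SECONDS):
    """Write TimeOutValue as a REG_DWORD (Windows only, needs admin rights)."""
    import winreg  # imported lazily so this file also loads on other platforms

    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, DISK_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, seconds)
```

Calling `set_disk_timeout()` from an elevated prompt inside the guest writes the value directly. `reg_add_command()` only builds the matching reg.exe command line, which is handy if you would rather review or push the command through another tool.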

After you make this change, Windows waits at least 60 seconds for delayed disk operations to complete before it generates errors.
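
To check that a guest really has the 60 second value, you could read it back with reg.exe and parse the result. The sketch below is an illustration under assumptions: `parse_timeout` and `current_timeout` are hypothetical names, and the sample text reflects the output layout I would expect from `reg query` (value name, type, then the data in hexadecimal), so verify it against your own guest before relying on it.

```python
import subprocess

QUERY = [
    "reg", "query", r"HKLM\SYSTEM\CurrentControlSet\Services\Disk",
    "/v", "TimeOutValue",
]


def parse_timeout(reg_output):
    """Return the timeout in seconds from reg.exe query output, or None."""
    for line in reg_output.splitlines():
        fields = line.split()
        # Assumed value line layout: TimeOutValue  REG_DWORD  0x3c
        if len(fields) == 3 and fields[0].lower() == "timeoutvalue":
            return int(fields[2], 16)
    return None


def current_timeout():
    """Run reg.exe inside the Windows guest and parse what it prints."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_timeout(result.stdout)


# Assumed sample output, parsed without touching a real registry.
sample = """
HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\Disk
    TimeOutValue    REG_DWORD    0x3c
"""
print(parse_timeout(sample))  # prints 60
```

On a real guest, `current_timeout()` should return 60 once the change has been made. Keep in mind that the registry value can read 60 before the reboot, while the disk driver only picks it up afterwards.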

Automating the setting

Jase McCarty from Jase’s Place has done some nice automation around this issue. See also:

  • Set Guest OS I/O Timeout Settings via PowerShell
  • Set Guest OS I/O Timeout Settings via Reg.exe