Duncan Epping, Consulting Architect, Cloud Practice

Frank Denneman, Consulting Architect, PSO

Duncan and Frank are the authors of the VMware vSphere 4.1 HA and DRS technical deep dive. It is available from Amazon, and from Monday it will also be available at Computer Collectief, where you can already order it today. The book is definitely worth reading. In the session they answered questions from the audience.


Have the DRS algorithms changed in vSphere 4.1? Can you give some more information on how VMs are distributed over hosts in case of an HA event?

The changes are more in HA than in DRS itself. In vSphere 4.0, HA checked all hosts to decide where to start the VM. As a result it took a long time before a VM was actually started, and it also put a heavy load on the hostd process on the ESX host. In vSphere 4.1 the process is totally different. The VMs are placed across the ESX hosts in the cluster according to a round-robin principle. On the first host, HA checks whether the portgroup and datastores that the VM needs exist, and then it starts the VM. The next VM is started on the next ESX host. VMs are started faster and the load on hostd is almost non-existent.
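The round-robin placement described above can be illustrated with a small sketch. This is not VMware's implementation: the `Host` and `VM` classes, the `place_round_robin` function and the rule that an incompatible host is simply skipped are all assumptions made for illustration. The only behaviour taken from the answer is that HA checks for the required portgroup and datastore and then moves on to the next host for the next VM.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    portgroups: set
    datastores: set
    started: list = field(default_factory=list)  # VMs restarted on this host


@dataclass
class VM:
    name: str
    portgroup: str
    datastore: str


def is_compatible(host, vm):
    # The only checks mentioned in the session: portgroup and datastore exist.
    return vm.portgroup in host.portgroups and vm.datastore in host.datastores


def place_round_robin(hosts, vms):
    """Restart each VM on the next compatible host in rotation.

    Returns a dict mapping VM name to host name, or to None when no host
    offers the portgroup and datastore the VM needs.
    """
    placement = {}
    start = 0  # index of the host that is tried first for the next VM
    for vm in vms:
        placement[vm.name] = None
        for offset in range(len(hosts)):
            idx = (start + offset) % len(hosts)
            host = hosts[idx]
            if is_compatible(host, vm):
                host.started.append(vm.name)
                placement[vm.name] = host.name
                start = (idx + 1) % len(hosts)  # next VM starts at the next host
                break
    return placement
```

Because each VM is checked against one host at a time instead of against every host in the cluster, the work per restart stays small, which matches the lower hostd load mentioned in the answer.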

The most common misconception is that HA and DRS work together. DRS doesn’t do anything after an HA event. DRS only kicks in when the load on an ESX host rises above the threshold.

Will there be an integration between HA and DRS? What will happen to HA in the next version?

HA stack is (being) rewritten:

  • all new architecture which results in a single lightweight HA agent process
  • eliminate concepts of “primaries”
  • storage heartbeating as backup communication channel
  • automatic resolution of network partitions
  • VMs still protected during partitions, no “fighting” for VM control
  • greater scalability, extensible (no more 32 host limit, but more like 200)
  • ability to deal with any number of simultaneous host failures
  • new lightweight communication model
  • all state required to recover from any failure is persisted
  • improved isolation actions (VMs left running and restarted as needed via storage subsystem monitoring)
  • no dependencies on DNS

How about DRS host groups?

In the pre-4.1 versions of vSphere, affinity and anti-affinity rules were used to keep VMs together or apart. In 4.1 you can use DRS host groups to have more control over where VMs are placed. For example, you can keep all Oracle servers together on a limited number of ESX hosts. Another example is splitting DNS servers across blade enclosures. Two kinds of rules exist: must (mandatory) rules and should rules. If a VM falls under a must rule, it will not be started when HA/DRS cannot place it without breaking the rule. Should rules are more lenient: the rule is obeyed where possible, but when it can’t be, the VM is started anyway.
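The difference between must and should rules can be sketched as follows. This is a simplified model, not the actual DRS logic: the `VmHostRule` class, the `choose_host` function and the choice to pick the first remaining candidate are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmHostRule:
    vm_group: frozenset    # VMs the rule applies to
    host_group: frozenset  # hosts those VMs are tied to
    mandatory: bool        # True = "must" rule, False = "should" rule


def choose_host(vm, available_hosts, rules):
    """Pick a host for vm, or return None when a must rule cannot be met."""
    candidates = list(available_hosts)
    # Evaluate must rules first so a should rule never narrows the
    # candidates in a way that makes a must rule impossible to satisfy.
    for rule in sorted(rules, key=lambda r: not r.mandatory):
        if vm not in rule.vm_group:
            continue
        preferred = [h for h in candidates if h in rule.host_group]
        if preferred:
            candidates = preferred
        elif rule.mandatory:
            return None  # must rule: the VM is not started at all
        # should rule that cannot be met: ignore it and keep all candidates
    return candidates[0] if candidates else None
```

With a must rule tying the Oracle VMs to two hosts, the VM stays powered off when neither host is available; with the same rule defined as a should rule, it is started on whatever host remains.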

How would you design a cluster with a lot of blades?

Primary hosts are used for the HA process. When all primary hosts fail, for example because they are placed in the same blade enclosure, VMs are never restarted. If you use multiple enclosures, split your clusters across them. For more information about primary hosts and HA, check yellow-bricks.com.
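A design like this can be sanity-checked with a small script. The sketch below assumes you already know which hosts are primaries and which enclosure each host sits in; the function name and the input format are hypothetical, and nothing here queries vCenter or the HA agents.

```python
def enclosure_holding_all_primaries(primaries, enclosure_of):
    """Return the enclosure that contains every primary host, or None.

    primaries:    iterable of host names that currently act as HA primaries
    enclosure_of: dict mapping host name to blade enclosure name
    """
    enclosures = {enclosure_of[host] for host in primaries}
    if len(enclosures) == 1:
        # Losing this single enclosure would take out all primaries,
        # which is the situation in which VMs are never restarted.
        return next(iter(enclosures))
    return None
```

A return value other than None means the cluster should be spread over more enclosures.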

Why does VMotion ask for a target resource pool?

The answer is that developers aren’t admins. They didn’t realize that this option isn’t used in normal operations. In a future version VMotion will no longer ask for a resource pool.

DRS works together with FT. Why is there a requirement for EVC?

The Yellow Bricks answer, according to Frank, is: because it is more efficient. The real reason is that VMware wants to make sure that the hosts are compatible for VMotion. Requiring EVC ensures compatibility between the hosts in a cluster.

What happens when using DPM and DRS?

DPM shuts down the smallest hosts first and then the larger ones to make sure that [...]. Also make sure that you have enabled admission control.

Is there a time window for DRS, for example no DRS during backups?

This is being discussed within VMware as well. The only option right now is to disable and enable the rules with PowerShell. You could also configure the aggressiveness of VMotion.

What decides which host is primary node for HA?

This is a random action. It is calculated on different things.