Know thy Hypervisor’s limits: VMware HA, DRS and DPM

Last week I was setting up a vSphere cluster and, like any good admin, I was test driving all its features to make sure everything worked. As a side note, I’m trying to squeeze as much value as possible out of the vSphere licenses we currently have, so I’ve set this cluster up with lots of the bells and whistles ESXi has, like:

  • Distributed virtual Switches
  • SDRS Datastore Clusters
  • NetIOC and SIOC
  • VMware HA, DRS and Distributed Power Management

(In v5.1 a lot of them have gotten better, more mature, less buggy)

So here I was with this 5 host cluster (ESXi1 to 5) set up nicely, testing VMware’s HA in a slightly different scenario: one where DPM said “hey, power down these 3 hosts, you don’t need them, you have enough capacity”. Fine…“Apply DRS Recommendation” I did, and the hosts went to standby.

So there I had my 201 test VMs, 2 vCPUs each, running on just 2 servers (mind you, this cluster is just for testing, so the VMs were mostly idle). Time to do some damage:

Let me tell you what happens if you have one of the remaining blades go down, say ESXi1:

  1. First HA kicks in and notices that ESXi1 is not responding to heartbeats over either the IP or the storage networks, which means the host is down, not isolated.
  2. HA figures out that ESXi1 had protected VMs running on it and starts powering them on elsewhere. DRS will also eventually figure out that you need more capacity and start powering on some of your standby hosts.
  3. HA powers on nearly all your VMs just fine, but by the time it finished DRS still had not brought any standby hosts online…so I ended up with 1 VM that HA did not manage to power on. Bummer!

I investigated this issue. First stop: the event viewer on the surviving host, which had this to say:

“vSphere HA unsuccessfully failed over TestVM007. vSphere HA will retry if the maximum number of attempts has not been exceeded.

Reason: Failed to power on VM. warning

6/27/2013 1:39:00 PM
TestVM007”

Related events also showed this information:

“Failed to power on VM.
Could not power on VM : Number of running VCPUs limit exceeded.
Max VCPUs limit reached: 400 (7781 worlds)
SharedArea: Unable to find ‘vmkCrossProfShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘nmiShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘testSharedAreaPtr’ in SHARED_PER_VM_VMX area.”

I then went straight to the ESXi 5.1 configuration maximums…for sure an ESXi host can take more than 400 vCPUs right? And there it was, page 2:

“Virtual CPUs per host 2048”

OK…I’m at 400, nowhere near that limit. Then I found this KB article from VMware…no help, since it seems to apply to ESXi 4.x, not 5.1. You also can’t find the advanced configuration item it mentions: I looked for it using the Web Client, the vSphere Client and PowerCLI, and it’s not there. However, you do see the value listed if you run this from an SSH session on the target host…though I suspect it is not configurable:

# esxcli system settings kernel list | grep maxVCPUsPerCore

maxVCPUsPerCore uint32 Max number of VCPUs should run on a single core. 0 == determine at runtime 0 0 0

I went back to the maximums document and read, also on page 2:

“Virtual CPUs per core 25”

OK…So the message said 400vCPU limit reached:

400/25 = 16 – the exact number of cores I have on my ESXi boxes.

Eureka-Moment

So kids, I managed to reach one of ESXi’s limits with my configuration. Which makes you wonder a little about running high density ESXi hosts…and VMware’s claim that they can run 2048 vCPUs per host…sure they can, but at 25 vCPUs per core that takes about 82 physical cores in one box 🙂

My hosts had 16 pCPUs and 192GB RAM, and half the RAM slots were still empty, so I could in theory double the RAM and stuff each ESXi server with 200 VMs, each with 2 vCPUs and 1-2GB RAM. That is exactly 400 vCPUs per host, so there would be no headroom left and I would not be able to fail over if a host went down, among other scenarios.

Where else does the vCPU to pCPU limit manifest itself?

I also tried to see what other scenarios might cause vCenter and ESXi to act up and misbehave. Here’s what I’ve got so far:

Scenario A: 5 host DRS cluster, with VMware HA enabled, percentage based failover, and admission control enabled. 201 VMs running, 2vCPUs each:

  • Put hosts in maintenance mode one at a time, until DRS stops migrating VMs for you and “Enter maintenance mode” is stuck at 2%. Note that no VMs are migrated from that last evacuated host, but this is due to admission control, not to hitting the vCPU ceiling.
  • Once you reach that point DRS will issue a fault saying it has no more resources, and will not initiate a single vMotion. The error I got was:

“Insufficient resources to satisfy configured failover level for vSphere HA.

Migrate TestVM007 from ESXi1.contoso.com to any host”

Scenario B: 5 Host DRS cluster, with VMware HA enabled, percentage based failover, and admission control disabled. 401 VMs running, 2vCPUs each:

  • Put hosts in maintenance mode one at a time, until DRS stops migrating VMs for you and “Enter maintenance mode” is stuck at 2%. Note that this time VMs are migrated from the evacuated host, and the vMotions stop only when you hit the 400 vCPU limit on the remaining hosts.
  • The error you get is:

“The VM failed to resume on the destination during early power on.

Failed to power on VM.
Could not power on VM : Number of running VCPUs limit exceeded.
Max VCPUs limit reached: 400 (7781 worlds)
SharedArea: Unable to find ‘vmkCrossProfShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘nmiShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘testSharedAreaPtr’ in SHARED_PER_VM_VMX area.”

So a similar message to the one you get when HA can’t power on a VM.

To wrap this up, I think there might be some corner cases where people might start to see this behaviour (I’m thinking in VDI environments mostly), and it would be very wise to take a serious look at the vCPU : pCPU ratio in failover scenarios to avoid hitting vSphere ESXi’s maximum values.
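To make that check concrete, here is a minimal PowerShell sketch of the arithmetic from this post. The function names and parameters are hypothetical (they are mine, not part of PowerCLI or any VMware tooling). The 25 vCPUs per core figure is the one quoted from the ESXi 5.1 configuration maximums above, and the sketch assumes identical hosts and ignores memory and admission control entirely.

```powershell
# Hypothetical helpers for illustration only; not part of any VMware module.

# Running-vCPU ceiling of one host: cores x vCPUs-per-core (25 in ESXi 5.1).
function Get-VCpuCeiling {
    param(
        [int]$Cores,
        [int]$VCpusPerCore = 25
    )
    $Cores * $VCpusPerCore
}

# Would all running vCPUs still fit on the powered-on hosts that survive a failure?
# Assumes every host has the same core count; standby hosts do not count.
function Test-VCpuFailoverFit {
    param(
        [int]$PoweredOnHosts,
        [int]$CoresPerHost,
        [int]$VmCount,
        [int]$VCpusPerVm,
        [int]$FailedHosts = 1
    )
    $needed    = $VmCount * $VCpusPerVm
    $survivors = $PoweredOnHosts - $FailedHosts
    $available = $survivors * (Get-VCpuCeiling -Cores $CoresPerHost)
    [pscustomobject]@{
        NeededVCpus    = $needed
        AvailableVCpus = $available
        Fits           = ($needed -le $available)
    }
}

# The situation from this post: DPM left 2 hosts powered on, 16 cores each,
# running 201 VMs with 2 vCPUs each, and then one host failed.
$result = Test-VCpuFailoverFit -PoweredOnHosts 2 -CoresPerHost 16 -VmCount 201 -VCpusPerVm 2
$result
```

For my test cluster this reports 402 vCPUs needed against 400 available, which is exactly the one 2 vCPU VM that HA could not power on.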

vSphere Web Client 5.1 on Windows Server 2012 not starting up – Adobe Flash Player unavailable

I’ve been doing some work this past week setting up a new vSphere based datacenter for a customer, and I’ve found that working with the latest and greatest versions of Windows and vCenter doesn’t always go without a hitch, as it should. For example, if you try to run the vSphere Client on Windows Server 2012, you will have to jump through quite a number of hoops to get it to run. But wait…the vSphere Client is deprecated as of version 5.0, and although I still think it is better than the Web Client as of today, June 2013, the way forward is the Web Client. We will not be able to access 5.x features from the vSphere Client from now on, so we had better get with the program and work on using the vSphere Web Client only.

The first step when you try to run the client is to connect directly to the vCenter web page, let’s say https://vcenter.contoso.com. If you are on a default Windows Server 2012 install, with IE Enhanced Security Configuration enabled, you will get a broken web page like the one below:

vCenter-Broken-Page

There’s an easy fix to this:

  • First disable IE Enhanced Security Configuration from the Server Manager GUI. Do not disable IE ESC if this is a server running production workloads unless you understand and accept the risks involved.
  • Then simply go to Internet Options > Security > Local Intranet > Sites, add your vCenter URL, click OK to close all the windows and then reload your page.

After that the web page opens up correctly, you click the “Log in to vSphere Web Client” link…And surprise, you get this message:

vSphere-webClient-broken

Isn’t this great…you now need Adobe Flash Player installed on the system. I should have known better: VMware’s Web Client is built using Adobe Flex technology, so you do need Flash Player to run it.

So, happily you click on the “Get Adobe Flash Player” link, and you are redirected to Adobe’s page that says…“Hey, you are running Windows 8, you should already have Flash Player installed”. Well, actually I’m running Windows Server 2012, so things might be a little more locked down than that. I followed the troubleshooting FAQ from Adobe’s website only to find out I didn’t have the plugin installed. As I said, this is Windows Server 2012, and not all the bells and whistles are turned on.

I did a little digging and, as it turns out, Windows Server 2012 has a Windows feature called “Desktop Experience”. Among the things this feature brings, this webpage lists “Adobe Flash Player” in the miscellaneous section. That webpage will also guide you through the GUI based installation of Desktop Experience; it’s basically click click, next next until you get the damn thing installed.

If you want to do it via Powershell, here’s the one liner that will help you out. Either way you choose to install it, you will need a reboot once it is completed:

import-module ServerManager
Add-WindowsFeature -name Desktop-Experience,qWave `
-IncludeAllSubFeature -IncludeManagementTools

The qWave feature is Quality Windows Audio-Video Experience – this just in case you were wondering what that is doing there…someone on MSDN suggested to add this to Desktop Experience…so I did, you can leave it out.

After rebooting the box, go to the Web Client page and, lo and behold, the webpage will load correctly. If this is your first time loading up the Web Client, after successful authentication you will get one error and a message from Adobe Flash.

The error is something like “Web client has encountered an error #Some Number”. On top of it comes Adobe Flash with a message asking permission for the webpage to store more than 1MB of Flash data on your hard drive. Click Accept, then on the error message choose to reload the client, and all will be well with the VMware world again: you can manage your infrastructure without other annoyances.

Hope this helps others out there. Let me know if this worked for you or not.

Report DHCP Scope Settings using Powershell

It has been a busy time for me lately, but I’m back here to write about a script to Report on some basic DHCP scope settings. In my situation I used this script to find out which DHCP scopes had specific DNS servers configured, DNS servers that we planned to decommission, so it made sense to replace the IP addresses with valid ones.

Lately I’ve found myself working more and more with Powershell V3, available in Windows Server 2012, and the new “goodies” it brings.

Among those goodies there’s a DHCPServer module, so we can finally breathe a sigh of relief, we can dump netsh and any VBS kludges used to manage DHCP!*

(* lovely as this module is, you cannot use it fully against Windows Server 2003; some cmdlets will work, others not so much, so Windows 2008 or later it is)

For an overview of what cmdlets are available in this new module, take a look at the Technet Blogs. To get started, simply deploy a Windows 2012 machine and open Powershell, then type:

import-module DhcpServer

While you are at it, update the help files for all your Powershell modules with this command:

Update-Help -Module * -Force -Verbose

Mission Statement

I needed a report that would contain the following info: DHCP server name, scope name, subnet defined, start and end ranges, lease times, description, and the DNS servers configured, whether globally or explicitly defined. As you can imagine, collating all this information from netsh, VBS, or other parsing methods would be kind of time consuming. I’m also aware there are DHCP modules out there for Powershell, but personally I prefer to use a vendor-supported method, even if it takes more effort to put together and understand (you never know when someone’s Powershell module goes out of date, for whatever reason, and all your scripting work built on it becomes useless).

The Script

Anyway, I threw this script together; there isn’t much to it apart from the error handling. As I mentioned before, the DhcpServer module doesn’t work 100% unless the target server is running Windows 2008 or later.

import-module DHCPServer
#Get all authorized DHCP servers from the AD configuration
$DHCPs = Get-DhcpServerInDC
$filename = "c:\temp\AD\DHCPScopes_DNS_$(get-date -Uformat "%Y%m%d-%H%M%S").csv"

$Report = @()
$k = $null
write-host -foregroundcolor Green "`n`n`n`n`n`n`n`n`n"
foreach ($dhcp in $DHCPs) {
	$k++
	Write-Progress -activity "Getting DHCP scopes:" -status "Percent Done: " `
	-PercentComplete (($k / $DHCPs.Count)  * 100) -CurrentOperation "Now processing $($dhcp.DNSName)"
    $scopes = $null
	$scopes = (Get-DhcpServerv4Scope -ComputerName $dhcp.DNSName -ErrorAction:SilentlyContinue)
    If ($scopes -ne $null) {
        #getting global DNS settings, in case scopes are configured to inherit these settings
        $GlobalDNSList = $null
        $GlobalDNSList = (Get-DhcpServerv4OptionValue -OptionId 6 -ComputerName $dhcp.DNSName -ErrorAction:SilentlyContinue).Value
		$scopes | % {
			$row = "" | select Hostname,ScopeID,SubnetMask,Name,State,StartRange,EndRange,LeaseDuration,Description,DNS1,DNS2,DNS3,GDNS1,GDNS2,GDNS3
			$row.Hostname = $dhcp.DNSName
			$row.ScopeID = $_.ScopeID
			$row.SubnetMask = $_.SubnetMask
			$row.Name = $_.Name
			$row.State = $_.State
			$row.StartRange = $_.StartRange
			$row.EndRange = $_.EndRange
			$row.LeaseDuration = $_.LeaseDuration
			$row.Description = $_.Description
            $ScopeDNSList = $null
            $ScopeDNSList = (Get-DhcpServerv4OptionValue -OptionId 6 -ScopeID $_.ScopeId -ComputerName $dhcp.DNSName -ErrorAction:SilentlyContinue).Value
            #write-host "Q: Use global scopes?: A: $(($ScopeDNSList -eq $null) -and ($GlobalDNSList -ne $null))"
            If (($ScopeDNSList -eq $null) -and ($GlobalDNSList -ne $null)) {
                $row.GDNS1 = $GlobalDNSList[0]
                $row.GDNS2 = $GlobalDNSList[1]
                $row.GDNS3 = $GlobalDNSList[2]
                $row.DNS1 = $GlobalDNSList[0]
                $row.DNS2 = $GlobalDNSList[1]
                $row.DNS3 = $GlobalDNSList[2]
                }
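            ElseIf ($ScopeDNSList -ne $null) {
                #only index the scope list when it exists; indexing $null raises an error
                $row.DNS1 = $ScopeDNSList[0]
                $row.DNS2 = $ScopeDNSList[1]
                $row.DNS3 = $ScopeDNSList[2]
                }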
			$Report += $row
			}
		}
	Else {
        write-host -foregroundcolor Yellow """$($dhcp.DNSName)"" is either running Windows 2003, or is somehow not responding to queries. Adding to report as blank"
		$row = "" | select Hostname,ScopeID,SubnetMask,Name,State,StartRange,EndRange,LeaseDuration,Description,DNS1,DNS2,DNS3,GDNS1,GDNS2,GDNS3
		$row.Hostname = $dhcp.DNSName
		$Report += $row
		}
	write-host -foregroundcolor Green "Done Processing ""$($dhcp.DNSName)"""
	}

$Report  | Export-csv -NoTypeInformation -UseCulture $filename
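Since the whole point of the report was to find scopes still pointing at DNS servers due for decommissioning, a filter like the following can be run over the result. This is a minimal sketch: Select-ScopeUsingDns is a hypothetical helper name, the sample rows and IP addresses are made up, and it only looks at the DNS1 to DNS3 columns that the script above fills in.

```powershell
# Hypothetical helper: return the report rows whose DNS1..DNS3 columns
# contain at least one of the DNS servers we plan to decommission.
function Select-ScopeUsingDns {
    param(
        [object[]]$Report,
        [string[]]$OldDnsServers
    )
    $Report | Where-Object {
        $row = $_
        # Collect the DNS columns that are actually filled in for this scope
        $dns = @($row.DNS1, $row.DNS2, $row.DNS3) | Where-Object { $_ }
        # Keep the row if any of them is on the decommission list
        @($dns | Where-Object { $OldDnsServers -contains "$_" }).Count -gt 0
    }
}

# Made-up sample rows with the same column names the report uses
$sample = @(
    [pscustomobject]@{ Hostname = 'dhcp1.contoso.com'; ScopeID = '10.0.1.0'; DNS1 = '10.0.0.10'; DNS2 = '10.0.0.11'; DNS3 = $null },
    [pscustomobject]@{ Hostname = 'dhcp1.contoso.com'; ScopeID = '10.0.2.0'; DNS1 = '10.0.0.20'; DNS2 = $null; DNS3 = $null }
)

$hits = @(Select-ScopeUsingDns -Report $sample -OldDnsServers '10.0.0.11')
$hits | Select-Object Hostname, ScopeID, DNS1, DNS2, DNS3
```

Against the real data you would pass $Report instead of $sample, or the rows read back from the exported CSV with Import-Csv.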

Learning Points

As far as learning points go, Get-DHCPServerInDC lets you grab all your authorized DHCP servers in one swift line, saved me a few lines of coding against the Powershell AD module.

Get-DhcpServerv4Scope will grab all IPv4 server scopes, nothing fancy, except for the fact that it doesn’t really honor the “ErrorAction:SilentlyContinue” switch and lights up your console when you run the script.

Get-DhcpServerv4OptionValue can get scope options, either globally (do not specify a ScopeID) or on a per scope basis by specifying a scopeID. This one does play nice and gives no output when you ask it to SilentlyContinue.

Some Error Messages

I’ve tested this script in my lab and used it in production. It works fine in my environment, but do your own testing.

Unfortunately, the output is not so nice and clean. You do get errors, but the script rolls over them. Below are a couple I’ve seen. The first one looks like this:

Get-DhcpServerv4Scope : Failed to get version of the DHCP server dc1.contoso.com.
At C:\Scripts\Get-DHCP-Scopes-2012.ps1:14 char:13
+ $scopes = (Get-DhcpServerv4Scope -ComputerName $dhcp.DNSName -ErrorAction:Silen ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 + CategoryInfo : NotSpecified: (dc1.contoso.com:root/Microsoft/...cpServerv4Scope) [Get-DhcpServerv4Scope], CimException
 + FullyQualifiedErrorId : WIN32 1753,Get-DhcpServerv4Scope

This happens because Get-DhcpServerv4Scope has a subroutine to check the DHCP server version, and that check fails. As you can see, my code does have SilentlyContinue to suppress the error, but it still shows up. I dug up the 1753 error code, and the error message is “There are no more endpoints available from the endpoint mapper”…which is, I guess, Powershell’s way of telling us that Windows 2003 is not supported. This is what we get for playing with v1 of this module.

Another error I’ve seen is this:

Get-DhcpServerv4Scope : Failed to enumerate scopes on DHCP server dc1.contoso.com.
At C:\Scripts\Get-DHCP-Scopes-2012.ps1:14 char:13
+ $scopes = (Get-DhcpServerv4Scope -ComputerName $dhcp.DNSName -ErrorAction:Silen ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 + CategoryInfo : PermissionDenied: (dc1.contoso.com:root/Microsoft/...cpServerv4Scope) [Get-DhcpServerv4Scope], CimException
 + FullyQualifiedErrorId : WIN32 5,Get-DhcpServerv4Scope

It is just a plain old permission denied: you need to be an admin of the box you are running against…or at least a member of DHCP Administrators, I would think.

As for setting the correct DNS servers on option 6, you can use the same module to do it. I did it by hand, since there were just a handful of scopes.
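If you have more than a handful of scopes, the replacement can be scripted too. Below is a rough sketch, not something I ran in production: Get-ReplacementDnsList is a hypothetical helper name and the IP addresses are made up. It only works out the new DNS list for a scope; the actual write would go through Set-DhcpServerv4OptionValue and its -DnsServer parameter, which is left commented out here.

```powershell
# Hypothetical helper: swap decommissioned DNS servers for their replacements,
# keeping the original order and dropping any duplicates that result.
function Get-ReplacementDnsList {
    param(
        [string[]]$Current,
        [hashtable]$Replacements
    )
    $new = foreach ($ip in $Current) {
        if ($Replacements.ContainsKey($ip)) { $Replacements[$ip] } else { $ip }
    }
    @($new | Select-Object -Unique)
}

# Made-up example: 10.0.0.11 is being retired in favour of 10.0.0.21
$map     = @{ '10.0.0.11' = '10.0.0.21' }
$newList = @(Get-ReplacementDnsList -Current '10.0.0.10', '10.0.0.11' -Replacements $map)
$newList

# Assumed usage against a real server (untested here, so try it on a lab scope first):
# Set-DhcpServerv4OptionValue -ComputerName 'dhcp1.contoso.com' -ScopeId 10.0.1.0 -DnsServer $newList
```

Keeping the list calculation separate from the write makes it easy to print and review the new values before anything is changed on the DHCP server.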

Hope this helps someone out there with their DHCP Reporting.