Know thy Hypervisor’s limits: VMware HA, DRS and DPM

Last week I was setting up a vSphere cluster and, like any good admin, I was test driving all its features to make sure everything worked. As a side note, I’m trying to squeeze as much value as possible out of the vSphere licenses we currently have, so I set this cluster up with lots of the bells and whistles ESXi has, like:

  • Distributed virtual Switches
  • SDRS Datastore Clusters
  • NetIOC and SIOC
  • VMware HA, DRS and Distributed Power Management

(In v5.1 a lot of them have gotten better, more mature, less buggy)

So there I was with this 5-host cluster (ESXi1 to ESXi5) set up nicely, testing VMware HA in a slightly different scenario: one where DPM had said “hey, power down these 3 hosts, you don’t need them, you have enough capacity”. Fine… “Apply DRS Recommendation” I did, and the hosts went to standby.

So there I had my 201 dual-core test VMs running on just 2 servers (mind you, this cluster is just for testing, so the VMs were mostly idle). Time to do some damage:

Let me tell you what happens if you have one of the remaining blades go down, say ESXi1:

  1. First HA kicks in and notices that ESXi1 is not responding to heartbeats via the IP or storage networks, which means the host is down, not isolated.
  2. HA figures out that ESXi1 had protected VMs running on it and starts powering them on elsewhere. DRS will also eventually figure out that you need more capacity and start powering on some of your standby hosts.
  3. HA powers on almost all of your VMs just fine, but by the time it finished, DRS still had not brought any standby hosts online… so I ended up with 1 VM that HA did not manage to power on. Bummer!

I investigated this issue. First stop – event viewer on the surviving host had this to say

“vSphere HA unsuccessfully failed over TestVM007. vSphere HA will retry if the maximum number of attempts has not been exceeded.

Reason: Failed to power on VM. warning

6/27/2013 1:39:00 PM
TestVM007″

Related events also showed this information:

“Failed to power on VM.
Could not power on VM : Number of running VCPUs limit exceeded.
Max VCPUs limit reached: 400 (7781 worlds)
SharedArea: Unable to find ‘vmkCrossProfShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘nmiShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘testSharedAreaPtr’ in SHARED_PER_VM_VMX area.”

I then went straight to the ESXi 5.1 configuration maximums…for sure an ESXi host can take more than 400 vCPUs right? And there it was, page 2:

“Virtual CPUs per host 2048”

OK… I’m at 400, nowhere near that limit. Then I found this KB article from VMware… no help, since it seems to apply to ESXi 4.x, not 5.1. You also can’t find the advanced configuration item it mentions: I looked for it using the Web Client, the vSphere Client and PowerCLI, and it’s not there. However, you do see the value listed if you run this from an SSH session on the target host, although I suspect it is not configurable:

# esxcli system settings kernel list | grep maxVCPUsPerCore

maxVCPUsPerCore uint32 Max number of VCPUs should run on a single core. 0 == determine at runtime 0 0 0

I go back to the maximums document and read also on page 2:

“Virtual CPUs per core 25”

OK… So the message said a 400 vCPU limit was reached:

400/25 = 16 – the exact number of cores I have on my ESXi boxes.

Eureka-Moment

So kids, I managed to reach one of ESXi’s limits with my configuration. Which makes you wonder a little about running high-density ESXi hosts… and about VMware’s claim that they can run 2000 vCPUs… sure they can, if you can run 40 physical ones in one box 🙂

My hosts had 16 pCPUs and 192 GB of RAM, and half the RAM slots were still empty, so in theory I could double the RAM and stuff each ESXi server with 200 VMs, each with 2 vCPUs and 1-2 GB of RAM… and I would not be able to fail over in case of a host failure, among other scenarios.
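The arithmetic above can be sketched in a few lines of PowerShell. This is only an illustration: the function names are made up for this post (they are not PowerCLI cmdlets), and the 25 vCPUs per core figure is the one quoted from the ESXi 5.1 configuration maximums, so treat it as an assumption for any other version.

```powershell
# Hypothetical helpers, not part of PowerCLI; they only redo the math above.
function Get-VCpuCeiling {
    param(
        [int]$Cores,
        [int]$VCpusPerCore = 25   # assumption: value from the ESXi 5.1 maximums
    )
    # Per-host ceiling on running vCPUs
    return $Cores * $VCpusPerCore
}

function Get-VmsLeftPoweredOff {
    param(
        [int]$VmCount,
        [int]$VCpusPerVm,
        [int]$SurvivingHosts,
        [int]$CoresPerHost
    )
    $perHostCeiling = Get-VCpuCeiling -Cores $CoresPerHost
    # Whole VMs of this size that fit under one host's vCPU ceiling
    $fitsPerHost = [math]::Floor([double]$perHostCeiling / $VCpusPerVm)
    $fits = [int]($fitsPerHost * $SurvivingHosts)
    return [math]::Max(0, $VmCount - $fits)
}

# My test case: 201 dual-vCPU VMs, one 16-core host left standing
Get-VCpuCeiling -Cores 16
Get-VmsLeftPoweredOff -VmCount 201 -VCpusPerVm 2 -SurvivingHosts 1 -CoresPerHost 16
```

For my setup this returns 400 and 1, which matches the single VM that HA could not power on. The sketch only looks at the vCPU ceiling and ignores RAM, admission control and everything else HA takes into account.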

Where else does vCPU to pCPU limit manifest itself?

I also tried to see exactly what other scenarios might cause vCenter and ESXi to act up and misbehave. Here’s what I’ve got so far:

Scenario A: 5 host DRS cluster, with VMware HA enabled, percentage based failover, and admission control enabled. 201 VMs running, 2vCPUs each:

  • Put the hosts in maintenance mode one by one, until DRS stops migrating VMs for you and the “Enter maintenance mode” task is stuck at 2%. Note that no VMs are migrated from the host being evacuated, but this is due to admission control, not to hitting the vCPU ceiling.
  • Once you reach that point DRS will issue a fault saying that it has no more resources, and it will not initiate a single vMotion. The error I got was:

“Insufficient resources to satisfy configured failover level for vSphere HA.

Migrate TestVM007 from ESXi1.contoso.com to any host”

Scenario B: 5 Host DRS cluster, with VMware HA enabled, percentage based failover, and admission control disabled. 401 VMs running, 2vCPUs each:

  • Put the hosts in maintenance mode one by one, until DRS stops migrating VMs for you and the “Enter maintenance mode” task is stuck at 2%. Note that this time VMs are migrated from the host being evacuated, and the vMotions only stop once you hit the 400 vCPU limit.
  • The error you get is:

“The VM failed to resume on the destination during early power on.

Failed to power on VM.
Could not power on VM : Number of running VCPUs limit exceeded.
Max VCPUs limit reached: 400 (7781 worlds)
SharedArea: Unable to find ‘vmkCrossProfShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘nmiShared’ in SHARED_PER_VCPU_VMX area.
SharedArea: Unable to find ‘testSharedAreaPtr’ in SHARED_PER_VM_VMX area.”

So a similar message to the one you get when HA can’t power on a VM.

To wrap this up, I think there might be some corner cases where people might start to see this behaviour (I’m thinking in VDI environments mostly), and it would be very wise to take a serious look at the vCPU : pCPU ratio in failover scenarios to avoid hitting vSphere ESXi’s maximum values.
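One way to keep an eye on this is to compare the number of running vCPUs in the cluster with the vCPU ceiling that would be left after losing the biggest host. The sketch below is a hypothetical helper that works on plain objects; the property names (Name, Cores, RunningVCpus) are my own invention, and you would have to fill them from your inventory yourself (for example from the output of Get-VMHost and Get-VM in PowerCLI). The 25 vCPUs per core value is again assumed from the 5.1 maximums.

```powershell
# Hypothetical helper: checks whether the running vCPUs still fit
# under the vCPU ceiling once the largest host is gone.
function Test-VCpuFailoverHeadroom {
    param(
        [object[]]$HostRecords,    # objects with Name, Cores, RunningVCpus
        [int]$VCpusPerCore = 25    # assumption: ESXi 5.1 maximum
    )
    $totalRunning = ($HostRecords | Measure-Object -Property RunningVCpus -Sum).Sum
    $totalCeiling = ($HostRecords | Measure-Object -Property Cores -Sum).Sum * $VCpusPerCore
    $largestHost  = ($HostRecords | Measure-Object -Property Cores -Maximum).Maximum * $VCpusPerCore
    $ceilingAfterFailure = $totalCeiling - $largestHost

    [pscustomobject]@{
        RunningVCpus        = $totalRunning
        CeilingAfterFailure = $ceilingAfterFailure
        SurvivesHostFailure = ($totalRunning -le $ceilingAfterFailure)
    }
}

# The situation from this post: 2 hosts powered on, 201 dual-vCPU VMs
$records = @(
    [pscustomobject]@{ Name = 'ESXi1'; Cores = 16; RunningVCpus = 200 }
    [pscustomobject]@{ Name = 'ESXi2'; Cores = 16; RunningVCpus = 202 }
)
Test-VCpuFailoverHeadroom -HostRecords $records
```

Hosts that DPM has put in standby should be left out of the records, since they are exactly the ones that may not be back online by the time HA starts restarting VMs.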

vSphere Web Client 5.1 on Windows Server 2012 not starting up – Adobe Flash Player unavailable

I’ve been doing some work this past week setting up a new vSphere-based datacenter for a customer, and I’ve found that working with the latest and greatest versions of Windows and vCenter doesn’t always go without a hitch, as it should. For example, if you try to run the vSphere Client on Windows Server 2012, you will have to jump through quite a number of hoops to get it to run. But wait: the vSphere Client is deprecated as of version 5.0, and although I still think it is better than the Web Client as of June 2013, the way forward is the Web Client, and we will not be able to access 5.x features from the vSphere Client from now on. So we had better get with the program and work with the vSphere Web Client only.

The first step when you try to run the client is to connect directly to the vCenter web page, let’s say https://vcenter.contoso.com. If you are on a default Windows Server 2012 install, with IE Enhanced Security Configuration enabled, you will get a broken web page like the one below:

[Screenshot: vCenter-Broken-Page]

There’s an easy fix to this:

  • First disable IE Enhanced Security Configuration from the Server Manager GUI. Do not disable IE ESC if this is a server running production workloads unless you understand and accept the risks involved.
  • Then simply go to Internet Options > Security > Local Intranet > Sites, add your vCenter URL, click OK to close all the windows and then reload your page.

After that the web page opens up correctly, you click the “Log in to vSphere Web Client” link…And surprise, you get this message:

[Screenshot: vSphere-webClient-broken]

Isn’t this great… you now need Adobe Flash Player installed on the system. I should have known better: VMware’s Web Client is built using Adobe Flex technology, so you do need Flash Player to run it.

So, happily you click on the “Get Adobe Flash Player” link, and you are redirected to Adobe’s page that says… “Hey, you are running Windows 8, you should have Flash Player installed”… well, actually I’m running Windows Server 2012, so things might be a little more locked down than that. I followed the troubleshooting FAQ from Adobe’s website only to find out I didn’t have the plugin installed. As I said, this is Windows Server 2012, so not all the bells and whistles are turned on.

I did a little digging and, as it turns out, Windows Server 2012 has a Windows feature called “Desktop Experience”. Among the things this feature brings, this webpage lists “Adobe Flash Player” in the miscellaneous section. That webpage will also guide you through the GUI-based installation of Desktop Experience; it’s basically click click, next next until you get the damn thing installed.

If you want to do it via PowerShell, here’s the one-liner that will help you out. Whichever way you choose to install it, you will need a reboot once it is completed:

import-module ServerManager
Add-WindowsFeature -name Desktop-Experience,qWave `
-IncludeAllSubFeature -IncludeManagementTools

The qWave feature is Quality Windows Audio-Video Experience – this just in case you were wondering what that is doing there…someone on MSDN suggested to add this to Desktop Experience…so I did, you can leave it out.

After rebooting the box go to the Web Client page, and lo and behold, the webpage will load correctly. If this is your first time loading up the Web Client, after successful authentication you will get one error and a message from Adobe Flash.

The error is something like “Web client has encountered an error #Some Number“. On top of it comes Adobe Flash with the message that asks for permission that the webpage store more than 1MB of flash data on your hard drive. Click Accept, then on the error message choose to reload the client, and all will be well with the VMware world again, you can manage your infrastructure without other annoyances.

Hope this helps others out there, let me know if this worked for you or not

Automate Replacing of Certificates in vCenter 5.1

A few days ago, VMware launched a much-awaited tool called the SSL Certificate Automation Tool. This tool enables VMware administrators to automate the process of replacing expired or self-signed certificates on all components of the VMware vCenter management suite. As many of you know, this process, especially in the new 5.1 version, is a complete pain to implement: it is error prone and has so many steps to follow that you are bound to make a mistake. Compared to, say, VMware vCenter 4.x, version 5.1 has more “standalone” components that need to interact with users or with each other to provide users with data/visualizations, and they must do that over secure connections.

The components I’m talking about are (the ones highlighted in orange are new to version 5.x vs 4.x)

  • InventoryService
  • SSO
  • vCenter
  • WebClient
  • LogBrowser
  • UpdateManager
  • Orchestrator

It might not look like much, but it’s almost double the number of components, double the number of certificates and close to double the number of interactions between the components themselves. To make it “worse”, the workflow for replacing v4.x certificates (I have this post where I automated the whole process for ESXi 4.x, extendable to vCenter 4.x) is different from the workflow for v5.x ones, which is much more complicated. Here’s a document on how to do it manually. But if you value your time read on, there’s an easier way.

The relevant documentation for this automation process is available here from VMware. Essentially it is a 2-step process:

Step 1: Generate your certificates using OpenSSL and your internal Windows Enterprise root CA, as per this document: Generating certificates for use with the VMware SSL Certificate Automation Tool (2044696).

Step 2: Use the SSL Certificate Automation Tool to deploy the certificates, as described here.

Automate Certificate Generation for use with the VMware SSL Certificate Automation Tool

I should say this again, your main source of information should be VMware’s KB article. The information here is either echoing that, or supplementing it where needed. Also things are presented in a different order in my post, so the whole process can be automated, unlike the KB which assumes manual work.

Prerequisites

1. The account you will use in this step must:

  • Be a Local administrator on the computer where the script presented will run
  • Be able to enroll certificates for the Certificate Template that will be used from your PKI infrastructure.
  • Be a Local Administrator on the servers where the vCenter components are installed.

2. Any commands/scripts presented here should be run from an elevated prompt.

3. Name resolution must work correctly on the client where this script will run (all vCenter components must be resolvable via DNS).

4. You must use the OpenSSL version that VMware specifies, not newer, not older; that is OpenSSL 0.9.8 at the time of writing this post.

5. You need to have a certificate template configured according to VMware’s specifications, that means basically duplicating the default web server certificate of the Windows CA, with a few changes:

  • Go to Certificate Manager > Extensions > Key Usage > Allow encryption of user data
  • Also uncheck “allow private key to be exported” as the script will fail if the template allows this.

6. Your CA must have automatic approval activated so you can obtain the certificate using command line.

7. Obtain and save the certificates of your root CA and any intermediate CAs, as described in the KB. From the KB, as I understand it, you should name your root CA certificate root64.cer (x.509 format, base64 encoded). Any other intermediate or issuing CA certificates should be named root64-#.cer (x.509 format, base64 encoded), where # is a number starting from 1 to as many intermediates you have (where root64-1 is the intermediate closest to the issued certificate, and root64-N is the certificate closest to the root CA). Place all the certificates from the chain in a folder called “RootCA_chain“.

Note: Naming the files in a certain way is important. The KB does not do a good enough job of explaining how to build the .PEM file (either that, or I’m not doing a good enough job of understanding it), so make sure your CA chain certificates are numbered like this. If you don’t do it this way, you will have to rework the piece of code that sorts the certificate files for building the .PEM file.

8. Make yourself a custom OpenSSL configuration file. The configuration file must include only these lines, no more, no less (it’s a copy-paste from VMware’s KB). Save this file as custom_openssl.cfg in the \bin directory where you installed OpenSSL.

[ req ]
default_bits = 2048
default_keyfile = rui.key
distinguished_name = req_distinguished_name
encrypt_key = no
prompt = no
string_mask = nombstr
req_extensions = v3_req

[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = DNS:ServerShortName, IP:ServerIPAddress, DNS:server.contoso.com #examples only

[ req_distinguished_name ] # change these settings for your environment
countryName = US
stateOrProvinceName = Change State
localityName = Change City
0.organizationName = Change Company Name
organizationalUnitName = ChangeMe
commonName = Changeserver.contoso.com

9. Build a .csv file like this example below:

Name,DomainName,OUName
CNTvcS,contoso.com,vCenterServer
CNTvcO,contoso.com,VMwareOrchestrator
CNTvcIS,contoso.com,vCenterInventoryService
CNTVUM,contoso.com,VMwareUpdateManager
CNTwc,contoso.com,vCenterWebClient
CNTwc,contoso.com,vCenterLogBrowser
CNTsso,contoso.com,vCenterSSO

As you can see, you need to specify a server name, a domain name, and an OU name. The OU name is the name of the vCenter component; the value will be written to the “organizationalUnitName” entry of the OpenSSL configuration file.

10. Create a folder (the script uses a static c:\temp\certs entry) where you will store all your certificates as they are generated. To this folder copy the RootCA_chain folder created above.

Just like with cooking you now have all the “ingredients” to generate your vCenter certificates.
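To illustrate how the CSV from step 9 drives the rest of the process, here is a minimal sketch that reads rows in that layout and derives the per-component working folder name (the Name-OUName combination the script uses). It is not taken from the script itself: the sample rows are inline so it can run anywhere, and building the FQDN as Name.DomainName is my assumption.

```powershell
# Sample rows in the same layout as the CSV from step 9
$csvText = @"
Name,DomainName,OUName
CNTvcS,contoso.com,vCenterServer
CNTsso,contoso.com,vCenterSSO
"@

# Split into lines first so this works regardless of line endings
$rows = ($csvText -split "`r?`n") | ConvertFrom-Csv

$plan = foreach ($row in $rows) {
    [pscustomobject]@{
        Fqdn       = "$($row.Name).$($row.DomainName)"    # assumption: Name.DomainName
        OUName     = $row.OUName                          # goes into organizationalUnitName
        FolderName = "$($row.Name)-$($row.OUName)"        # one folder per server-component pair
    }
}
$plan | Format-Table -AutoSize
```

With a real file you would replace the inline text with Import-Csv pointed at your own CSV.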

The Script

You can download the script (Generate_vCenter_Certs_v5.1) from my blog. My previous post on VMware certificates has a little more background information that I won’t repeat here.

Learning Points

Lines 16-50: The script takes a few parameters, and also does some error checking on them.

    • The location of the openSSL directory ($OpenSSLPath)
    • The CSV from step 9, with all the servers and components of vcenter ($vCenterHostsFile)
    • The name\display name combination for your issuing/root Certification authority ($CAMachineName_CAName).
    • The name of the template to be used when issuing the certificates ($TemplateName).

Lines 52-71: More variables are set, and the script checks whether you have the root CA files in the specified location (default: c:\temp\certs\RootCA_chain).

Also this is where you should go in the script and change the Country, Company, State, and City Name. These will replace the values in the custom_openssl.cfg file.

Lines 76-78: Create a folder based on the Name-OUName combination to store all files used in certificate generation (CSR, private key, certificate, copy of the root CA chain certificates).

Lines 81-94: Customize the fields from the custom_openSSL.cfg file to match the specific server-service combination, according to the CSV file.

Lines 96-114: We copy the template .cfg file to the working directory and replace each corresponding row with our values for name, city, state, etc. This is not the cleanest PowerShell for string manipulation, but it gets the job done.

Lines 116-131: This is where we generate the CSR to send to the CA. As per VMware’s instructions this is a 2 step process, first we generate the CSR, then we convert the private key to RSA format.
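As a rough sketch of that 2-step process (not the code from the script), the two OpenSSL invocations can be built as argument lists like this. The function name, the file names rui.csr, rui-orig.key and rui.key, and the paths in the example are assumptions on my part; only the openssl req and openssl rsa subcommands and their flags are standard OpenSSL.

```powershell
# Hypothetical helper: builds the argument lists for the two OpenSSL steps.
function Get-OpenSslCsrSteps {
    param(
        [string]$OpenSslExe,   # full path to openssl.exe
        [string]$ConfigFile,   # per-component copy of custom_openssl.cfg
        [string]$WorkDir       # folder for this server-component pair
    )
    $csr     = Join-Path $WorkDir 'rui.csr'
    $origKey = Join-Path $WorkDir 'rui-orig.key'
    $rsaKey  = Join-Path $WorkDir 'rui.key'

    # Step 1: new CSR plus an unencrypted private key
    $step1 = @('req', '-new', '-nodes', '-out', $csr, '-keyout', $origKey, '-config', $ConfigFile)
    # Step 2: rewrite the private key in RSA format
    $step2 = @('rsa', '-in', $origKey, '-out', $rsaKey)

    return @{ Exe = $OpenSslExe; Csr = $step1; Rsa = $step2 }
}

# Example paths only; adjust to your environment
$steps = Get-OpenSslCsrSteps -OpenSslExe 'C:\OpenSSL\bin\openssl.exe' `
    -ConfigFile 'C:\temp\certs\CNTvcS-vCenterServer\custom_openssl.cfg' `
    -WorkDir 'C:\temp\certs\CNTvcS-vCenterServer'

# To actually run the steps (left commented out on purpose):
# & $steps.Exe $steps.Csr
# & $steps.Exe $steps.Rsa
```

Passing the arguments as an array to the call operator avoids most of the quoting problems you get when gluing a command line together as one string.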

Lines 136-146: We use the Windows tool certreq to request a certificate and also retrieve the certificate. This is where being local admin in an elevated prompt comes into play, and also where automatic retrieval of certificates comes in handy.

Lines 151-159: We copy over the individual certificates of the Root/Intermediate/Issuing CAs to the working directory. Then we create a file called chain.pem, which will be the .PEM chained certificate file for the given server-component combination specified in the CSV.

Important!!! This is where reading the VMware KB wrong can cost you, and you won’t realize it until you run the SSL Automation Tool. I quote from the KB:

“Open the Root64.cer file in Notepad and paste the contents of the file into the chain.pem file right after the certificate section. Be sure that there is no whitespace in the file in between certificates.

Note: Complete this action for each intermediate certificate authority as well.”

This is where I initially made a mistake, and did it the wrong way.

  • I pasted the certificate in the .pem
  • I pasted the root after the certificate in the .pem
  • I pasted the intermediate/issuing CA certificates in the .pem after the root certificate

It is “do the same for each intermediate certificate” as in “sandwich your intermediate CAs certificates between the root and the certificate in descending order”, THE EXACT OPPOSITE of how Windows will display the certificate chain in the GUI. This is why I asked you in the first place to number your certificates in a specific way.

As a result, if you named your CA certificates like I explained above this code snippet will arrange them in the proper order and chain them in the .pem file correctly.

$chain_pem = "$rootdir\$($strHost.Name)-$($strHost.OUName)\chain.pem"
gc $crt | add-content $chain_pem
$roots = dir "$rootdir\$($strHost.Name)-$($strHost.OUName)" root*.cer | sort name -desc
$roots | % {gc $_.FullName | add-content $chain_pem }
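Since the KB insists that there be no whitespace between the certificates, a quick sanity check of the resulting chain.pem can save you a failed run of the automation tool. The helper below is a sketch of my own, not part of the script or of VMware’s tooling. It only counts certificate blocks and blank lines; it does not verify that the certificates are in the right order.

```powershell
# Hypothetical helper: basic structural check of a PEM chain given as lines.
function Test-PemChain {
    param([string[]]$Lines)
    $begin = @($Lines | Where-Object { $_ -match '^-----BEGIN CERTIFICATE-----\s*$' }).Count
    $end   = @($Lines | Where-Object { $_ -match '^-----END CERTIFICATE-----\s*$' }).Count
    $blank = @($Lines | Where-Object { $_ -match '^\s*$' }).Count

    [pscustomobject]@{
        Certificates = $begin              # number of certificate blocks found
        Balanced     = ($begin -eq $end)   # every BEGIN has a matching END
        BlankLines   = $blank              # should be 0 according to the KB
    }
}

# Placeholder content, just to show the shape of the input
$sample = @(
    '-----BEGIN CERTIFICATE-----', 'MIIB...placeholder-leaf', '-----END CERTIFICATE-----',
    '-----BEGIN CERTIFICATE-----', 'MIIB...placeholder-ca',   '-----END CERTIFICATE-----'
)
Test-PemChain -Lines $sample

# Against a real file (example path):
# Test-PemChain -Lines (Get-Content 'C:\temp\certs\CNTvcS-vCenterServer\chain.pem')
```

For a leaf certificate, one intermediate and a root you would expect Certificates to be 3, Balanced to be True and BlankLines to be 0.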

That wraps it up, needless to say, test this script before you let it run loose on your environment. Since the script has a lot of “Read-Host” in it, it is designed to run with pauses, to give you a chance to review the output, and cancel if there is a problem. After all you are running commands against your Certificate authority, so handle with care.

Once you have all the files you should follow step 2, and actually use the SSL automation tool.

Replace vCenter Certificates using the SSL Automation Tool

For reference here is VMware’s KB if you missed it above for using this tool. I don’t have much to say about this, other than stick to the document, but here are a few tips that will make your life easier:

  • If you have multiple servers running each service, it is best that you copy ALL the folders generated by the script (for each server-service combination) to each of the individual servers, and you can freely dispose of them once you successfully replaced the certificates.
  • Before you go ahead and copy over the Automation Tool, take the time to modify the SSL_Environment.bat file pointing each variable to its respective file/value. Then copy the SSL_Environment file to each server. This way each time you run the tool to update a certificate it will always know where to pick “inter-component-trust” certificates, user prompts and so on.
  • When you fill in the “set sso_admin_user=” variable, put it in the format “user@domain”, otherwise you will get an invalid credentials error when you run some of the steps.

I know it is a long read, but it is not an easy topic and I hope it helped you in your environment. Let me know in the comments if there is a way to improve this or an easier way to do this whole process.

Automate vSphere Certificate Generation

A couple of weeks ago I was working on some audit internally, and I discovered we had some vSphere servers working with self generated certificates. While these servers were un-managed servers (esxi free license servers), they still needed certificates, as it is the case with such servers, they are “critical”, just not critical enough to warrant licenses :).

The “problem” with vSphere certificates is that they have to be generated using OpenSSL, and you cannot generate them using Windows tools like certreq, which could potentially have made this process much easier. There is also an issue with using the request files produced by OpenSSL: they do not have template information written in them, and the Windows CA cannot issue a certificate if it does not know which kind of certificate you want.

I trawled the internet for ways to automate this, and I didn’t find an end-to-end solution for certificate generation. I only found bits and pieces, and people were writing about how to do each certificate one by one. This didn’t sit well with me, and looking at the workflows I discovered there was really no reason not to have a script that does “it” automatically. I will define what “it” is with a short description of the steps required for generating a vSphere certificate:

  1. Generate CSR file and key file using OpenSSL
  2. Submit CSR file to certification authority
  3. Retrieve response from certification authority
  4. Rename the certificate file and key file and upload them to the vSphere host

Some notes regarding the setup in which this would work:

  • I used PowerShell to automate this, so this won’t work on other platforms.
  • I used a Windows 2008 R2 PKI CA with a “Web Server” template.
  • The CA also had automatic approval for this type of certificate (which made automating the response retrieval easier).
  • The user running this script needs the right to request/issue the given certificate template, and should also be a local admin on the box where you run the script; otherwise you would have to modify the script to run some of the commands with “runas”.

The script

I used a preexisting script to get started, the one for certificate mass generation from valcolabs.com, found here.

What differs from the way they did it is that I’ve changed the way variables are passed for building the “config file”, and that each CSR has its own config file, specified on the command line. This will help you track your work better for troubleshooting purposes. Something that should be noted is that their script, and also mine, use a special OpenSSL config file, in the sense that the lines to be modified by the script are addressed by line number, not searched for in the file, so beware of making changes to the “custom_openssl.cfg” file. It would probably have been more elegant to search for the lines in the file, but I didn’t want to spend time getting it to work.
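For the curious, searching for the lines instead of counting them could look roughly like the sketch below. To be clear, this is not how my script (or the valcolabs one) does it; it is a hypothetical alternative, and the function name is made up.

```powershell
# Hypothetical alternative to numbered lines: replace a "key = value" line by key.
function Set-ConfigValue {
    param(
        [string[]]$Lines,
        [string]$Key,
        [string]$Value
    )
    # Anchor on the full key followed by '=', so 'organizationName'
    # does not match '0.organizationName' or 'organizationalUnitName'
    $pattern = '^\s*' + [regex]::Escape($Key) + '\s*='
    foreach ($line in $Lines) {
        if ($line -match $pattern) { "$Key = $Value" } else { $line }
    }
}

$cfg = @(
    '[ req_distinguished_name ]',
    'countryName = US',
    'organizationalUnitName = ChangeMe',
    'commonName = Changeserver.contoso.com'
)
$cfg = Set-ConfigValue -Lines $cfg -Key 'commonName' -Value 'esx01.contoso.com'
$cfg
```

The upside is that adding or removing a line in custom_openssl.cfg no longer breaks the script; the downside is that a key that appears in more than one section would be replaced everywhere.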

The download link for the script I built is this one: Generate-vSphere-Cert. Below you will find some explanations of how it works.

Learning points

The script takes some parameters as input (get some of them wrong and your script might not work as intended or quit)

a) vSphereHostFile – is a CSV file that must contain the host name and domain name in 2 separate columns.

b) CAMachineName_CAName is the name of your CA in the format (hostname\display name)

c) TemplateName is the name of the certificate template you want to use for certificate generation, as defined on your CA

Lines 32 – 44 you should change the variables there to match your requirements (different paths, different location, country, email, company, etc). There is room for improvement here, you can include this info in the csv file, useful for creating certificates for multiple companies, with different contact information.

Lines 49 – 73 – build out a folder structure, one folder per host, where all the host’s files will be stored. This part also builds the CN and the SANs (Subject Alternative Names); you may wish to customize what you add here. I added the short name and the FQDN, and I left out the IP address as that can change more easily than the name.

Lines 80-97 – create a temporary file from the original OpenSSL config file, containing the parameters we have set up until now (this piece of code uses numbered lines, so if you make changes to the original file, change the line numbers here).

Lines 99-104 – build out the file paths to generate a CSR with OpenSSL. The command I used is slightly different from the ones on the internet; I needed a specific RSA key length, so I used:
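As an illustration of the SAN part, a value in the format that the subjectAltName line of the OpenSSL config expects could be assembled like this. It is a simplified sketch with a made-up function name, not the code from the script; the optional IP address is there only to show where it would go if you decide you want it.

```powershell
# Hypothetical helper: builds the subjectAltName value for one host.
function Get-SubjectAltName {
    param(
        [string]$HostName,
        [string]$DomainName,
        [string]$IpAddress = ''   # optional; I leave it out for my hosts
    )
    $fqdn = "${HostName}.${DomainName}"
    $san  = "DNS:$HostName, DNS:$fqdn"
    if ($IpAddress) { $san = "$san, IP:$IpAddress" }
    return $san
}

Get-SubjectAltName -HostName 'esx01' -DomainName 'contoso.com'
```

For the example host this gives DNS:esx01, DNS:esx01.contoso.com.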

"$openssldir\openssl.exe req -newkey rsa:2048 -out $csr -keyout $key -config $config"

Lines 109-114 – build paths for files to send/receive to/from the Windows CA. I also used something “unusual” (as in, not your first page results on google search) which is specifying the CAName and Template name.

The CA name is needed so you do not get a prompt each time certreq is invoked.

The certificate template is specified using the attrib parameter, the missing piece of my “how to automate” CSR submitting, see below:

$ConfigString = """$CAMachineName_CAName"""
$attrib = "CertificateTemplate:$TemplateName"
$issuecerts_cmd = "certreq -submit -attrib $attrib -config $ConfigString $csr $crt $p7b $rsp "
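If the triple quotes make your eyes hurt, the same certreq call can also be expressed as an argument array, which sidesteps the quoting of the CA name. This is a hedged sketch and not what the script does; the function name, CA name, template name and file names below are all example values.

```powershell
# Hypothetical helper: same certreq -submit call, built as an argument array.
function Get-CertreqSubmitArgs {
    param(
        [string]$CAConfig,       # 'hostname\CA display name'
        [string]$TemplateName,
        [string]$Csr,
        [string]$Crt,
        [string]$P7b,
        [string]$Rsp
    )
    @('-submit',
      '-attrib', "CertificateTemplate:$TemplateName",
      '-config', $CAConfig,
      $Csr, $Crt, $P7b, $Rsp)
}

$submitArgs = Get-CertreqSubmitArgs -CAConfig 'ca01\Contoso Issuing CA' `
    -TemplateName 'WebServer' -Csr 'rui.csr' -Crt 'rui.crt' -P7b 'rui.p7b' -Rsp 'rui.rsp'

# To actually submit the request (left commented out on purpose):
# & certreq.exe $submitArgs
```

Because each element is passed as its own argument, a CA display name with spaces in it needs no extra escaping.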

Lines 117-122 – unless you use this script for automating the creation of vCenter certificates, you can comment these lines out. They generate a PFX certificate, which is required with vCenter. PFX certificates are not required for vSphere host certificates.

The next step to automation would be to upload these files to your vSphere host. I used this script here and changed some paths to suit my folder structure. You can also use SCP or other methods to upload the file. After the files are uploaded you need to reboot the host for the certificates to take effect.

As always with these scripts, do your best to try them in a test environment before unleashing them into production. You are dealing with Certification Authorities and your vSphere hosts. Failure to upload a correct certificate to the hosts will result in you not being able to connect with the vSphere Client, and having to go to the console (NOT SSH) and regenerate the self-signed certificate.

I hope this was a useful read, comments and critique are open, as always.

Tracking vCenter VM and DB

It has been a while since I managed to do some writing on my blog, mostly because I’ve been busy with other Real Life events, and general lack of time. But now I’m here to share something that has been sitting in my drafts folder for a while. This one is about virtualization.

2010 and 2011 were virtualization years for me, I worked on several projects in design, implementation, and I learned so much, that looking back I really get a feeling of accomplishment.
I’ve also been a little “cutting edge”, non conservative with my designs some would say. I guess practice what you preach kind of stuck with me and I made it my mission to build reliable, self contained VMware environments, as much as possible.

As part of the design process, you always have to think about your management software

  • Where do you put the pieces of software that help you manage the environment?
  • How do you ensure availability and SLA for these components to allow you to recover from failures?

The answer to the first question can be:

Option A: In a management cluster, dedicated to management software for the virtualization stack

  • The advantage is you always know where the VMs are, if you have a failure there are 2 servers they can be on.
  • The disadvantage is you dedicate two physical boxes for this purpose, which can have a maximum utilization of around 40% for failover reasons.

Option B: Next to production machines

  • The advantage is you don’t have to setup a management cluster, and you optimize resource utilization in your datacenter.
  • The disadvantage is that you lose “determinism”, the security of “I know on what server vCenter is sitting, so I don’t have to look for it” if I get a cluster failure or worse.

Well I’ve come up with two “tricks” that tackle the drawbacks of the second option, not knowing where your management servers are, making it a preferred choice if your environment does not warrant a dedicated management cluster just for that.

#1 Track the movement of vCenter and vCenter DB using vCenter Alarms

This one is a really easy way to keep track of your vCenter components. It works best combined with the second trick you will see below, mainly because it does not cover all scenarios but the advantage of this method is that the information is provided in real time.

What I am proposing is that you create an alarm in vCenter, that monitors for events that change the VMhost of your vCenter VM. These events are:

  1. VM is being migrated (manually)
  2. VM is being migrated by DRS
  3. VM is being restarted by HA on another host

The third trigger will be hit and miss. It stands to reason that if vCenter is not up to send the mail because it is being restarted, you may or may not get the email, or you will get it after the fact. Nevertheless, it is good to have it there.

Below are the screenshots of how the alarm would look:

On the advanced field put this condition in:

Then add some notification address or whatever you prefer

Save your alarm, and then try to migrate vCenter and see what happens. You should do this for the vCenter DB server as well, and for any components whose location you feel you should know for troubleshooting purposes (VUM, Nexus 1000v Supervisor Modules, Management Appliances).

#2 Check the vSphere host where vCenter is running using a scheduled script

Another way to check where your vCenter components live is a scheduled PowerCLI script that runs once a day and sends you an email telling you where the vCenter VM and the vCenter database are sitting (on which vSphere host).

This script assumes the following:

  1. vCenter VM name in inventory = vCenter VM hostname
  2. vCenter is using separate database, if you don’t care about that, you can remove the references to the DB.
  3. vCenter database name in inventory = vCenter database hostname, or at least a CNAME with this name (e.g. RO-vcenter > RO-vCenter-DB name, and alias in DNS)

You can customize this by entering a CSV file of the names of the vcenter instances and their respective databases.

#version 0.1
#initial release

Add-PSSnapin Vmware.VimAutomation.Core -ErrorAction:SilentlyContinue
Set-PowerCLIConfiguration -DefaultVIServerMode multiple -Confirm:$false
#Write-Host -ForegroundColor Yellow "This script Generates a report detailing which host has the vCenter VM and vCenter DB VM`
#If you wish to cancel Press Ctrl+C,otherwise press Enter"
#Read-Host

#using fqdn because certificates are issued using a FQDN
$vCenter = ('vcenter','vcenter2','vcenter3')

If ($global:DefaultVIServers -ne $null) {
	DisConnect-VIServer * -Force -Confirm:$false }
$vCenter | % {Connect-VIServer $_ -NotDefault:$false}

$Report = @()
$vCenters = $global:DefaultVIServers | % {
	$row = "" | select vCenterInstance,FrontendVMHost,DBVMName,DBVMHost
	$row.vCenterInstance = $_.Name
	$row.FrontendVMHost = (get-vm -Name $_.Name.Split(".")[0] -server $_.Name).VMHost
	#db is hostname + db
	$dbvm = "$($_.Name.Split(".")[0])DB"
	$DBVMName = ([System.Net.Dns]::GetHostByName($dbvm)).HostName.Split(".")[0]
	$row.DBVMName = $DBVMName
	$row.DBVMHost = (get-vm -Name $DBVMName* -server $_.Name).VMHost
	$Report += $row
}

$FileDate = get-date -Uformat "%Y%m%d-%H%M%S"
$Path = "c:\temp\vsphere\"
$File = "$FileDate-vCenter-InfraLocation.csv"
$Report | Export-Csv -NoTypeInformation -UseCulture "$Path$File"

$encoding = [System.Text.Encoding]::UTF8
#I build the table rows by hand in this convoluted way because the object cannot be serialized
$ReportBody = $Report | % { "<tr><td>$($_.vCenterInstance)</td><td>$($_.FrontendVMHost)</td><td>$($_.DBVMName)</td><td>$($_.DBVMHost)</td></tr>" }
$Body = "<div>I'm the PowerCLI Magic Script. This is the list of your vCenter instances and their locations in the Infrastructure.<br>
If you ever lose track of them, this email is the reminder. The latest update is from $FileDate<br>
Below is the detailed information about each Instance:<br>
<table border=`"1`">
<tbody>
<tr>
<td>vCenterInstance</td>
<td>FrontendVMHost</td>
<td>DBVMName</td>
<td>DBVMHost</td>
</tr>
$ReportBody
</tbody>
</table>
</div>"

Send-MailMessage -Smtpserver smtpserver -From 'admin_vmware@foo.com' -To 'vSphereAdministrators@foo.com' -Body $Body -Bodyashtml -Encoding $encoding -Subject "vCenter Instances List" -Attachments $Path$File

Learning points:

Line 11: This is where you define your vCenter server names (the $vCenter array), if you have more than one. I had 3, for example.

The two get-vm calls inside the loop (the lines that fill $row.FrontendVMHost and $row.DBVMHost): This is where you find out which host each VM resides on.

The rest of the script is just to cycle through all vCenter instances and create an email that it sends to a given email address.

Perhaps to some people this may seem unnecessary, as they may not have faced major outages, and to others these monitoring tricks may not seem enough to cover every ‘outage’ situation. Still, I find this approach is no worse than having a separate management cluster, with the added benefit of not having to deal with yet another cluster.

C&C as always is welcome

Things to keep in mind about Snapshots

Some time ago I set up a VMware environment, and I was involved in the sizing and design decisions. I did a lot of reading about how to size the VMFS datastores: how many VMDKs per datastore and how to calculate the appropriate size. Everyone on the web mentioned you have to take snapshot size into account, so I did (for a good read on snapshots try this post by VMware). I split the VMFS datastores according to roles (Logs, Database, OS, swap) and included a snapshot allowance for each datastore.

Fast forward 3 months and a couple of snapshotted VMs later, I ran a usage report on the datastores and noticed something I didn’t expect. I used the VMware vCenter reporting features (which are pretty sweet, by the way) to get the disk usage. I was amazed that the report said zero space was used for snapshots (although those VMs had snapshots, and VMDKs on those datastores). I cycled through the datastores and found where the snapshots were stored: on the datastore that held the OS disk, the same one where the config file was located. Then I looked it up in the documentation and found this:

  • The default location for snapshots of Virtual Machines is their Working Directory.

  • The default Working Directory is the datastore where the Configuration File (.vmx) of the VM is stored.

Wow, that was unexpected, for me at least, since it meant I had undersized my OS datastore a little. So this question haunted me: how do you change this setting in dire situations, when you want to avoid VMs crashing because your datastore is out of space? I did more research and discovered this:

  • The default Working Directory can be changed by adding or changing this line in the VMX file: workingDir = "path/path/"

  • Doing so will ALSO change the location of your .vswp file (the swap file created by vSphere) to the location specified by workingDir

According to this article you can also specify the location of the swap file within the VMX by adding this line: sched.swap.dir = "/vmfs/volumes/Volume1/VM/". However, this setting, or adding workingDir to the configuration file, will take precedence over the “Store Virtual Machine Swap file in location specified by the Host” option (on the logic that VM settings take precedence over host settings, unless defaults are used for the VM; please correct me if I’m wrong).
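To make the VMX change above repeatable, here is a minimal shell sketch that sets one key in a .vmx file. The function name set_vmx_option and the scratch file are my own invention for illustration, and I am assuming you would only run something like this against a powered-off VM with a backup of its .vmx at hand. The key names and the example path come from the text above.

```shell
#!/bin/sh
# set_vmx_option VMX_FILE KEY VALUE   (hypothetical helper, not a VMware tool)
# Replaces an existing 'KEY = "..."' line in a .vmx file, or appends one.
set_vmx_option() {
    vmx_file=$1
    key=$2
    value=$3
    tmp_file="$vmx_file.new.$$"
    # Copy every line except an existing assignment to KEY (literal match,
    # so the dots in sched.swap.dir are not treated as wildcards).
    awk -v key="$key" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)
            if (index(line, key) == 1 && substr(line, length(key) + 1) ~ /^[ \t]*=/) next
            print
        }
    ' "$vmx_file" > "$tmp_file" || return 1
    printf '%s = "%s"\n' "$key" "$value" >> "$tmp_file"
    # Write back in place so the original file keeps its permissions.
    cat "$tmp_file" > "$vmx_file" && rm -f "$tmp_file"
}

# Demo on a scratch file, not on a real VM.
demo_vmx=$(mktemp)
printf 'displayName = "TestVM"\nworkingDir = "."\n' > "$demo_vmx"
set_vmx_option "$demo_vmx" workingDir "/vmfs/volumes/Volume1/VM/"
cat "$demo_vmx"
rm -f "$demo_vmx"
```

Calling it a second time with sched.swap.dir as the key would keep the swap file location independent of the working directory, at the cost of one more per-VM setting to document.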

The consequence of this is that you no longer define swap file storage at host level (which was pretty easy, because you have far fewer hosts); instead you define it at VM level (and you may have hundreds of VMs). Taking this further, you’d probably have to use PowerShell to set this easily... and have it thoroughly documented for each VM.

You can see how changing the defaults for snapshots turns from something relatively benign into quite an administrative burden. Then you have to balance that administrative burden against resizing datastores.

Datastore sizing – revisited

Now, with this information, the way datastores are sized gets a little more complex. Before I knew about this, I read what really smart and knowledgeable people had to say about datastore sizing, and it went a little like this:

(Avg VM * #VMs ) * (100% + (Snapshot Allowance) + 10% Reserve)

Snapshot allowance was 10-20%.
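As a quick worked example of the formula above, here is a minimal shell sketch. The function name and the sample figures (20 VMs averaging 40 GB, a 15% snapshot allowance) are made up for illustration. Sizes are whole GB and the integer arithmetic rounds down.

```shell
#!/bin/sh
# classic_datastore_gb AVG_VM_GB VM_COUNT SNAPSHOT_PCT RESERVE_PCT
# Implements (Avg VM * #VMs) * (100% + snapshot allowance + reserve).
classic_datastore_gb() {
    echo $(( $1 * $2 * (100 + $3 + $4) / 100 ))
}

# Hypothetical figures: 20 VMs averaging 40 GB, 15% snapshots, 10% reserve.
classic_datastore_gb 40 20 15 10
```

With these numbers the 800 GB of raw VM data grows to a 1000 GB datastore.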

Now that is great for datastores that hold the entire VM. If you want to separate I/O, you have to create multiple datastores, and each VM can have more than one VMDK, so the math above applies to a single type of datastore (e.g. a datastore for DB VMDKs):

(Avg DB VMDK * #VMDKs ) * (100% + (Snapshot Allowance) + 10%Reserve)


In light of my recent discovery about snapshots, the math changes yet again. For a datastore that holds no snapshots, the sizing would be:

(AvgVMDK * #VMDKs) * (100% + 10% Reserve)

Now, assuming snapshots end up where you store your OS VMDK (the default, since that is where the configuration file lives), the sizing of this datastore changes as follows:

(AvgVMDK * #VMDKs) * (100% + 10% Reserve) + (Other Datastores [db,app,log,swap]) + Snapshot Allowance

Where Snapshot Allowance is now sized differently:

Snapshot Allowance = (OS Datastore Size + DB/App/Log/Swap Datastore Size) * (10-20%)
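To see what the revised math does to the numbers, here is a small shell sketch under stated assumptions. I am reading the formulas as: every datastore gets its VMDKs plus the reserve, and the OS datastore additionally carries a snapshot allowance calculated across all datastores (sized before the allowance is added). That reading, the function names and the figures (20 VMs, 40 GB OS disks, 100 GB DB disks, 15% allowance) are mine, not from any VMware document.

```shell
#!/bin/sh
# Sizes are whole GB, percentages are whole numbers; integer math rounds down.

# with_reserve_gb AVG_VMDK_GB VMDK_COUNT RESERVE_PCT
# (Avg VMDK * #VMDKs) * (100% + reserve): a datastore that holds no snapshots.
with_reserve_gb() {
    echo $(( $1 * $2 * (100 + $3) / 100 ))
}

# snapshot_allowance_gb OS_GB OTHER_GB SNAPSHOT_PCT
# (OS datastore + DB/App/Log/Swap datastores) * snapshot percentage.
# ASSUMPTION: both sizes are taken before any snapshot allowance is added.
snapshot_allowance_gb() {
    echo $(( ($1 + $2) * $3 / 100 ))
}

# Hypothetical figures: 20 VMs, each with a 40 GB OS disk and a 100 GB DB disk.
os_gb=$(with_reserve_gb 40 20 10)
db_gb=$(with_reserve_gb 100 20 10)
snap_gb=$(snapshot_allowance_gb "$os_gb" "$db_gb" 15)
echo "DB datastore: ${db_gb} GB"
echo "OS datastore incl. snapshots: $(( os_gb + snap_gb )) GB"
```

The point of the exercise: the snapshot allowance on the OS datastore now scales with the size of all the other datastores, so it can end up being a large fraction of the OS datastore itself.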


In essence, if no VMware snapshot defaults are changed and snapshots will be used (they are found in a lot of processes within VMware: backup solutions, VDI, development, patch management of guests), the space occupied by these snapshots is important, and so is the datastore they consume it from. Whatever the design, it must include some form of “snapshot space management”, to use some fancy words for it. Any comments or different angles on this are welcome, as usual.

Change vSphere Service Console IP

Now I get a chance to write an article I’ve been meaning to write, about something I ran into while working with vSphere 4.1. Initially I called it a “bug” (I may have said so on Twitter), but now I’m starting to think “it serves me right”, in a way. It is about what happens when you want to change the vSphere Service Console IP of a host that is already in a cluster. Here’s the history:

  • 3 hosts configured in a cluster. After some weeks it was decided that we had to change the IPs and VLAN, to make room for some other VLANs that needed room to grow.
  • No problem: get the new IPs, talk to the network guys to trunk the ports on the physical hosts and reconfigure the switches to make sure that traffic can reach our vCenter Server.
  • Google for how to change the Service Console IP... 5 minutes later, Google for how to also change the VLAN ID of the Service Console. So for changing the IP and VLAN these are 2 good places to start.
    • Place host in maintenance mode (while still in the cluster; we chose not to remove it or delete the cluster since we had resource pools configured)
    • Make all the changes (IP, gateway, hosts file)
    • Test settings (ping, nslookup)
    • Once all hosts are reconfigured properly, update each host’s hosts file with the new IP/hostname entries for the other nodes in the cluster.
  • Obviously, when I took each host out of maintenance mode the cluster would not work, which was to be expected.
  • Now... let’s reconfigure the vSphere cluster, since it was not a proper cluster anymore. Reconfigure cluster finishes “Successfully” (the task took longer than we expected), everything seems great.

Fast forward a few days: I do a routine configuration check of the systems and our cluster starts to throw “HA agent misconfigured” errors. I discover that, although I had updated the hosts file on each vSphere host, the OLD IP addresses were still there; there was a mix of the old settings and the new settings. I asked my colleagues if anyone had made any changes, but no one had done anything. After some troubleshooting (which included a file level search on the vSphere host for files where that IP might be listed) I concluded this:

“When you reconfigure the IP address of a host that is in a cluster, and then you Reconfigure the cluster for HA, somewhere (maybe the vCenter DB) information about the IPs of the hosts is stored as they were when the hosts initially joined the cluster! Therefore any cluster reconfiguration of hosts with new IPs will get a mix of old and new IPs in the /etc/hosts file, and possibly Reconfigure for HA errors.”

To fix this, obviously, we disabled HA, disbanded the cluster and recreated it.
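The mix of old and new addresses described above is easy to check for once you know to look. Here is a minimal shell sketch that scans a hosts file for addresses that should no longer be there. The function name, the addresses and the host names are hypothetical; on a real host you would point it at /etc/hosts and pass your own retired IPs.

```shell
#!/bin/sh
# find_stale_entries HOSTS_FILE OLD_IP...   (hypothetical helper)
# Prints every line (with its line number) that still mentions an old IP.
# Returns 0 if at least one stale entry was found, 1 otherwise.
find_stale_entries() {
    hosts_file=$1
    shift
    stale=1
    for old_ip in "$@"; do
        # -F: treat dots literally; -w: whole words only, so 10.0.0.1
        # does not match 10.0.0.11
        if grep -n -F -w -- "$old_ip" "$hosts_file"; then
            stale=0
        fi
    done
    return "$stale"
}

# Demo on a scratch copy; addresses and host names are made up.
demo_hosts=$(mktemp)
printf '10.10.20.11 esx1.lab.local esx1\n192.168.1.12 esx2.lab.local esx2\n' > "$demo_hosts"
find_stale_entries "$demo_hosts" 192.168.1.11 192.168.1.12
rm -f "$demo_hosts"
```

Running a check like this on every host after a "Reconfigure for HA" would have shown the leftover entries days earlier than my routine configuration check did.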

The right way to change vSphere Service Console IP

In light of these issues these are the steps to properly change the IP address of a host:

  1. If host is in a cluster, remove it from the cluster.
  2. Put host in maintenance mode.
  3. Disconnect from vCenter
  4. Login to physical (or remote KVM) console and change IP settings. Change the gateway by editing /etc/sysconfig/network so that the GATEWAY line is pointing to your new gateway. Change the IP using these commands.
esxcfg-vswif -i <new IP > -n <new Mask> vswif0
esxcfg-vswitch vSwitch0 -p <port group Name> -v <VLAN ID>
esxcfg-vswif -s vswif0
esxcfg-vswif -e vswif0

  5. Ping your reconfigured host to see that all is working properly.
  6. Rejoin the host to the cluster and reconfigure for HA (let HA reconfigure your hosts file instead of changing it manually). Enjoy not having to worry about cluster issues 🙂

A colleague of mine also wrote this “interactive script” that prompts you for the required information and changes all these settings (I’m a bit LSI, Linux Shell Impaired, myself).

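#!/bin/sh
# Prompt for the new Service Console network settings
echo "New IP :"
read new_ip
echo "New Mask:"
read new_mask
echo "New Gw:"
read new_gw
echo "New vlan:"
read new_vlan

# Point the GATEWAY line at the new gateway
sed -i "s/^GATEWAY=.*/GATEWAY=$new_gw/" /etc/sysconfig/network

# Assign the new IP and mask to the Service Console interface
esxcfg-vswif -i "$new_ip" -n "$new_mask" vswif0
# Tag the "Service Console" port group with the new VLAN ID
esxcfg-vswitch vSwitch0 -p "Service Console" -v "$new_vlan"

# Disable and re-enable the interface so the new settings take effect
esxcfg-vswif -s vswif0
esxcfg-vswif -e vswif0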
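Since a typo in any of those prompts can cut you off from the host, it may be worth validating the input before applying it. The sketch below is my own addition, not part of my colleague’s script: the function names are hypothetical, the 0 to 4095 VLAN range is what I understand esxcfg-vswitch to accept, and the checks only confirm the format (they do not verify that a mask is contiguous or that the gateway is reachable).

```shell
#!/bin/sh
# is_ipv4 ADDRESS: succeeds only for four dot-separated numbers from 0 to 255.
is_ipv4() {
    case $1 in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
    esac
    # Split on dots, then restore the caller's IFS.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet in "$@"; do
        [ ${#octet} -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# is_vlan_id ID: succeeds for a whole number from 0 to 4095 (assumed range).
is_vlan_id() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ ${#1} -le 4 ] && [ "$1" -le 4095 ]
}

# Example: refuse to continue on bad input (values here are hypothetical).
new_ip="10.10.20.11"
new_vlan="120"
if is_ipv4 "$new_ip" && is_vlan_id "$new_vlan"; then
    echo "input looks sane"
else
    echo "refusing to apply: check IP and VLAN" >&2
fi
```

The same is_ipv4 check can be reused for the mask and the gateway, since all three are entered as dotted quads.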

I hope you enjoyed the read, and remember:

If you need to change the IP of a host in a cluster... remove it from the cluster first, and save yourself some time and brain cells. Comments and critique are welcome, as usual.

Fix “Transaction log for database ‘VIM_VCDB’ is full” errors

This is one of those “note to self” posts, written in case this hits me again, so I don’t go wandering the Internet all over again. I have a small VMware lab at home, and a few days ago I was confronted with an issue related to vCenter, the management application for VMware’s hypervisor. I tried to connect to my vCenter installation: connection refused... OK, I’ve seen this before, probably the service is not up. Initially I thought there had been a power outage at my home (they kinda happen) and the vCenter service had hung on startup (this also kinda happens).

No problem, I can fix it! Open the services snap-in remotely against the vCenter machine, start the service, the service starts, close the snap-in. Start the vSphere Client, the client works, play around with it a bit, close the client.

Time goes by, and I need to log back into the system for some work. Connection refused... now this is rich: no power outage, so why is the service crashing? OK, it’s just life treating me badly and VMware acting up (not that it usually does). Open services, start the service, log in to vCenter again, do some work, and a few minutes later the client disconnects... and reconnecting does not work.

Ok, troubleshooting mode now; open Splunk, sort by events from that host, anything that is not information from the system log. And there it was:

Error[VdbODBCError] (-1) “ODBC error: (42000) – [Microsoft][SQL Native Client][SQL Server]The transaction log for database ‘VIM_VCDB’ is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases” is returned when executing SQL statement “UPDATE VPX_VM WITH (ROWLOCK) SET SUSPEND_TIME = ? , BOOT_TIME = ? , SUSPEND_INTERVAL = ? , QUESTION_INFO = ? , MEMORY_OVERHEAD = ? , TOOLS_MOUNTED = ? , MKS_CONNECTIONS = ? , FAULT_TOLERANCE_STATE = ? , RECORD_REPLAY_STATE = ? WHERE ID = ?”

Ouch, something really broke. I immediately made a quick check to see if I had disk space left, which I had, so this was not going to be that easy.

In that case: to the Internets! I found this thread on the VMware communities. I won’t bore you any more with the storyline, I’ll just get to fixing the issue.

Note: this is probably an extremely trivial topic that does not happen on production databases with vigilant DBAs. However, this is a homelab and I’m not a DBA 🙂 and if you are reading this, you probably aren’t one either.

The Fix

To fix this you will need SQL Server Management Studio Express installed either on the server holding the databases or on a management machine (in which case you had better know how to give yourself remote access to the vCenter database server; I couldn’t, so I installed it locally on the affected machine). You’ll also need a local administrator account to run the Management Studio under.

Once in the management studio, select the VIM_VCDB database, right click properties:

On the left side of the new window select the Files section:

So, there are 2 files: the database and the log. The error we got mentioned the log file. A quick look at my setup revealed I had reserved only 460MB for logs (screenshot taken after the fix). Scroll to the right and find the “…” button, which will let you configure the maximum size of the log file.

Now change this to a bigger value. For a home lab 2GB is actually quite a lot, but I wanted to be safe. Close all windows by pressing OK, then close the Management Studio.

After this restart VMware VirtualCenter Server service and watch your vCenter go :).
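If you prefer a query window over clicking through dialogs, the same check and fix can be expressed in T-SQL. The shell sketch below only prints the statements so you can review them first; it does not touch any server. The logical log file name VIM_VCDB_log is an assumption on my part (the second query shows the real one), and the 2GB value mirrors what I chose above. The log_reuse_wait_desc column is the one the error message itself points to.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust to your environment.
DB_NAME="VIM_VCDB"             # database name from the error message
LOG_FILE_NAME="VIM_VCDB_log"   # ASSUMPTION: logical log file name, see query 2
NEW_MAX_SIZE="2GB"

# Prints the T-SQL; nothing is executed against a server here.
build_log_fix_sql() {
    cat <<EOF
-- 1. Ask SQL Server why the log space cannot be reused.
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'$DB_NAME';
GO
-- 2. List the database files; size and max_size are in 8 KB pages.
USE [$DB_NAME];
SELECT name, type_desc, size, max_size FROM sys.database_files;
GO
-- 3. Raise the maximum size of the log file.
ALTER DATABASE [$DB_NAME] MODIFY FILE (NAME = N'$LOG_FILE_NAME', MAXSIZE = $NEW_MAX_SIZE);
GO
EOF
}

build_log_fix_sql
# To run it, save the output to a file and paste it into Management Studio,
# or feed it to sqlcmd, e.g.: sqlcmd -S <your server> -E -i fix_log.sql
```

Raising the cap only buys room; whatever the first query reports is the thing to look into if the log keeps filling up.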

Now for a little investigation into why this happened. The vCenter database holds performance data, VM metadata and the like... but how could 8 VMs generate, in less than 2 months, enough performance data to fill the 460MB configured for the log file? Well, the answer lies in the vCenter Server Settings. Once I started browsing the menus I remembered that, just for testing, I had configured the statistics logging level to 4 (the highest) for each retention period, and then forgot to turn it back down. Lesson learned.

p.s. This is my first non-scripting post 🙂