Mike's Virtualization Blog

vCenter 5.1, Multisite SSO, Linked Mode, Custom SSL certificates, protected by vCenter HeartBeat. Part 1

Over the past couple of weeks I got to work with a client who generally would not have used PSO for a vCenter upgrade, but because of the new features introduced in 5.1, such as SSO, they brought me in.

The customer, like many, didn’t really understand SSO or the different options for deploying it. Many customers I talk to initially want SSO in HA mode, not understanding that this requires a load balancer in front of the SSO servers. While that option is perfectly viable, there are a couple of caveats. The admin service is only installed on the first SSO server, so when that server is down, services registered to SSO, like vCenter, the Inventory Service, and the Web Client, cannot start. They will keep running while the admin service is down, but if the server or service is restarted they will not come back up. Another caveat is that no administration of SSO, and no registration of additional services, can take place. So really, in this first release of SSO, the only thing that is highly available in HA mode is authentication.

Once this conversation takes place, customers ask how to protect SSO. The easiest option is the HA that’s built into vSphere. This is a good option if you’re only worried about host failures or BSODs (with VM monitoring enabled, HA can recover crashed guests). If you also want to protect against outages due to OS issues, vCenter Heartbeat provides that, and it also lets you perform maintenance and reboot VMs with only a small outage while the service moves to the secondary node.

I am, and have been for years, a big fan of vCenter Heartbeat. When a customer has a requirement to keep vCenter and related services available, whether because other components like vCD, View, or SRM are in the environment or because they simply want vCenter to always be available, I recommend they implement vCenter Heartbeat.

We were also deploying SRM the week after the vCenter upgrade, which meant we needed a second vCenter, as they had been running a single vCenter to manage both sites.

The customer also requires all certificates to be issued from their CA and does not allow any self-signed certificates. As I’ll detail in the following posts, this led to some interesting discoveries that I haven’t seen documented elsewhere.

The diagram below shows how I recommended this customer implement vSphere 5.1, and it’s the direction we went. In it you’ll see the following servers in each site:

  • SQL Server protected by vCenter Heartbeat
  • Single Sign On (SSO), configured in Multisite mode, protected by vCenter Heartbeat
  • vCenter and the Inventory Service protected by vCenter Heartbeat
  • vSphere Web Client protected by vCenter Heartbeat
  • Update Manager protected by vCenter Heartbeat (this server also hosts storage plug-ins such as NetApp’s VSC)

Architecture Image

Windows 2008 R2 Templates / Customization Specification / Local Administrator Password

I haven’t done much with templates in quite a while. Last week a client asked me to assist with creating a Windows 2008 R2 template. We installed the base OS and did some minor configuration, such as installing VMware Tools, enabling RDP, and disabling the firewall using netsh. Since this template would be used for various server types and we had no Internet access to patch it, we stopped there, shut down the server, and converted it to a template.
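
For reference, the RDP and firewall changes were made from an elevated command prompt with commands along these lines. This is just a sketch of what we ran; review it against your own security policy before baking it into a production template.

rem Disable the Windows Firewall for all profiles
netsh advfirewall set allprofiles state off

rem Enable Remote Desktop by clearing the deny flag in the registry
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server" /v fDenyTSConnections /t REG_DWORD /d 0 /f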

We then created a customization specification. The spec had an Administrator password set and had the server licensing set to per user; everything else was pretty straightforward.

To my surprise, when creating a new VM from this template with this specification we couldn’t log in: neither the Administrator password used in the template nor the one set in the specification worked.

When working with the client last week, the only thing we got to work was a blank Administrator password in both the template and the customization specification. Not something I was happy about or would recommend going to production with.

I came home, worked in my lab, and hit exactly the same issue. After testing many scenarios, the only thing that finally worked was enabling “Automatically logon as Administrator” in the customization specification.

2008R2 Template

After enabling this option, with a password set in the template and a different one in the customization specification, the password from the customization specification worked.

I’m sure others have run into this, but in my searches the only resolution I found was to use blank passwords, so I hope this helps someone who is seeing the same issue.

Issue with SSO and SQL Express

This past week I was working with a large telco on their first vSphere deployment. This particular deployment was on vSphere 5.1, and it was a lot of fun getting to use some of the new features in the real world, such as Auto Deploy Stateful Install and the enhancements in the distributed switch, among others.

That’s not to say we didn’t have any issues along the way, which is where this post comes in. This environment is for a particular set of apps that will be running across 18 VM’s, not a large number by any means, but the service this provider is selling is quite cool; you can read more about it here.

The provider wanted to keep all vCenter services in one VM and to use SQL Express. With 2 management and 10 resource nodes they are over the supported host limit for SQL Express (5 hosts and 50 VM’s), but VM-wise they are way under it, so we decided SQL Express would meet their needs for now; if they grow the environment they will move to full SQL at that point.

The issue we ran into: when installing SSO with SQL Express, the SSO installer sees the port SQL Express is using at the time of install and configures its connection string with that port. SQL Express installs with the dynamic ports option, so after the server is rebooted SQL can come up on a different port, SSO can no longer connect to its database, and that in turn means vCenter won’t start. In the imsSystem.log file you’ll see:

java.sql.SQLException: Network error IOException: Connection refused: connect

This is easily fixed by setting SQL to use a fixed port and reconfiguring the connection string for SSO. To set SQL Express to use a static port, open SQL Server Configuration Manager, expand SQL Server Network Configuration, click Protocols for VIM_SQLEXP, and in the right-hand pane double-click TCP/IP.

SQL TCP/IP

In the properties, delete the 0 from TCP Dynamic Ports and enter a port to use under TCP Port; the default SQL port of 1433 or any other available TCP port will work fine.

SQL Ports
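
Once you’ve restarted the SQL service (the next step), you can quickly confirm from a command prompt that SQL Express is listening on the static port you picked. This example assumes you chose 1433:

rem Confirm SQL Express is listening on the static port
netstat -ano | findstr :1433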

Once you’ve made those changes, restart the SQL service. Now we need to update SSO’s connection string and configuration. This is a two-step process; first, run:

c:\program files\VMware\Infrastructure\SSOServer\utils\ssocli configure-riat -a configure-db --database-host <database server> --database-port <port number> -m <master password>
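
As a concrete example, if the SSO database lives on a server named sso-sql01 (a made-up name for illustration) and you set SQL to port 1433, the command would look something like this:

rem Example values only; substitute your own database server, port, and master password
"c:\program files\VMware\Infrastructure\SSOServer\utils\ssocli" configure-riat -a configure-db --database-host sso-sql01 --database-port 1433 -m <master password>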

Next, open the file c:\program files\VMware\Infrastructure\SSOServer\webapps\lookupservice\WEB-INF\classes\config.properties and change the portNumber= line to the port you used for SQL.
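
For example, with SQL on port 1433 the relevant line in config.properties simply ends up as:

portNumber=1433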

SSO config.properties

Restart the SSO service and all services will be able to connect to SSO and start normally.

SRM with Hitachi Storage

I’m currently at a client who is moving their SRM-protected VM’s onto Hitachi VSP arrays. It’s been close to 6 years since I’ve done any work with Hitachi and replication, so I assumed it would have gotten better since then; unfortunately it hasn’t, it’s exactly the same as it was.

The storage admin, thankfully, is pretty sharp and already had replication (or, as Hitachi calls it, pairs) set up. We installed HORCM (Hitachi Online Remote Copy Manager), which contains the commands used by the SRA, on the SRM servers; this requires way more manual editing of configuration files than should be necessary. We then installed the Hitachi SRA on both servers.

The goal of this SRM implementation is to enable failover from site A to site B and to use the reprotect option to reverse the replication and enable failback to site A; currently site B is only a recovery site. Testing the recovery plans using Hitachi’s Copy-on-Write Snapshots is also required.

After creating the horcm configuration files, we threw a test VM onto one of the replicated datastores, started testing, and immediately ran into errors.

Looking through the SRM logs, it was obvious the horcm config files were our issue. After a couple of hours of troubleshooting we finally got everything sorted out and working for all of our requirements. Surprisingly (to me at least), there are really no solid “here’s how to configure SRM with Hitachi storage” guides or blog posts out there, and don’t even get me started on how bad Hitachi’s documentation is. So here’s the quick and dirty of what you need.

I’m assuming you already have vCenter and SRM installed in both sites, with the SRM servers paired, and that the storage admin has used the paircreate command to create the required pairs.

  • Install HORCM to the default location (C:\HORCM).
  • Add C:\HORCM\etc to the system’s PATH.
  • Install the Hitachi SRA; there are no options, just press Next a few times.
  • According to Hitachi’s documentation, two system (not user) environment variables need to be set on both SRM servers: SplitReplication=True and RMSRATMU=1. We found that setting RMSRATMU=1 was one of our issues, so we removed it.
  • Reboot the SRM servers.
  • The SRM server needs a command device. If the SRM servers are VM’s (which I would hope they are), you’ll need to create an RDM in physical compatibility mode and map it to the VM; let Windows write a signature to the device but don’t format it. I found conflicting information on the size it should be, ranging from 30-40 MB, so use 50 MB to be safe.
  • To issue commands, HORCM needs to be running as a service. Copy horcm_run.txt from C:\HORCM\tool to horcmX_run.txt, where X is the number for this instance; edit the copied file to set the HORCMINST variable, save and close the file, and run C:\HORCM\tool\svcexe /S=HORCMX /A=C:\HORCM\Tool\svcexe.exe, again where X is the instance number. This needs to be done on both SRM servers (example just below).
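
As a worked example of that last step, here is roughly what registering HORCM instance 10 (the instance number used in the samples later in this post) looks like; adjust the paths and instance number for your environment:

rem Copy the sample run file for this instance
copy C:\HORCM\tool\horcm_run.txt C:\HORCM\tool\horcm10_run.txt

rem Edit horcm10_run.txt and set HORCMINST=10 before registering the service

rem Register the HORCM10 Windows service with svcexe
C:\HORCM\tool\svcexe /S=HORCM10 /A=C:\HORCM\tool\svcexe.exe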

A note about HORCM instances: if you want to use Hitachi’s Copy-on-Write Snapshots, you need two HORCM instances on the recovery site, and the instance ID for the snapshots must be +1 of the instance for the replicated LUNs (LDEV’s, as Hitachi calls them). So if you used HORCM10 as the instance for the replicated LUNs, you must use HORCM11 for the snapshots or running test failovers will not work.
Once the services have been installed, you must create the horcmX.conf files and place them in C:\HORCM\etc; again, X is the HORCM instance ID.

Creating these files correctly is the hardest part of this config. Below are samples that should work for most environments with the same requirements we had.
Assuming I used HORCM instance ID 10 on the protected site (I’m reserving 11 in case the requirements change later and we want Copy-on-Write in this site too) and 12 and 13 in the recovery site, I’d have these three conf files in their respective C:\HORCM\etc directories. Be sure to change the IP addresses, CMD device, Serial#, and LDEV#’s to match your environment.

horcm10.conf (protected site)

#/************************* For HORCM_MON *************************************/
HORCM_MON
#ip_address     service         poll(10ms)     timeout(30ms)
Local IP        horcm10          1000           3000

#/************************** For HORCM_CMD ************************************/
HORCM_CMD
#dev_name                dev_name                dev_name
\\.\CMD-11111 

#/************************** For HORCM_LDEV ***********************************/
HORCM_LDEV
#dev_group          dev_name    Serial#   CU:LDEV(LDEV#)   MU#
srm1                vm1         11111     0x111a             
srm1                vm2         11111     0x1110             
srm1                vm3         11111     0x1111             

#/************************* For HORCM_INST ************************************/
HORCM_INST
#dev_group  ip_address  service
srm1        Remote IP   horcm12

horcm12.conf (recovery site)

#/************************* For HORCM_MON *************************************/
HORCM_MON
#ip_address     service         poll(10ms)     timeout(30ms)
Local IP        horcm12          1000           3000

#/************************** For HORCM_CMD ************************************/
HORCM_CMD
#dev_name                dev_name                dev_name
\\.\CMD-11112

#/************************** For HORCM_LDEV ***********************************/
HORCM_LDEV
#dev_group      dev_name        Serial#   CU:LDEV(LDEV#)   MU#
srm1                vm1         11112     0x1003             
srm1                vm2         11112     0x1004             
srm1                vm3         11112     0x1005             
srm1-snap       vm1-snap        11112     0x1003            0 
srm1-snap       vm2-snap        11112     0x1004            0 
srm1-snap       vm3-snap        11112     0x1005            0 

#/************************* For HORCM_INST ************************************/
HORCM_INST
#dev_group      ip_address      service
srm1            Remote IP       horcm10
srm1-snap       Local IP        horcm13

horcm13.conf (recovery site)

#/************************* For HORCM_MON *************************************/
HORCM_MON
#ip_address     service         poll(10ms)     timeout(30ms)
Local IP        horcm13          1000           3000

#/************************** For HORCM_CMD ************************************/
HORCM_CMD
#dev_name                dev_name                dev_name
\\.\CMD-11112

#/************************** For HORCM_LDEV ***********************************/
HORCM_LDEV
#dev_group      dev_name        Serial#   CU:LDEV(LDEV#)   MU#
srm1-snap       vm1-snap        11112     0xa000             
srm1-snap       vm2-snap        11112     0xa001             
srm1-snap       vm3-snap        11112     0xa002             

#/************************* For HORCM_INST ************************************/
HORCM_INST
#dev_group          ip_address      service
srm1-snap           Local IP        horcm12

Notice that only horcm12.conf has any data in the MU (Mirror Unit) column, and only for the snapshot devices. If this is blank, running Recovery Plans in Test mode will fail; if there’s a value for every device, everything (including adding the Array Manager) will fail.

Once these files are created and saved to C:\HORCM\etc you can start the HORCMX service(s) on each SRM server.
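
On the protected site in my example that means starting HORCM10. A quick pairdisplay afterwards is a handy sanity check that the conf file and command device are working before moving on to SRM; the options shown are the ones I typically use and may vary slightly with your CCI version:

rem Start the HORCM instance registered earlier
net start HORCM10

rem Optional sanity check: point CCI at this instance and list the srm1 group
set HORCMINST=10
pairdisplay -g srm1 -fcx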

The configuration of the SRA is pretty straightforward: add an Array Manager, give it a name for the site you’re working on, and since HORCM is local to each server, in the first field enter HORCMINST=X, where X is the local HORCM instance for the replicated LUNs, then enter the username and password that have been set up for SRM to use.

If your HORCM config files are correct, the Array Manager will be added; repeat for the other site and run your normal SRM tests.

Issue with Host Based Replication that can cause hostd to panic

I’m currently working on a SRM 5.0 project. This is the end of week 5 and up until yesterday everything was going perfectly.

A little background: this environment is running vSphere 5.0 Update 1b; vSphere 5.1 isn’t an option because Symantec doesn’t yet support it with NetBackup. The client wants to use vSphere Replication, AKA Host Based Replication.

Yesterday the client noticed some of his hosts disconnecting from vCenter. The hosts were still available via SSH and all VM’s were still running perfectly fine. Restarting the management agents had no effect; the hosts would not reconnect to vCenter. At random times the hosts would disconnect and later reconnect on their own. Looking through the hostd log, we discovered the issue:

2012-10-18T16:25:54.241Z [296C5B90 panic 'Default']
-->
--> Panic: Assert Failed: "_quiescedType == quiescedType" @ bora/vim/hostd/hbrsvc/ReplicationGroup.cpp:3505
--> Backtrace:
--> [00] rip 1bfabb43
--> [01] rip 1be035be
--> [02] rip 1bfa1b00
--> [03] rip 1bfa1c12
--> [04] rip 1bd9c036
--> [05] rip 057ddcc5
--> [06] rip 057ddf49
--> [07] rip 057de526
--> [08] rip 057b1f20
--> [09] rip 057b26ba
--> [10] rip 057b9f96
--> [11] rip 1bd7e78a
--> [12] rip 1bd78b1c
--> [13] rip 1bd79556
--> [14] rip 1bd7ba28
--> [15] rip 052d8501
--> [16] rip 1bfce3e1
--> [17] rip 1bfc9533
--> [18] rip 1bfca0d8
--> [19] rip 052d8501
--> [20] rip 1bfbe679
--> [21] rip 1c676852

After opening a case with GSS, we learned the following: if a VM being replicated has multiple vmdk’s and the vmdk’s are in different states (for example, one disk is done replicating while another is not), and a state change occurs on the VM, such as a power cycle or a snapshot create or delete, the replication manager on the host incorrectly assumes all vmdk’s are in the same state; when they’re not, hostd panics. This condition will continue until replication of the VM is complete or you reboot the host. The state change triggering this in our case is the snapshots created by NetBackup. This issue is not present in ESXi 5.1 and will be fixed in ESXi 5.0 in a future update.

There isn’t a KB on this (yet), so I wanted to let anyone who may be seeing this issue know that it’s a known issue and is being worked on.


The KB on this issue is now live: http://kb.vmware.com/kb/2030515

2012 off to a great start!

Looking back over the past four months I can say 2012 has been one of the best years for me both professionally and personally.

Professionally, back in February I became VCDX certified; a couple of months later I accepted a position with VMware, and just last week I was recognized as a vExpert. I start my new position at VMware tomorrow and I couldn’t be more excited to hit the ground running.

Personally, my wife and I are expecting our third baby in August. Over the past few months I’ve learned how to manage my work/life balance, and it’s made me a better husband and father; I feel it’s also made me more productive and focused in my career.

I’m hoping the rest of 2012 stays as positive and exciting.

Time for Change, VMware here I come

For a little over the past two years I’ve worked for INX, which was acquired at the beginning of the year by Presidio. I’ve enjoyed working for such a great company; my boss and everyone I worked with were just awesome. Looking back at the last two years I noticed a trend in the types of projects I was doing: there were probably only around 10 involving virtualization, and the rest of my time was spent in other areas of the datacenter such as Cisco UCS, storage (mostly NetApp), some Cisco Nexus, and even some Microsoft AD and Exchange. I also had quite a bit of downtime, which I used to work in my lab to prep for my VCDX.

After completing my VCDX I became incredibly bored; I didn’t have any virtualization projects coming up and I wanted to focus on VMware technologies. I joined the Get Me in VMware LinkedIn group and saw a job posting for exactly what I was looking for. I sent in my resume and about a week later got a call from a recruiter inside VMware. We talked for a little while, and it turned out the position was one where I would be reporting to someone I’ve known for a few years. A day or two later I started interviewing, and a little over a week later I had an offer, which I accepted. I’m joining the Cloud Infrastructure Management (CIM) group inside VMware’s Professional Services Organization (PSO) as a Senior Consultant. I’ll be working with large enterprise customers designing and deploying vSphere, Site Recovery Manager, and vCloud Director solutions.

The decision to leave INX/Presidio and join VMware wasn’t an easy one. I did enjoy my time there; it really just came down to where I wanted to take my career, and VMware is the best fit for what I want to do. My awesome wife was very supportive of my decision even though it involves much more travel. My last day with INX was today, 4/13, and I start at VMware on 4/23. I’m taking a week off to spend with my family, but I couldn’t be more excited to start my new journey.

Fourth time was the charm, VCDX

Yesterday I received the news I’ve been working towards for nearly two years: I have finally obtained the VMware Certified Design Expert (VCDX) certification. I don’t know my number yet, but looking at the VCDX Directory there are now 75 VCDXs, so I’ll be somewhere between 69 and 75.

I’d like to thank my family for believing in me through this process; without my wife’s support this wouldn’t have been possible. The VMware Twitter community has also been great at keeping me motivated, and the support I received from you all was awesome. Scott Lowe, Jason Boche, Matt Cowger, and Rick Scherer, thank you again; you all went above and beyond in helping me prepare for my last defense and I owe you big time. Brian Rice, you were great to work with through this process and your successor has some mighty big shoes to fill.

In the end it took four attempts, but each one was a great learning experience and I wouldn’t change any of it; it’s made me a better architect and improved my confidence in talking with my peers and customers. To anyone else going through this process who may have been unsuccessful: don’t give up. No matter how long the journey takes, it’s well worth it when that congratulations email finally arrives.


My Experience – VCAP-DCD 5 Beta

Like a few others, this past week at Partner Exchange I sat the beta version of the VCAP-DCD 5. I found the exam to be pretty spot on with the blueprint, which is nice. As Jason Boche stated, there’s really not much you can do to study for this exam; it’s pretty much an experience-based exam.

I did finish all the questions with about 8 minutes to spare, which is much like how I did with the version 4 exam. One tip I can give is to read the question before reading the long scenario; sometimes the scenario really has no bearing on the question or answer.

I really didn’t think it was much different than the version 4 exam. If you design vSphere solutions regularly and passed the V4 DCD, I don’t think you’ll have a problem with the V5.