Mike's Virtualization Blog

Script to use with SRM to clone a Domain Controller

Every time I go into a Site Recovery Manager engagement and we start talking about how the test functionality works, I always get asked how to make a domain controller available to the test network. I always say I have a script for that, so I thought I’d share it with anyone who might need it.

The script below is pretty simple: it clones an existing domain controller, changes its network connection to the test network’s portgroup, and then boots it up.

#Add PowerCLI snapin
Add-PSSnapin VMware.VimAutomation.Core

#Connect to vCenter
Connect-VIServer <vCenter Name>

#Clone DC
New-VM -Name <NewVMName> -Datastore <DataStoreName> -VM <SourceVM> -ResourcePool <ClusterName>

#Change portgroup
Get-VM <NewVMName> | Get-NetworkAdapter | Set-NetworkAdapter -NetworkName <IsolatedPortGroupName> -Confirm:$false

#Sleep for 1 minute
Start-Sleep -s 60

#Power on DC
Start-VM -VM <NewVMName>

<vCenter Name> is the name of the vCenter to connect to. If SRM will be running this script, make sure the service account SRM runs under has the required vCenter permissions.

<NewVMName> is the name you want to use for the cloned DC.

<DataStoreName> is the name of the datastore you want to store the clone in.

<SourceVM> is the DC you want to clone.

<ClusterName> is the cluster or resource pool you want the cloned DC to run in.

<IsolatedPortGroupName> is the name of the portgroup you want the cloned DC to use.

And that’s it; after running this short script (or having SRM run it) you’ll have an isolated DC from your DR site connected to your isolated SRM test network. Just be sure to delete this DC when you’re done with your tests.
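For the cleanup step, a couple of PowerCLI lines like these will power off and delete the clone. This is just a sketch; <NewVMName> is the same placeholder used above, and -DeletePermanently removes the files from the datastore rather than only unregistering the VM.

```powershell
#Power off the cloned DC (it's isolated, so a hard stop is fine)
Stop-VM -VM <NewVMName> -Confirm:$false

#Delete the clone and its files from the datastore
Remove-VM -VM <NewVMName> -DeletePermanently -Confirm:$false
```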

SRM with Hitachi Storage

I’m currently at a client who is moving their SRM-protected VMs onto Hitachi VSP arrays. It’s been close to six years since I’ve done any work with Hitachi and replication, so I assumed it would have gotten better since then. Unfortunately it hasn’t; it’s exactly the same as it was.

The storage admin, thankfully, is pretty sharp and already had replication (or, as Hitachi calls it, pairs) set up. We installed HORCM (Hitachi Online Remote Copy Manager), which contains the commands used by the SRA, on the SRM servers. This requires far more manual editing of configuration files than should be necessary. Then we installed the Hitachi SRA on both servers.

The goal of this SRM implementation is to enable failover from Site A to Site B and to use the reprotect option to reverse the replication and enable failback to Site A; currently Site B is only a recovery site. Testing the recovery plans, using Hitachi’s Copy-on-Write Snapshots, is also required.

After creating the HORCM configuration files, we threw a test VM onto one of the replicated datastores, started to test, and immediately ran into errors.

Looking through the SRM logs, it was obvious the HORCM config files were our issue. After a couple hours of troubleshooting we finally got everything sorted out and working for all of our requirements. Surprisingly (to me at least), there are really no solid “here’s how to configure SRM with Hitachi storage” guides or blog posts out there, and don’t even get me started on how bad Hitachi’s documentation is. So here’s the quick and dirty of what you need.

I’m assuming you already have vCenter and SRM installed in both sites, with the SRM servers paired, and that the storage admin has used the paircreate command to create the required pairs.
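For reference, the storage admin’s side of that prerequisite looks roughly like the following CCI command, run from the protected side. This is only a sketch: the group name (srm1) matches the conf files later in this post, but the fence level and instance number here are assumptions and will differ in your environment.

```powershell
# Create the replication pair for device group srm1 via local HORCM instance 10
# -f never = fence level, -vl = local volume is the primary
paircreate -g srm1 -f never -vl -IH10
```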

  • Install HORCM to the default location (C:\HORCM).
  • Add C:\HORCM\etc to the system’s PATH.
  • Install the Hitachi SRA; there are no options, just press Next a few times.
  • According to Hitachi’s documentation, two system (not user) environment variables need to be set on both SRM servers: SplitReplication=True and RMSRATMU=1. We found that setting RMSRATMU=1 was one of our issues, so we removed it.
  • Reboot the SRM servers.
  • The SRM server needs a command device. If the SRM servers are VMs (which I would hope they are), you’ll need to create an RDM in physical compatibility mode and map it to the VM; let Windows write a signature to the device, but don’t format it. I found conflicting information on the size it should be, ranging from 30-40 MB, so use 50 to be safe.
  • To issue commands, HORCM needs to be running as a service. This is done by copying horcm_run.txt from C:\HORCM\tool to horcmX_run.txt, where X is the number for this instance; editing the copied file to set the HORCMINST variable; then saving and closing the file and running C:\HORCM\tool\svcexe /S=HORCMX /A=C:\HORCM\Tool\svcexe.exe, again where X is the instance number. This needs to be done on both SRM servers.
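Concretely, that last service-install step looks something like this from an elevated PowerShell prompt (instance 10 is used as an example; substitute your own instance number):

```powershell
# Copy the sample run file for HORCM instance 10
Copy-Item C:\HORCM\Tool\horcm_run.txt C:\HORCM\Tool\horcm10_run.txt

# Edit horcm10_run.txt to set HORCMINST to 10, then register the service
C:\HORCM\Tool\svcexe.exe /S=HORCM10 /A=C:\HORCM\Tool\svcexe.exe
```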

A note about HORCM instances: if you want to use Hitachi’s Copy-on-Write Snapshots, you need two HORCM instances on the recovery site, and the instance ID for the snapshots must be the ID of the instance for the replicated LUNs (LDEVs, as Hitachi calls them) plus one. So if you used HORCM10 as the instance for the replicated LUNs, you must use HORCM11 for the snapshots, or running test failovers will not work.
Once the services have been installed, you must create the horcmX.conf files and place them in C:\HORCM\etc; again, X is the HORCM instance ID.

Creating these files correctly is the hardest part of this config. Below are samples that should work for most setups with the same requirements we had.
Assuming I used HORCM instance ID 10 on the protected site (I’m going to reserve 11 in case the requirements for using Copy-on-Write in this site change later) and 12 and 13 in the recovery site, I’d have these three conf files in their respective C:\HORCM\etc directories. Be sure to change the IP addresses, CMD device, Serial#, and LDEV#s to match your environment.

horcm10.conf (protected site)

#/************************* For HORCM_MON *************************************/
#ip_address     service         poll(10ms)     timeout(30ms)
Local IP        horcm10          1000           3000

#/************************** For HORCM_CMD ************************************/
#dev_name                dev_name                dev_name

#/************************** For HORCM_LDEV ***********************************/
#dev_group          dev_name    Serial#   CU:LDEV(LDEV#)   MU#
srm1                vm1         11111     0x111a             
srm1                vm2         11111     0x1110             
srm1                vm3         11111     0x1111             

#/************************* For HORCM_INST ************************************/
#dev_group  ip_address  service
srm1        Remote IP   horcm12

horcm12.conf (recovery site)

#/************************* For HORCM_MON *************************************/
#ip_address     service         poll(10ms)     timeout(30ms)
Local IP        horcm12          1000           3000

#/************************** For HORCM_CMD ************************************/
#dev_name                dev_name                dev_name

#/************************** For HORCM_LDEV ***********************************/
#dev_group      dev_name        Serial#   CU:LDEV(LDEV#)   MU#
srm1                vm1         11112     0x1003             
srm1                vm2         11112     0x1004             
srm1                vm3         11112     0x1005             
srm1-snap       vm1-snap        11112     0x1003            0 
srm1-snap       vm2-snap        11112     0x1004            0 
srm1-snap       vm3-snap        11112     0x1005            0 

#/************************* For HORCM_INST ************************************/
#dev_group      ip_address      service
srm1            Remote IP       horcm10
srm1-snap       Local IP        horcm13

horcm13.conf (recovery site)

#/************************* For HORCM_MON *************************************/
#ip_address     service         poll(10ms)     timeout(30ms)
Local IP        horcm13          1000           3000

#/************************** For HORCM_CMD ************************************/
#dev_name                dev_name                dev_name

#/************************** For HORCM_LDEV ***********************************/
#dev_group      dev_name        Serial#   CU:LDEV(LDEV#)   MU#
srm1-snap       vm1-snap        11112     0xa000             
srm1-snap       vm2-snap        11112     0xa001             
srm1-snap       vm3-snap        11112     0xa002             

#/************************* For HORCM_INST ************************************/
#dev_group          ip_address      service
srm1-snap           Local IP        horcm12

Notice that only horcm12 has any data in the MU (Mirror Unit) column, and only for the snapshot devices. If this is blank, running recovery plans in test mode will fail; if there’s a value for every device, everything (including adding the Array Manager) will fail.

Once these files are created and saved to C:\HORCM\etc you can start the HORCMX service(s) on each SRM server.
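Starting them can be done from PowerShell like this (the service names come from the /S= value used when registering them earlier; HORCM10 here is an example):

```powershell
# Start the HORCM instance service registered with svcexe earlier
Start-Service HORCM10

# Confirm the HORCM service(s) are running on this server
Get-Service HORCM*
```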

The configuration of the SRA is pretty straightforward. You add an Array Manager and give it a name for the site you’re working on; since HORCM is local to each server, in the first field enter HORCMINST=X, where X is the local HORCM instance for the replicated LUNs, and then enter the username and password that have been set up for SRM to use.

If your HORCM config files are correct, the Array Manager will be added; repeat for the other site and run your normal SRM tests.
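Before (or after) adding the Array Managers, it’s worth confirming from the command line that each HORCM instance can actually see its pairs; if pairdisplay fails here, the SRA will fail too. This is a sketch using the group names and instance numbers from the examples above (the -IH/-IM distinction is remote replication vs. local snapshot pairs, and -fcx shows LDEV numbers in hex):

```powershell
# On the protected-site SRM server: query the replicated group via instance 10
pairdisplay -g srm1 -IH10 -fcx

# On the recovery-site SRM server: check the snapshot group via instance 12
pairdisplay -g srm1-snap -IM12 -fcx
```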

Issue with Host Based Replication can cause hostd to panic

I’m currently working on a SRM 5.0 project. This is the end of week 5 and up until yesterday everything was going perfectly.

A little background: this environment is running vSphere 5.0 Update 1b; vSphere 5.1 isn’t an option because of Symantec’s lack of support for it in NetBackup. The client wants to use vSphere Replication, AKA Host Based Replication.

Yesterday the client noticed some of his hosts disconnecting from vCenter. The hosts were still available via SSH, and all VMs were still running perfectly fine. Restarting the management agents had no effect; the hosts would not reconnect to vCenter. At random times the hosts would disconnect and later reconnect on their own. Looking through the hostd log, we discovered the issue:

2012-10-18T16:25:54.241Z [296C5B90 panic 'Default']
--> Panic: Assert Failed: "_quiescedType == quiescedType" @ bora/vim/hostd/hbrsvc/ReplicationGroup.cpp:3505
--> Backtrace:
--> [00] rip 1bfabb43
--> [01] rip 1be035be
--> [02] rip 1bfa1b00
--> [03] rip 1bfa1c12
--> [04] rip 1bd9c036
--> [05] rip 057ddcc5
--> [06] rip 057ddf49
--> [07] rip 057de526
--> [08] rip 057b1f20
--> [09] rip 057b26ba
--> [10] rip 057b9f96
--> [11] rip 1bd7e78a
--> [12] rip 1bd78b1c
--> [13] rip 1bd79556
--> [14] rip 1bd7ba28
--> [15] rip 052d8501
--> [16] rip 1bfce3e1
--> [17] rip 1bfc9533
--> [18] rip 1bfca0d8
--> [19] rip 052d8501
--> [20] rip 1bfbe679
--> [21] rip 1c676852
After opening a case with GSS, we learned the cause. If a VM with multiple VMDKs is being replicated while the VMDKs are in different states (for example, one disk is done replicating while another is not), and a state change occurs on the VM, such as a power cycle or a snapshot create or delete, the replication manager on the host incorrectly assumes all VMDKs are in the same state; when they’re not, it causes hostd to panic. This condition will continue until replication of the VM is complete or you reboot the host. The state change triggering this here is the snapshots created by NetBackup. This issue is not present in ESXi 5.1 and will be fixed in ESXi 5.0 in a future update.

There isn’t a KB on this (yet), so I wanted to let anyone who may be seeing this issue know that it’s a known issue and is being worked on.


The KB on this issue is now live: http://kb.vmware.com/kb/2030515