Friday, July 30, 2010

Event ID: 1411 after demoting domain controllers

While performing some tasks at a client’s office for their directory summer maintenance, I ran into a problem I haven’t encountered for quite some time and figured I’d blog about it this time.

Scenario:

- Windows Server 2003 is being used

- 4 domain controllers in the environment.

- 2 domain controllers will be demoted and retired.

- 2 virtual machines have been staged and will replace the 2 domain controllers to be decommissioned with the same name and IP.

Actions Performed:

1. Demote DC1.

2. Force replication.

3. Verify replication.

4. Run NTDSUtil to ensure DC was cleaned out.

5. Re-IP and rename new virtual machine with proper name.

6. Promote new virtual machine to DC.

7. Force replication, verify replication.

8. Repeat for 2nd DC.

Problem:

We went ahead and reviewed the event logs after replacing the 2 old domain controllers and noticed that the 2 remaining old domain controllers (not the virtual machines) were logging a lot of event ID: 1411 errors. One of the DCs was logging more errors than the other, but both were complaining about 2 GUIDs that appeared to belong to the 2 removed domain controllers:

image

image image

------------------------------------------------------

Active Directory failed to construct a mutual authentication service principal name (SPN) for the following domain controller.
Domain controller:
ceb25b3a-7741-4dce-9447-d02f9b0bd526._msdcs.domain.net
The call was denied. Communication with this domain controller might be affected.
Additional Data
Error value:
8589 The DS cannot derive a service principal name (SPN) with which to mutually authenticate the target server because the corresponding server object in the local DS database has no serverReference attribute.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

------------------------------------------------------

As seen in the above screenshots, this doesn’t look good. The following were the troubleshooting steps I did:

1. Opened Active Directory Sites & Services to review the NTDS replication objects. Didn’t see a reference to these 2 GUIDs.

2. Went into ADSIEdit to look for the repsTo attribute. Didn’t see any references to the 2 GUIDs.

3. Forced replication via Replication Monitor. Didn’t see any errors or references.

4. Ran DCDiag and NetDiag. No errors.

Everything looked good, and the following KB: http://support.microsoft.com/kb/938704 says that the KCC will eventually remove these connections, so that’s when we decided to wait.

Resolution:

There wasn’t really a resolution. As the KB article says, the KCC will run again in 24 hours to remove those links, and that was what happened. Just so I add a bit of value here for those that may read this, the event you want to wait for that will clear up this error is event ID: 1104.

image

What I noticed was that this event had to be logged 2 or more times before the error referencing that GUID went away:

image

In the above screenshot, you see this event @ 11:47:03, then the error gets logged at 11:50:52, then another event ID 1104 gets logged again. Then after a few more hours, I noticed another 1104 being logged.

image

If I scroll through these events, they are all referencing the same GUID but the output is a bit different:

image

The Knowledge Consistency Checker (KCC) successfully terminated the following change notifications.
Directory partition:
CN=Configuration,DC=domain,DC=net
Destination network address:
ceb25b3a-7741-4dce-9447-d02f9b0bd526._msdcs.domain.net
Destination domain controller (if available):
CN=NTDS Settings\0ADEL:ceb25b3a-7741-4dce-9447-d02f9b0bd526,CN=DC1016\0ADEL:851e0305-2d6c-4016-89dc-fd0a18882b7b,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=domain,DC=net
This event can occur if either this domain controller or the destination domain controller has been moved to another site.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

image

The Knowledge Consistency Checker (KCC) successfully terminated the following change notifications.
Directory partition:
DC=domain,DC=net
Destination network address:
ceb25b3a-7741-4dce-9447-d02f9b0bd526._msdcs.domain.net
Destination domain controller (if available):
CN=NTDS Settings\0ADEL:ceb25b3a-7741-4dce-9447-d02f9b0bd526,CN=DC1016\0ADEL:851e0305-2d6c-4016-89dc-fd0a18882b7b,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=domain,DC=net
This event can occur if either this domain controller or the destination domain controller has been moved to another site.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

image

The Knowledge Consistency Checker (KCC) successfully terminated the following change notifications.
Directory partition:
DC=ForestDnsZones,DC=domain,DC=net
Destination network address:
ceb25b3a-7741-4dce-9447-d02f9b0bd526._msdcs.domain.net
Destination domain controller (if available):
CN=NTDS Settings\0ADEL:ceb25b3a-7741-4dce-9447-d02f9b0bd526,CN=DC1016\0ADEL:851e0305-2d6c-4016-89dc-fd0a18882b7b,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=domain,DC=net
This event can occur if either this domain controller or the destination domain controller has been moved to another site.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

----------------------------------------------------------------------

The errors will eventually go away and it looks like it takes more than 24 hours to do so.
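
If you want to match the GUID in the 1411 error against the deleted objects named in the 1104 events without eyeballing it, something like the following could work. This is only a minimal sketch: the function names are my own, and it assumes the event text has already been copied out of the event log as plain strings in the same layout as the samples above.

```python
import re

GUID = r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}"

# One RDN of a deleted object as it appears in the 1104 event text,
# e.g. "CN=DC1016\0ADEL:851e0305-...".
DELETED_RDN = re.compile(r"CN=([^,\\]+)\\0ADEL:(" + GUID + r")")

# The GUID-based DNS alias, e.g. "ceb25b3a-...._msdcs.domain.net".
MSDCS_ALIAS = re.compile(r"(" + GUID + r")\._msdcs\.")


def deleted_objects(dn):
    """Return (original name, GUID) pairs for each deleted RDN in a DN."""
    return DELETED_RDN.findall(dn)


def alias_guid(address):
    """Return the GUID from a <guid>._msdcs.<domain> name, or None."""
    match = MSDCS_ALIAS.search(address)
    return match.group(1) if match else None
```

Run against the 1104 text above, deleted_objects() returns the pair for NTDS Settings and the pair for DC1016, and the NTDS Settings GUID is the same one alias_guid() pulls out of the 1411 error, which is how you can tell the error belongs to the demoted DC1016.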

Thursday, July 29, 2010

Exchange 2007 Unified Messaging Auto Attendant unable to transfer to OCS 2007 R2 Response Group

I was at a client’s office early this week to migrate a department to OCS, moving them off of a BCM 400. The client’s a law firm and wanted to set up a call tree to allow inquiries to the “Personal Injury” department to flow through 2 groups of employees:

Group 1

Employees: 3 x PI (Personal Injury) students

Selection: Round robin

Duration: 20 seconds

-------------------------------------------------------

Group 2

Employees: Associate #1, Associate #2, Associate #3.

Selection: Serial

Duration: 20 seconds

--------------------------------------------------------

If none of these people answer the call, forward call to voicemail.

--------------------------------------------------------

Easy enough right? So I proceeded to configure the RSG contact object, agents, queues, workflow for 2 Basic Hunt Groups with the respective settings and assigning extension 852 to Group 1 and 853 to Group 2. I then created a generic AD account, enabled it for email (you cannot enable a user for UM unless they are enabled for messaging: http://terenceluk.blogspot.com/2010/07/enabling-user-for-exchange-2007-unified.html), and then assigned it the extension 854. Prior to actually changing the auto attendant to include an announcement for this extension, I dialed the auto attendant, punched in 852 to try and test the flow only to find that I get a:

“Sorry, I am unable to find the person with that extension…”

From here on, I logged into MOC with a user enabled for enterprise voice and hit 852 then dialed and it works. Then I thought about the EUM attribute for each user that is enabled for UM and remembered that if that field is not populated, AA will not find that user. RSG contacts don’t have such attributes because they don’t even show up in Exchange 2007’s management console recipients.

image

I proceeded to do some searches and found a few forum posts confirming that this isn’t going to work.

The workaround? Configure the generic mailbox with the extension 854 to forward to extension 852 (Group 1). Configure group 2 to forward directly to voicemail of the generic mailbox’s SIP address.
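
To double-check that the workaround still gives the caller the flow the client asked for, the chain can be written down as a small routing table. This is just a sketch for reasoning about the design, not anything OCS or Exchange runs. The table layout and function names are made up, and it assumes that Group 1 overflows to Group 2 on no answer and that the 20 second duration applies to each group as a whole.

```python
# extension -> (description, ring seconds, next hop when nobody answers)
ROUTES = {
    "854": ("Generic UM mailbox (forwards immediately)", 0, "852"),
    "852": ("Group 1: PI students, round robin", 20, "853"),
    "853": ("Group 2: associates, serial", 20, "voicemail:854"),
}


def trace_call(start, routes=ROUTES):
    """Follow the no-answer path from an extension until no route remains."""
    path, seen, hop = [], set(), start
    while hop in routes and hop not in seen:  # 'seen' guards against loops
        seen.add(hop)
        path.append(hop)
        hop = routes[hop][2]
    path.append(hop)
    return path


def worst_case_ring_seconds(start, routes=ROUTES):
    """Total ring time if nobody along the path picks up."""
    return sum(routes[hop][1] for hop in trace_call(start, routes) if hop in routes)
```

Tracing from 854 gives 854, 852, 853 and then the generic mailbox’s voicemail, so the auto attendant only ever needs to know about 854, which is the one object it can actually find.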

Wednesday, July 28, 2010

Problems with Office Communicator handling extensions (disappearing phone number)

I ran into an interesting problem the other day while at a client’s office when I asked one of the IT managers about problems he has had with OCS 2007 R2.

One of the features I’ve always liked was the Outlook contact integration with the MOC client. I can look up phone numbers on the fly when I need to make a call outbound:

image

The lookup works great. It pulls information from my Outlook contact list and allows me to dial:

image

image

Ok, but what happens when I put in an extension?

image

Once I put in an extension, my MOC client no longer shows the phone number.

image

I haven’t really had the time to research this to see if there is a solution, but it’s definitely an issue I’d never noticed before.

Here’s the build information:

Microsoft Office Communicator 2007 R2

image

Version 3.5.6907.196

Outlook 2007

image

Version 12.0.6514.500 SP2 MSO 12.0.6535.5002

Updated on July 29, 2010

I posted this question on the Microsoft partner forums and was asked to try putting “X 243” in my Outlook contact but the number still didn’t show up. I actually went through this exercise with the client already asking him to try “X 243” and “X243” without any luck.

image image

Tuesday, July 27, 2010

GALGRAMMARGENERATOR.exe with the -a switch does not work as advertised

I’ve been trying to troubleshoot an ongoing issue at a client’s office when they try to execute “galgrammargenerator.exe -u” and it would not successfully complete. I will post more information on that when I get it resolved but while working with a Microsoft engineer, I found out that the “-a” switch doesn’t work as advertised. Here’s what the help says when you execute “galgrammargenerator.exe ?”:

-a: generate grammar scoped to the AddressList or OrganizationalUnit whose Identity is given in the parameter to this option.

Basically what I did with this switch was:

C:\Program Files\Microsoft\Exchange Server\Bin>GALGRAMMARGENERATOR.exe -u -a ou=users,ou=someOU,dc=someValue,dc=someDomain,dc=com

When I found that the problem I was encountering still existed even though I’ve limited the scope to a specific OU which did not contain the object that the logs were showing, I emailed the Microsoft engineer asking why. Here’s his response:

Also, I understand that you attempted to run the GalGrammarGenerator.exe tool with the /a switch in order to omit the "Auto Attendant", but the issue still occurs. First, the syntax you are using is correct. However, based on my test, even if an OU is specified by this switch, the DTMF map of all users within the organization will be updated.

So it looks like the switch doesn’t really work as advertised.

I’ll post back when I finally get the issue resolved.

Thursday, July 22, 2010

Addressing the VirtualCenter service not starting because SQL service hasn’t started yet

As mentioned in the previous post (http://terenceluk.blogspot.com/2010/07/vcenter-virtual-center-service-fails-to.html), if you install a SQL instance onto the same server as VirtualCenter / vCenter, sometimes the SQL database does not start and load the vCenter database fast enough before vCenter attempts to start. One of the ways to address this is to add the SQL server service as a dependency to the VirtualCenter service (I always struggle trying to figure out whether I should call it VirtualCenter or vCenter service since it’s still technically named VirtualCenter in the services console in version 4.0 Update 2).

image

Adding the SQL service as a dependency isn’t as straightforward as I originally thought. What I attempted initially was to open up the service’s properties and do it through the GUI but, as it turns out, you can’t add it like that. The following is what you need to do:

1. Start –> Run –> Regedit and navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services.

2. Find the name of the service that we want VirtualCenter to depend on. In our case, it’s MSSQLServer.

image

3. Locate the key for VirtualCenter (it’s named vpxd).

image

Notice the DependOnService value that’s type: REG_MULTI_SZ.

4. Open up the properties of the DependOnService value.

image

5. Add an additional line and type in: MSSQLSERVER.

image

6. Hit OK and close the registry editor.

Now when you navigate into the services console to check the VirtualCenter service, you will see the following:

image
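
If you have to do this on more than one server, the same registry edit can be scripted. Below is a minimal sketch using Python’s standard winreg module. The function names are mine, it has to run elevated on the Windows server itself, and it assumes the service key is vpxd and the SQL service is the default instance MSSQLSERVER as in the steps above.

```python
def with_dependency(existing, service):
    """Return the dependency list with `service` appended if it is missing.

    Service names are compared case-insensitively.
    """
    if any(name.lower() == service.lower() for name in existing):
        return list(existing)
    return list(existing) + [service]


def add_service_dependency(service_key, dependency):
    """Append `dependency` to a service's DependOnService value (Windows only)."""
    import winreg  # imported here so the helper above works on any platform

    path = r"SYSTEM\CurrentControlSet\services" + "\\" + service_key
    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path, 0, access) as key:
        try:
            current, _value_type = winreg.QueryValueEx(key, "DependOnService")
        except FileNotFoundError:
            current = []  # the service has no dependencies yet
        updated = with_dependency(current, dependency)
        winreg.SetValueEx(key, "DependOnService", 0, winreg.REG_MULTI_SZ, updated)
    return updated
```

Calling add_service_dependency("vpxd", "MSSQLSERVER") would be the equivalent of steps 3 to 6. The existing entries are kept and the new one is appended, which is the same thing you do by hand when adding an additional line in the editor.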

If you’re still having problems with the service starting upon restart, see my previous blog post:

vCenter / Virtual Center Service fails to start with event ID: 1000, 7024, 7001, 18456

http://terenceluk.blogspot.com/2010/07/vcenter-virtual-center-service-fails-to.html

vCenter / Virtual Center Service fails to start with event ID: 1000, 7024, 7001, 18456

After having all the issues with distributed network switches and Nexus 1000v switches and vCenter / Virtual Center’s database on a SQL instance that’s on a different server, I decided to recommend to the client that we proceed with installing SQL onto the vCenter server’s operating system and use that to host the database. Since we’ve moved to this configuration, we noticed that the VirtualCenter and VirtualCenter Management Webservices services would never start upon a reboot:

image

I went ahead to review the logs and found the following:

Event ID: 1000

The description for Event ID 1000 from source VMware VirtualCenter Server cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

Failed to intialize VMware VirtualCenter. Shutting down...

the message resource is present but the message is not found in the string/message table

image

Event ID: 7001

The VMware VirtualCenter Management Webservices service depends on the VMware VirtualCenter Server service which failed to start because of the following error:
The service has returned a service-specific error code.

image

Event ID: 7024

The VMware VirtualCenter Server service terminated with service-specific error The system cannot find the file specified..

image

Event ID: 18456

Login failed for user ‘blah/svc_vmvc’. Reason: Failed to open the explicitly specified database.

[CLIENT: <local machine>]

image

Through reviewing the logs, it’s obvious that the database isn’t ready when the services are trying to start, so they fail to do so. After having experienced something similar before with another client, I decided to simply set the VirtualCenter service to depend on the MSSQLServer service. Since the VirtualCenter Management Webservices service depends on the VirtualCenter service, it will wait until VirtualCenter has started before starting itself.

(For instructions, see the next blog: http://terenceluk.blogspot.com/2010/07/addressing-virtualcenter-service-not.html)

I went ahead to reboot the server after I modified the registry and added the dependencies only to find that the problem still continued to happen. I’m unsure as to why but I decided to try and change the service from Automatic:

image

…to Automatic (Delayed Start):

image

Once I completed this and restarted the server, the services began to start properly. There was one caveat though: the VirtualCenter service only starts 2 to 3 minutes after the server is reachable. In case the person reading this doesn’t know, the reason for this is that setting a service to Automatic (Delayed Start) means it will wait until all of the other Automatic services have completely started before it starts, and the Service Control Manager also sets the priority of the initial thread for these delayed services to THREAD_PRIORITY_LOWEST.
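
The same startup type change can also be made from the command line with sc.exe instead of the services console. Here is a small sketch that wraps it, assuming the service name is vpxd as in the registry and that it is run from an elevated prompt on the server. The function names are just for illustration.

```python
import subprocess


def delayed_start_command(service):
    """Build the sc.exe command that sets a service to Automatic (Delayed Start)."""
    # sc.exe expects the option name and its value as separate tokens,
    # i.e. "start=" followed by "delayed-auto".
    return ["sc", "config", service, "start=", "delayed-auto"]


def set_delayed_start(service):
    """Run the command (Windows only, needs administrative rights)."""
    return subprocess.run(delayed_start_command(service), check=True)
```

For example, set_delayed_start("vpxd") runs sc config vpxd start= delayed-auto.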

Wednesday, July 21, 2010

Enabling user for Exchange 2007 Unified Messaging voicemail but not email

I came across an interesting question from a client a few months ago. The client was a small 100 user law firm that needed to mimic what their Nortel BCM 400 was able to do and that was to create a general voice mailbox. His question was whether he can just enable a user account named “General Voice Mailbox” for Exchange 2007 UM but not email.

The GUI obviously doesn’t allow you to do that since it only displays users with mailboxes so I went ahead and asked a Microsoft engineer in the Partner Forums and was told that this wasn’t possible. Here’s the answer I got:

Based on my research, we are not able to enable UM for a user object without enabling mail. The UM features are based on the basic mail features provided by the Exchange servers. For example, the Missed Called Notification is sent via e-mail. If a user object is not mailbox-enabled, this user is not able to receive the notification.

What I ended up doing was enabling the user for messaging, then hiding the user from the GAL and setting the account to not accept messages from anyone.

I did some searches over the internet when I was trying to figure this out and wasn’t able to find any hits, so I figured I’d post here in case someone does the same in the future.

Tuesday, July 20, 2010

What happens when a VMware ESX host loses redundant fibre channel (FC) links to a datastore

I’ve been fortunate enough to be involved with a project for a law firm to deploy 10 ESX hosts at 2 co-located / geographically dispersed sites for DR (disaster recovery). Other than all the other technologies I got to work with: EMC, VMware SRM, vSphere, the list goes on (I love this datacenter virtualization stuff), I came across an interesting discovery during the testing phase. I was responsible for testing all of the ESX clusters and their redundancy, whether it was network or storage, so I had to create a test plan. Other than the test cases for HA, DRS, Nexus 1000v (what a disaster) and all the bells and whistles of the technologies involved, I had one test case that revealed something completely new to me, and this was the test of the FC paths to storage. We used EMC PowerPath to provide the redundant FC links to the fabric switches and I included the following test case:

Category | Test | Command / Procedure | Expected Behavior | Result | Notes
VMware - HA / EMC Storage | HA virtual machine restart | Choose a host with a test virtual machine, disconnect both FC cables, ensure virtual machine restarts on another host. | | |

As shown in the above table, I anticipated that once the ESX host loses both its paths to the fabric and thus losing connectivity to the datastore, VMware HA would restart the virtual machine on the other host. This did not happen and here’s how it would look if you did the test I have above:

1. Once you disconnect the 2 FC cables, navigate to the host’s Configuration tab –> Storage Adapters and click on the vmhba. You will see that all the other datastores are gone aside from whichever datastore a powered up virtual machine resides on.

image

2. Clicking on the Paths tab will show the following:

image

3. Great, so the path is indicated as dead. So what does the information for the virtual machine show?

image

4. Interestingly enough, the testvm still shows that it’s powered on at the host that lost all of its FC paths to the EMC SAN. So what happens if I try to open the console window?

image

5. Here we see an “Unable to connect to the MKS: Virtual machine config file does not exist..” message. No surprise here, the host did lose access to the virtual machine files.

So long story short, I went ahead and posted a question on the VMware community forums and someone responded telling me HA doesn’t restart virtual machines for lost FC connections. This didn’t surprise me as the training course I took for 3.5 always talked about “host isolation” or the host actually being down. Then I went ahead and did some tests with the VM Monitoring feature, which monitors the heartbeat via VMware Tools, to see if that would restart it and found out that it does indeed restart the virtual machine, but only:

1. When you reconnect the FC connection.

2. On the same host.

The post I wrote on the forum hasn’t gotten a lot of responses from other users in the community and Google searches don’t appear to yield many results either (or maybe I’m not typing in the right words to search), but I’ve come to believe that there is no solution for this unless it’s some form of manual scripting (possibly a forced reboot of the host) with monitoring (of the FC links).

My colleague that was with me went ahead and reached out to his ex-coworker in the datacenter field and surprisingly he thought that VM Monitoring would restart it. I’m sure it wouldn’t because I left the links disconnected for 30 minutes and confirmed that the virtual machine was indeed off by reviewing the Windows event logs and seeing that the logs had a 30 minute gap between events.
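
For what it’s worth, the decision part of the manual scripting idea mentioned above is simple enough to sketch. The code below is hypothetical: it assumes something else has already collected the state of every FC path per host as strings such as “active” and “dead” (how you collect that is not covered here), and the function names and host names are made up. It only answers the question of which hosts have lost every path.

```python
def all_paths_dead(path_states):
    """True when a host reports at least one path and every one of them is dead."""
    states = [state.lower() for state in path_states]
    return bool(states) and all(state == "dead" for state in states)


def hosts_needing_attention(paths_by_host):
    """Return the hosts (sorted by name) that have lost all of their FC paths."""
    return sorted(
        host for host, states in paths_by_host.items() if all_paths_dead(states)
    )
```

A host with one dead and one active path is left alone since PowerPath still has a working link. Only a host where both paths are dead gets flagged, and what you do with it afterwards (alert someone, force a reboot so HA kicks in) is a separate decision.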

Monday, July 19, 2010

Problems when updating Client's UCS Firmware

One of the emails I sent out after completing my first firmware update:

Some other notes worth mentioning during the firmware update:

Updating Passive Fabric

As per the document with instructions on how to update the firmware step-by-step: http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/upgrading/from1.1.1/to1.2.1/UpgradingCiscoUCSFrom1.1.1To1.2.1_chapter4.html

While performing the following step:

Activating the Fabric Interconnect Firmware for a Cluster Configuration

Activating the Firmware on a Subordinate Fabric Interconnect to Release 1.2(1)

Once I brought the firmware version from 1.1 to 1.2, Fabric B (passive) threw an IOM 1 error on Chassis 2. When navigating to the “High availability” status of Fabric B, the Ready value was No but the State was Up. The description of the problem was: chassis configuration incomplete. When I view the properties of IOM 1 on Chassis 2, the Faults tab indicates that the module was removed. I checked the status of the failed IOM and noticed all the servers were in the failed state. I confirmed that all the 4 servers in Chassis 2 were offline as I was not able to KVM or ping the service console IP of the 4 ESX servers.

The document basically states the following:

Step 9

Verify the high availability status of the subordinate fabric interconnect.

If the High Availability Status area for the fabric interconnect does not show the following values, contact Cisco Technical Support immediately. Do not continue to update the primary fabric interconnect.

Field Name | Required Value
Ready field | Yes
State field | Up

I was a bit worried that I’ll have to call Cisco tonight but as it turns out, after 5 to 10 minutes or so, the missing IOM came back and the Ready field is now Yes on Fabric B.
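
Given that the Ready field took 5 to 10 minutes to recover in my case, the check in Step 9 is really a “wait and re-check” rather than a one-off look. The sketch below shows that logic only. The get_status callable is hypothetical (you would have to supply something that returns the two field values, for example by reading them off UCSM yourself), and the 15 minute timeout is my own guess, not a Cisco recommendation.

```python
import time


def ha_ready(status):
    """True when the status matches the required values from the document."""
    return status.get("ready") == "Yes" and status.get("state") == "Up"


def wait_for_ha(get_status, timeout_s=900, interval_s=30, sleep=time.sleep):
    """Poll until the HA status is good or the timeout is reached.

    Returns True if Ready=Yes and State=Up were seen, False on timeout.
    """
    waited = 0
    while True:
        if ha_ready(get_status()):
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

If this returns False, that’s the point where the document’s advice applies: don’t touch the primary fabric interconnect and call Cisco.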

Another note I’d like to make is that updating the fabric takes a lot of time. Don’t sit around in the Firmware Activation page watching the status as Activating because you can view a progress status with a % in the Fabric’s properties page.

Updating Active Fabric

5 minutes into activating the active fabric, I got kicked out of UCSM. I was able to reconnect to UCSM via the passive Fabric but upon connecting, Chassis 1 and 2 and Fabric A and B were all highlighted in red meaning there are faults. After waiting around for 5 minutes, they started turning orange and yellow indicating that they’re slowly getting back to better health. While the status of Chassis 1 and FI A was still yellow/orange, I tried to ping the service console of one of the ESX blades on Fabric A and was not able to get a reply. I got a reply when I pinged the blades on Fabric B though.

I guess it’s safe to say that as long as the active fabric is getting updated, the servers will be disrupted:

clip_image002

Once the activation was complete with the activating status as Ready:

clip_image004

… I experienced the same situation as I did with updating fabric B where UCSM would display an error indicating that IOM 1 on Chassis 1 was missing:

image

clip_image008

clip_image010

What was interesting this time, even though it does make sense since Fabric A is the primary, is that an additional IOM, IOM 2 on Chassis 2, was also indicated as failed/missing:

clip_image012

The status gradually switched from red to orange, then to yellow and finally green. When the update for Fabric A was finally completed, I noticed that it was now the subordinate:

clip_image014

The whole exercise of updating the firmware took more than an hour and a half to complete, so remember to allow enough time to complete these updates in the future.

Problems with TFTP and FTP for UCS Firmware Update

Here’s an internal email I sent out to our UCS guys when I completed my first firmware update from 1.1(1l) to 1.2(1d):

I just wanted to give you a heads up that I had a lot of issues with having UCSM download an updated firmware package via TFTP or FTP. I was able to get the package downloaded via FTP at the end but thought I’d share the strange behavior I experienced in case you guys have to go through this exercise in the future.

Client’s UCS Firmware: 1.1(1l)

image

Latest UCS Firmware Bundle: 1.2(1d)

TFTP / FTP Software: 3CDaemon

TFTP Problem

As we all already know, TFTP does not require user credentials and it was the first method I tried but what I noticed was that UCSM would be able to connect to the TFTP server but would eventually fail. When reviewing the logs on 3CDaemon, I would see the following:

clip_image004

Looking deeper into the debugging logs of 3CDaemon, we will find the following entries:

May 31, 2010 15:20:13 Session 4, Peer 10.20.60.51 Retry. Block = 1. Retries left = 9

May 31, 2010 15:20:13 Session 2, Peer 10.20.60.51 Retry. Block = 1. Retries left = 7

May 31, 2010 15:20:13 Session 3, Peer 10.20.60.51 Retry. Block = 1. Retries left = 8

May 31, 2010 15:20:13 Session 4, Peer 10.20.60.51 TFTP: Thread: 648 Retry. Block = 1. Retries left = 9

May 31, 2010 15:20:13 Session 2, Peer 10.20.60.51 TFTP: Thread: 424 Retry. Block = 1. Retries left = 7

May 31, 2010 15:20:13 Session 3, Peer 10.20.60.51 TFTP: Thread: 524 Retry. Block = 1. Retries left = 8

May 31, 2010 15:20:13 Session 5, Peer 10.20.60.51 Client requests GET of file C:\UCS\ucs-k9-bundle.1.2.1d.bin.

May 31, 2010 15:20:13 Session 5, Peer 10.20.60.51 TFTP: Thread 716, Sent 512 bytes, Block Number = 1

May 31, 2010 15:20:13 Session 1, Peer 10.20.60.51 Retry. Block = 1. Retries left = 6

May 31, 2010 15:20:13 Session 1, Peer 10.20.60.51 TFTP: Thread: 380 Retry. Block = 1. Retries left = 6

May 31, 2010 15:20:18 Session 3, Peer 10.20.60.51 Retry. Block = 1. Retries left = 7

May 31, 2010 15:20:18 Session 5, Peer 10.20.60.51 Retry. Block = 1. Retries left = 9

May 31, 2010 15:20:18 Session 2, Peer 10.20.60.51 Retry. Block = 1. Retries left = 6

May 31, 2010 15:20:18 Session 4, Peer 10.20.60.51 Retry. Block = 1. Retries left = 8

May 31, 2010 15:20:18 Session 3, Peer 10.20.60.51 TFTP: Thread: 524 Retry. Block = 1. Retries left = 7

May 31, 2010 15:20:18 Session 5, Peer 10.20.60.51 TFTP: Thread: 716 Retry. Block = 1. Retries left = 9

May 31, 2010 15:20:18 Session 2, Peer 10.20.60.51 TFTP: Thread: 424 Retry. Block = 1. Retries left = 6

May 31, 2010 15:20:18 Session 4, Peer 10.20.60.51 TFTP: Thread: 648 Retry. Block = 1. Retries left = 8

May 31, 2010 15:20:18 Session 1, Peer 10.20.60.51 Retry. Block = 1. Retries left = 5

May 31, 2010 15:20:18 Session 1, Peer 10.20.60.51 TFTP: Thread: 380 Retry. Block = 1. Retries left = 5

I’ve tried cancelling the download job in UCSM and restarting probably 10 times after changing Windows security permissions and folder locations without any luck. What’s strange is that I had no issues downloading the file with Windows’ TFTP client. Seeing how I haven’t made any progress, I went ahead and moved onto FTP.
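
To make sense of a longer debug log, the retry lines can be summarized per session. This is a rough sketch that assumes the 3CDaemon log has been saved as text in the same format as the entries above. The regular expression and function name are my own.

```python
import re

# Matches both forms of the retry entry (with and without "TFTP: Thread: N").
RETRY = re.compile(
    r"Session (\d+), Peer (\S+) (?:TFTP: Thread: \d+ )?"
    r"Retry\. Block = (\d+)\. Retries left = (\d+)"
)


def summarize_retries(lines):
    """Map session number -> (set of blocks retried, lowest 'Retries left' seen)."""
    summary = {}
    for line in lines:
        match = RETRY.search(line)
        if not match:
            continue  # e.g. the "Client requests GET" and "Sent 512 bytes" entries
        session = int(match.group(1))
        block = int(match.group(3))
        left = int(match.group(4))
        blocks, lowest = summary.get(session, (set(), left))
        blocks.add(block)
        summary[session] = (blocks, min(lowest, left))
    return summary
```

For the log above, every session only ever retries block 1 while the retry counter counts down. That suggests the server kept resending the first 512 bytes without ever seeing an acknowledgement from UCSM, although I can’t say from the log alone why that was.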

FTP Problem

As soon as I moved onto trying to download via FTP, I noticed that the connection would continuously say Session closed by peer.

clip_image006

Reviewing the 3CDaemon logs every time it logs the status above would indicate something different. What’s strange is that there are times when UCSM’s firmware download status would stay @ 2% and not error out. I tried anonymous access but wasn’t able to get that going either. What I ended up doing was redo-ing the firmware download process in UCSM again, saw the Session closed by peer message, then left the computer for a few hours. To my surprise, the package downloaded when I got back to the computer.

Unfortunately, I’m not exactly sure if some of my changes fixed some problems that were preventing the package from being downloaded. With that being said, one of the things I’m sure of is that the download process gets stuck at 2% for a long time before bytes get transferred, so I would say give it at least 30 minutes before terminating the session if you suspect something is wrong.

VMware Site Recovery Manager Installation – “Failed to register service”

Ran into an issue with installing Site Recovery Manager where after an unsuccessful installation, the wizard rolls back but does not unregister the service in Windows.

image

As shown in the screenshot above, the service was registered and the unsuccessful install left it set to “Disabled”, but the rollback did not unregister it. This leads me to suspect that the installation basically goes something like this:

1. Install binaries

2. Register service and set it to Disabled

3. Set service to automatic

4. Start service

I think due to the way in which my installation failed, the rollback did not remove the service and prior to realizing this, I began the install again only to be prompted with: “Failed to register service” near the end of the installation.

To solve this issue, all I needed to do was go into the registry and remove the service:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\vmware-dr

image

Once I removed the “vmware-dr” registry key and kicked off the install again, the installation was successful:

image
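
Before kicking off a reinstall after a failed attempt, it’s easy to check whether the leftover service key is still there. A minimal sketch using Python’s standard winreg module follows. It is Windows only, the function names are mine, and it only looks for the key rather than deleting anything.

```python
def service_key_path(service):
    """Registry path (under HKEY_LOCAL_MACHINE) of a service's key."""
    return r"SYSTEM\CurrentControlSet\services" + "\\" + service


def service_is_registered(service):
    """True if the service's registry key exists (Windows only)."""
    import winreg  # imported here so the path helper works on any platform

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, service_key_path(service)):
            return True
    except FileNotFoundError:
        return False
```

If service_is_registered("vmware-dr") comes back True on a server where SRM isn’t installed, the rollback left the service behind and the next install will likely hit the same “Failed to register service” prompt.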

Site recovery manager version: VMware-srm-4.0.0-192921

Windows version: Windows 2008 R2 64-bit Standard

Friday, July 16, 2010

Installing VMware Site Recovery Manager 4.0.0-192921 on Windows Server 2008 64-bit R2: Hangs on “Please wait while setup is initializing…”

I ran into an interesting issue while installing VMware Site Recovery Manager version 4.0.0-192921 on Windows Server 2008 64-bit R2 where it would hang on “Please wait while setup is initializing…”. The setup was left for at least 10 minutes without intervention but the progress bar didn’t move:

image

image

I was able to see the “msiexec” process in task manager but didn’t see any CPU utilization so I decided to choose “End Process Tree”.

image

After rebooting the server once and re-trying a few times without any luck, I decided to try to right-click and choose “Run as administrator”.

image

image

This finally worked and we were able to install SRM at a DR site:

image