Wednesday, September 29, 2010

OCS 2007 R2: “Some calls to and from people outside of your corporate network may not connect due to server connectivity problems… ” message is intermittently displayed to MOC users but calls continue to work

We had a client out West who had been undergoing network infrastructure changes, and after completing the majority of the project, they began to notice that an error message was intermittently displayed when users called other remote users connected through the Edge server with MOC. The error message reads:

Some calls to and from people outside of your corporate network may not connect due to server connectivity problems. Try signing out and signing back in. If this problem continues, contact your system administrator with this information.

[screenshot]

Since this appeared to affect only calls from the internal network going out through the Edge server, I went straight to the front-end server and ran the following validation tests:

Front-End Server Validation:

[screenshot]

Web Conferencing Server Validation:

[screenshot]

A/V Conferencing Server Validation:

[screenshot]

To rule out a connectivity issue, I ran a simple telnet from the front-end server to the Edge server on port 5062 and the connection established. I also checked the static persistent routes on the Edge server to ensure the appropriate routes for the 172.x and 10.x subnets were in place.

[screenshot]

All of these errors point to a mismatch between what the front-end server expects and what is presented. In this client’s case, it’s because the internal interface on the Edge server was using the public certificate with the name sip.companyName.com, while the internal front-end server initiates connections to the Edge server using the name “xxx-edge-01.someDomain.local”.
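One quick way to see which name a certificate will present is to inspect its subject with openssl. A sketch only: the file names below are throwaway, and xxx-edge-01.someDomain.local simply stands in for the internal Edge FQDN from this example.

```shell
# Generate a throwaway self-signed cert with the internal Edge name as its CN (demo only)
openssl req -x509 -newkey rsa:2048 -nodes -keyout edge.key -out edge.crt \
  -days 1 -subj "/CN=xxx-edge-01.someDomain.local" 2>/dev/null

# The subject CN is the name the interface would present to the front-end server
openssl x509 -noout -subject -in edge.crt

# Against a live server, something like this shows the cert actually being served
# (5062 being the port tested with telnet above):
#   openssl s_client -connect xxx-edge-01.someDomain.local:5062 </dev/null 2>/dev/null \
#     | openssl x509 -noout -subject
```

If the CN (or a SAN entry) doesn’t match the FQDN the front-end uses, the TLS connection fails exactly as described above.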

[screenshot]

To confirm that the Edge server is indeed presenting the wrong certificate, we can look at which certificate is assigned, as shown here in the properties:

[screenshot]

I went ahead and double-checked the sip.companyName.com certificates, and none of them have a SAN entry for the internal name. This means the internal interface should instead use an internally issued certificate with the proper internal Edge server name. The screenshot above shows that such a certificate actually exists, but it’s not assigned.

Resolution

The resolution is simple: all we need to do is assign the internal certificate. I verified the expiry date on the certificate to confirm it hadn’t expired.

[screenshots]

Once I changed the certificate, the error in the validation tests went away and users stopped receiving the error message.

Tuesday, September 28, 2010

OCS 2007 R2 Voice Properties’ Location Profile unusable due to invalid translation rule

I ran into an interesting issue a few months back when a client with OCS 2007 R2 called our support services because all of the translation rules in his location profile had simply stopped working. The problem was eventually escalated to me, so I logged into the client’s network remotely and started looking around. It’s been more than 4 months, so I can’t remember everything I looked at before arriving at the solution. Long story short, I eventually opened up Enterprise Voice Route Helper:

[screenshot]

…and imported the routing data:

[screenshot]

What I was basically looking for was a normalization rule at the top of the rules list that somehow caught all cases. I wasn’t able to spot anything out of the ordinary while scrolling down the list, so I tried to perform an ad-hoc test with my cell number, and this was where I got lucky, because I got the following message in the results:

The specified profile contains one or more rules that were invalid.

The profile cannot be used.

[screenshot]

As shown in the screenshot above, a translation rule had been entered into the profile with improper syntax, rendering the whole location profile unusable. This was an obvious typo the client had made, and I found it very interesting that the management console actually accepted it. Everything started working once I fixed the missing bracket.
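OCS translation rules are regular expressions, so the same failure mode can be sketched with any regex engine. Here it is with grep -E, using a made-up ten-digit pattern, not the client’s actual rule:

```shell
# A well-formed pattern compiles and matches
echo 4415551234 | grep -Eq '^([0-9]{10})$' && echo "rule OK"

# Drop the closing bracket and the engine rejects the whole pattern,
# much as the one bad rule invalidated the entire location profile
echo 4415551234 | grep -Eq '^([0-9]{10$' 2>/dev/null || echo "invalid rule rejected"
```

The difference is that grep fails loudly at run time, whereas the OCS management console accepted the broken rule silently.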

Monday, September 27, 2010

How to forcefully reassign assigned disks on a NetApp Filer

We had to do some emergency maintenance on a new NetApp shelf a few days ago and found that, because we were working with an older version of the firmware, some commands did not exist on the filer, so we ended up using a combination of the GUI and CLI to work around the problem.

Note: Some of the commands may not be necessary, but I wanted to list all the steps we had to take to get this to work, so I’ve highlighted the possibly unneeded steps in RED. Please also note that we had a small window to work with, so the instructions below may not abide by best practices.

Task: Disks in the new shelf have been assigned to 2 separate controllers (6 each).

Problem: We could not find a command to unassign the disks in this version of the firmware, so we had to use a combination of the GUI and CLI to remove the disks from one controller and then reassign them to the other.

NetApp Information

Filer: Some Name

Model: FAS2020

Version: 7.2.6.1

---------------------------------------------------------------------------------------------------------------------------------------------

Initiating a disk show shows the following:

login as: root
root@172.20.1.131's password:

FAC01> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 4AD7XX3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 4AD7XX4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 4AD7XX8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 4AD7XXC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 4AD7XXEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 4NE20JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 4NE20EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 4NE20FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 4AD7XXHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.22 FAC01 (135048291) Pool0 XX-XXXXX8095587
0b.21 FAC01 (135048291) Pool0 XX-XXXXX7977856
0b.20 FAC01 (135048291) Pool0 XX-XXXXX8039600
0b.24 FAC01 (135048291) Pool0 XX-XXXXX8039114
0b.17 FAC01 (135048291) Pool0 XX-XXXXX8015493
0b.29 FAC02 (135048330) Pool0 XX-XXXXX8095049
0b.23 FAC02 (135048330) Pool0 XX-XXXXX8039807
0b.19 FAC01 (135048291) Pool0 XX-XXXXX7977641
0b.27 FAC02 (135048330) Pool0 XX-XXXXX8039645
0b.28 FAC01 (135048291) Pool0 XX-XXXXX7978194
0b.26 FAC01 (135048291) Pool0 XX-XXXXX8039342
0b.25 FAC02 (135048330) Pool0 XX-XXXXX8039954
0b.16 FAC01 (135048291) Pool0 XX-XXXXX7977866
FAC01>

[screenshot]

What we want to do is move disks 0b.17, 0b.19 and 0b.21 off of FAC01 to FAC02.

A disk ? shows the following:

FAC01> disk ?
usage: disk <options>
Options are:
fail [-i] [-f] <disk_name> - fail a file system disk
remove [-w] <disk_name> - remove a spare disk
swap - prepare (quiet) bus for swap
unswap - undo disk swap and resume service
scrub { start | stop } - start or stop disk scrubbing
assign {<disk_name> | all | -n <count> | auto} [-p <pool>] [-o <ownername>] [-s <sysid>] [-c block|zoned] [-f] - assign a disk to a filer or all unowned disks by specifying "all" or <count> number of unowned disks
show [-o <ownername> | -s <sysid> | -n | -v | -a] - lists disks and owners
replace {start [-f] [-m] <disk_name> <spare_disk_name>} | {stop <disk_name>} - replace a file system disk with a spare disk or stop replacing
zero spares - Zero all spare disks
checksum {<disk_name> | all} [-c block | zoned]
sanitize { start | abort | status | release } - sanitize one or more disks
maint { start | abort | status | list } - run maintenance tests on one or more disks
FAC01>

We tried the remove option as well as replace, but found that, just as the descriptions specify, these commands expect you to be moving spares around. As we were pressed for time to get the disks reassigned, we went into the GUI to offline these disks by setting them to remove:

[screenshots]

After removing these 3 disks in the GUI, disk show now displays the following:

FAC01> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 4AD7XX3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 4AD7XX4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 4AD7XX8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 4AD7XXC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 4AD7XXEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 4NE20JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 4NE20EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 4NE20FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 4AD7XXHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.22 FAC01 (135048291) Pool0 XX-XXXXX8095587
0b.21 FAC01 (135048291) FAILED XX-XXXXX7977856
0b.20 FAC01 (135048291) Pool0 XX-XXXXX8039600
0b.24 FAC01 (135048291) Pool0 XX-XXXXX8039114
0b.17 FAC01 (135048291) FAILED XX-XXXXX8015493
0b.29 FAC02 (135048330) Pool0 XX-XXXXX8095049
0b.23 FAC02 (135048330) Pool0 XX-XXXXX8039807
0b.19 FAC01 (135048291) FAILED XX-XXXXX7977641
0b.27 FAC02 (135048330) Pool0 XX-XXXXX8039645
0b.28 FAC01 (135048291) Pool0 XX-XXXXX7978194
0b.26 FAC01 (135048291) Pool0 XX-XXXXX8039342
0b.25 FAC02 (135048330) Pool0 XX-XXXXX8039954
0b.16 FAC01 (135048291) Pool0 XX-XXXXX7977866
FAC01>

Since the disk reassign command can only be run in maintenance mode or during takeover in advanced mode, we used the disk remove_ownership command instead. Before we could execute it, we needed to elevate our privileges to advanced:

FAC01> priv set advanced
Warning: These advanced commands are potentially dangerous; use
them only when directed to do so by Network Appliance
personnel.
FAC01*>

Then we executed:

FAC01*> disk remove_ownership 0b.17
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*> disk remove_ownership 0b.19
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*> disk remove_ownership 0b.21
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*>

The following is the output of disk show after the above commands completed:

FAC01> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 4AD7XX3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 4AD7XX4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 4AD7XX8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 4AD7XXC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 4AD7XXEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 4NE20JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 4NE20EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 4NE20FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 4AD7XXHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.22 FAC01 (135048291) Pool0 XX-XXXXX8095587
0b.21 FAC01 (135048291) FAILED XX-XXXXX7977856
0b.20 FAC01 (135048291) Pool0 XX-XXXXX8039600
0b.24 FAC01 (135048291) Pool0 XX-XXXXX8039114
0b.17 FAC01 (135048291) FAILED XX-XXXXX8015493
0b.29 FAC02 (135048330) Pool0 XX-XXXXX8095049
0b.23 FAC02 (135048330) Pool0 XX-XXXXX8039807
0b.19 FAC01 (135048291) FAILED XX-XXXXX7977641
0b.27 FAC02 (135048330) Pool0 XX-XXXXX8039645
0b.28 FAC01 (135048291) Pool0 XX-XXXXX7978194
0b.26 FAC01 (135048291) Pool0 XX-XXXXX8039342
0b.25 FAC02 (135048330) Pool0 XX-XXXXX8039954
0b.16 FAC01 (135048291) Pool0 XX-XXXXX7977866
FAC01>

It almost looks like nothing was changed and here’s why:

FAC01*> disk remove_ownership 0b.17
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

Notice that the message indicates disk.auto_assign is turned on, so in order to have these disks remain unassigned, we needed to execute the following on both controllers:

FAC01*> options disk.auto_assign off
You are changing option disk.auto_assign which applies to both members of
the cluster in takeover mode.
This value must be the same in both cluster members prior to any takeover
or giveback, or that next takeover/giveback may not work correctly.
Sun Sep 26 21:25:53 EST [PHMSFAC01: reg.options.cf.change:warning]: Option disk.auto_assign changed on one cluster node.
FAC01*>

FAC02*> options disk.auto_assign off
You are changing option disk.auto_assign which applies to both members of
the cluster in takeover mode.
This value must be the same in both cluster members prior to any takeover
or giveback, or that next takeover/giveback may not work correctly.
Sun Sep 26 21:25:53 EST [PHMSFAC01: reg.options.cf.change:warning]: Option disk.auto_assign changed on one cluster node.
FAC02*>

Now, in order to reassign these disks, we had to unfail them with the command:

disk unfail <disk_name>

disk unfail 0b.17

disk unfail 0b.19

disk unfail 0b.21

Here’s what the SSH session looks like:

FAC01*> disk unfail 0b.17
disk unfail: unfailing disk 0b.17...
FAC01*> Sun Sep 26 21:28:16 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.17 Shelf 1 Bay 1 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV8015493] was unfailed, and is now being reassimilated
disk unfail 0b.19
disk unfail: unfailing disk 0b.19...
FAC01*> Sun Sep 26 21:28:23 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.19 Shelf 1 Bay 3 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV7977641] was unfailed, and is now being reassimilated
disk unfail 0b.21
disk unfail: unfailing disk 0b.21...
FAC01*> Sun Sep 26 21:28:27 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.21 Shelf 1 Bay 5 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV7977856] was unfailed, and is now being reassimilated

Now that we’ve unfailed the disks as well as turned off disk.auto_assign, we can execute the remove_ownership command again:

FAC01*> disk remove_ownership 0b.17
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y
FAC01*> disk remove_ownership 0b.19
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y
FAC01*> disk remove_ownership 0b.21
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y
FAC01*> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 3QP0YV3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 3QP0YV4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 3QP0YV8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 3QP0YVC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 3QP0YVEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 3QP10JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 3QP10EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 3QP10FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 3QP0YVHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 WD-WMATV8038093
0b.22 FAC01 (135048291) Pool0 WD-WMATV8095587
0b.20 FAC01 (135048291) Pool0 WD-WMATV8039600
0b.24 FAC01 (135048291) Pool0 WD-WMATV8039114
0b.29 FAC02 (135048330) Pool0 WD-WMATV8095049
0b.23 FAC02 (135048330) Pool0 WD-WMATV8039807
0b.27 FAC02 (135048330) Pool0 WD-WMATV8039645
0b.28 FAC01 (135048291) Pool0 WD-WMATV7978194
0b.26 FAC01 (135048291) Pool0 WD-WMATV8039342
0b.25 FAC02 (135048330) Pool0 WD-WMATV8039954
0b.16 FAC01 (135048291) Pool0 WD-WMATV7977866
NOTE: Currently 3 disks are unowned. Use 'disk show -n' for additional information.
FAC01*>

Notice how 3 disks are now stated as being unowned. The final step is to hop over to the controller that you want to assign the disks to and execute the following:

FAC02*> disk assign 0b.17
disk assign: Assign failed for one or more disks in the disk list.
FAC02*> disk assign 0b.17
Sun Sep 26 21:39:34 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.17 (S/N WD-WMATV8015493) from unowned (ID -1) to FAC02 (ID 135048330)
FAC02*> disk assign 0b.19
Sun Sep 26 21:39:39 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.19 (S/N WD-WMATV7977641) from unowned (ID -1) to FAC02 (ID 135048330)
FAC02*> disk assign 0b.21
Sun Sep 26 21:39:43 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.21 (S/N WD-WMATV7977856) from unowned (ID -1) to FAC02 (ID 135048330)
FAC02*>

A disk show now shows the 3 disks being assigned to the other active controller.
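Condensed, the whole procedure we ended up with looks like this (Data ONTAP 7.2 CLI; the disk name is the one from this example, and the GUI “remove” step precedes it):

```
priv set advanced                 # elevate privileges
options disk.auto_assign off      # on BOTH controllers, or disks get re-grabbed
disk unfail 0b.17                 # for each disk failed via the GUI remove
disk remove_ownership 0b.17       # disk becomes unowned
                                  # then, on the target controller:
disk assign 0b.17
```

Repeat the unfail/remove_ownership/assign steps for each disk being moved.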

Apologies for the extra steps I included; they may or may not be required to change assigned disks from one controller to the other.

Reminder: vCenter 4.1 uses 64-bit DSN | vCenter 4.0 uses 32-bit DSN

While performing some troubleshooting with a client a few weeks ago on their vCenter 4.1 server, I learned that vCenter 4.1 actually uses a 64-bit DSN. This means we no longer have to:

  1. Install the 32-bit client via Microsoft’s downloads
  2. Run 32-bit ODBC via the WoW64 folder
  3. Create a 32-bit DSN

…which you had to do for a vCenter 4.0 install.
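For reference, on 64-bit Windows the two ODBC Administrators live at standard paths (worth double-checking on your particular build):

```
C:\Windows\SysWOW64\odbcad32.exe    <- 32-bit ODBC Data Source Administrator (vCenter 4.0 / VUM)
C:\Windows\System32\odbcad32.exe    <- 64-bit ODBC Data Source Administrator (vCenter 4.1)
```

Counter-intuitively, the 32-bit tool is the one under SysWOW64.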

The following is the difference between the two versions:

vCenter 4.0

Database Options

Select an ODBC data source for vCenter Server.

vCenter Server requires a database.

Use an existing supported database

Data Source Name (DSN): (Please create a 32-bit system DSN)

[screenshots]

vCenter 4.1

Database Options

Select an ODBC data source for vCenter Server.

vCenter Server requires a database.

Use an existing supported database

Data Source Name (DSN): (Please create a 64-bit system DSN)

[screenshots]

Now I’m just waiting for VUM (Update Manager) 4.1 to support a 64-bit ODBC DSN, because it’s currently still required to be installed on a 64-bit operating system yet use a 32-bit DSN.

Update

The following might be helpful when configuring the 64-bit DSN:

[screenshot]

Make sure you select SQL Server Native Client 10.0 for the System DSN and not the regular SQL Server driver, or the vCenter installation wizard won’t detect it (and you can’t just hit the back button and then forward again to see the DSN).

Sunday, September 26, 2010

Example with NetApp for realistic expectations of raw and usable capacity

Warning: I’m not a SAN expert, but as I’ve gotten more opportunities to work on datacenter projects, I’m beginning to see more real-world SAN implementations. While this doesn’t provide a complete breakdown of what to consider when calculating raw and usable storage, I hope it will at the very least provide some useful real-world numbers to professionals provisioning SAN storage.

Configuration

Brand: NetApp

Model: FAS2020

Version: 7.2.6.1

RAID: RAID_DP (Double Parity)

RAID Size: 16

Number of Disks: 6

Disk Size: 300GB SAS

Actual usable disk size: 266GB

Total Aggregate Capacity: 908GB

As shown with the information listed above, configuring a FAS2020 with 6 x 300GB SAS drives realistically yields only 908GB for the aggregate. Working out the numbers we can see that:

Specifications on paper: 300GB x 6 disks = 1.8TB

Actual drive capacity: 266GB x 6 disks = 1.596TB

Actual usable *aggregate* capacity after RAID_DP: 908GB

If we divide the numbers to get the amount of storage space lost to overhead such as RAID, we’re actually losing approximately 51% of drive space. This 51% also does not include the spare disk you’ll need per controller (in an active/active setup, that means one spare for each controller, i.e. 2 disks). Also, don’t forget that the controller’s software sits in aggregate 0 on the NetApp, which takes up additional space. As of 2009, the NetApp technician told me that a minimum of 10GB is required for the root volume and 20GB is recommended on the FAS2020.

Lastly, as new volumes are created for LUNs, each volume needs more space than the LUN itself, because you will need extra space if you decide to use snapshots. Best practice, as told by the NetApp engineer, is to allow 2x + delta (x being the size of the LUN). This covers the worst case: a snapshot is taken of a completely full LUN, all of its data is deleted, and the LUN is filled back up with entirely different data; with 2x + delta, the snapshot can still hold all the information from before the deletion. That said, as most companies don’t like to lose so much storage, another common practice is to use 1x + delta.
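As a quick worked example of the sizing rules above, take a hypothetical 100GB LUN with a 10GB delta (the numbers are illustrative only, not from this client):

```shell
awk 'BEGIN {
  lun = 100; delta = 10                                   # GB, hypothetical values
  printf "2x + delta volume: %d GB\n", 2*lun + delta      # conservative best practice -> 210 GB
  printf "1x + delta volume: %d GB\n", lun + delta        # space-saving compromise   -> 110 GB
}'
```

So a single 100GB LUN can consume anywhere from 110GB to 210GB of aggregate space depending on which practice you follow.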

Thinking about all the factors that trade storage for redundancy sometimes scares me, so I find it all the more important to communicate all the variables to customers and set their expectations appropriately.

-------------------------------------------------------------------------------------------------------------------------------------------------------

The following is another example similar to the configuration above but with 1TB drives:

Configuration

Brand: NetApp

Model: FAS2020

Version: 7.2.6.1

RAID: RAID_DP (Double Parity)

RAID Size: 6

Number of Disks: 6

Disk Size: 1TB SAS

Actual usable disk size: 828GB

Total Aggregate Capacity: 2.76TB

As shown with the information listed above, configuring a FAS2020 with 6 x 1TB SAS drives realistically yields only 2.76TB for the aggregate. Working out the numbers we can see that:

Specifications on paper: 1TB x 6 disks = 6TB

Actual drive capacity: 828GB x 6 disks = 4.968TB

Actual usable *aggregate* capacity after RAID_DP: 2.76TB

If we divide the numbers to get the amount of storage space lost to overhead such as RAID, we’re actually losing approximately 54% of drive space. As indicated in the example above, this 54% also does not include other variables that contribute to further storage loss.

-------------------------------------------------------------------------------------------------------------------------------------------------------

Here’s another example similar to the first one but with 4 disks instead:

Configuration

Brand: NetApp

Model: FAS2020

Version: 7.2.6.1

RAID: RAID_DP (Double Parity)

RAID Size: 16

Number of Disks: 4

Disk Size: 300GB SAS

Actual usable disk size: 266GB

Total Aggregate Capacity: 454GB

As shown with the information listed above, configuring a FAS2020 with 4 x 300GB SAS drives realistically yields only 454GB for the aggregate. Working out the numbers we can see that:

Specifications on paper: 300GB x 4 disks = 1.2TB

Actual drive capacity: 266GB x 4 disks = 1.064TB

Actual usable *aggregate* capacity after RAID_DP: 454GB

If we divide the numbers to get the amount of storage space lost to overhead such as RAID, we’re actually losing approximately 63% of drive space. Again, this does not include the other contributing factors that decrease usable storage even further.
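The three overhead figures above can be reproduced in one place. Note they line up with the post’s percentages if 1TB is taken as 1024GB; this is just a back-of-the-envelope sketch, not NetApp’s own sizing math:

```shell
# columns: usable-aggregate-GB  raw-spec-GB (TB converted at 1024 GB/TB)
for pair in "908 1843.2" "2826.24 6144" "454 1228.8"; do
  echo "$pair" | awk '{ printf "lost to overhead: %.0f%%\n", (1 - $1/$2) * 100 }'
done
# prints 51%, 54%, 63% for the three configurations above
```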

-------------------------------------------------------------------------------------------------------------------------------------------------------

Extras

Random notes I took while troubleshooting with the NetApp engineer: there are ways to reclaim space, such as trimming snapshots on your aggregates and volumes, reducing fractional reserve, and reducing snapshot schedules’ frequency, but they all reduce redundancy. Also, make sure a 1-LUN-per-volume mapping is followed, so that if a volume ever goes down, not all of your LUNs go with it. Lastly, make sure snapshot autodelete is turned on: if no space is left for snapshots, the NetApp will delete the oldest one and take the new snapshot. If this is not turned on and space reservation is off, the LUN will go offline when the volume fills up.

My thoughts: Being a consultant means we’re obligated to pass on the truth about our knowledge to customers, and while this may be hard to digest for many clients, it’s important not to overlook why companies purchase SANs in the first place: they want robust storage that provides redundancy, exceptional recovery time, and performance, and storage companies design their solutions with this as their number 1 priority. I’ve been fortunate enough to attend a training session delivered by Peter Henneberry from NetApp, and it was quite the eye opener when he gave us real-world statistics on how backups can complete within seconds or minutes rather than hours. While I can’t state all the benefits of a SAN here, there are plenty.

I’m not much of a storage consultant, even though I’d like to get into it a bit more, so please forgive any mistakes in this post, whether in the calculations or in information I have missed.

hba / vmhba not showing up with ESXi 4.0 / 4.1

I ran into various problems while deploying a new server in an ESX cluster and I hope this post will help anyone who might run into the same problem.

Specifications:

Server: HP ProLiant DL360 G6

BIOS: ProLiant System BIOS – P64 (03/30/2010)

CPU: 2 x Intel CPU X5550 @ 2.67GHz

QPI: Speed: 6.4 GT/s

Onboard NIC: PCI Embedded HP NC382i PCIe DP Multifunction 1GB Adapter Port 1, 2 (IRQ 7, 11)

HBA (Host bus adapter): HP StorageWorks 42B PCI Fibre Channel Adapter IRQ 7, 11

Additional NIC: NC375T PCIe Quad Port Gigabit Server Adapter (IRQ: 7)

------------------------------------------------------------------------------------------------------------------------------------------

Aside from having incorrect memory ordered and some other issues, I was unable to get ESXi 4.0.0 to see the HP StorageWorks 42B HBA.

As shown in the screenshot below, ESXi 4.0.0 build 261974 could not see the vmhba after a fresh install.

[screenshot]

After a few Google searches turned up various posts about people who had problems installing 2 of the quad-port cards but got them going after upgrading the drivers or ESXi, I decided to try upgrading ESXi from 4.0.0 to 4.1.0 (http://terenceluk.blogspot.com/2010/09/updating-vsphere-esxi-from-40-to-41.html), assuming that the newer ESXi might have the right drivers for the HBA.

The upgrade went without a hitch, but I was still unable to see the HBA after getting ESXi to version 4.1.0, so I went back to the internet to do some searching and found that there are ways to install additional drivers for ESXi; coincidentally, you use the VMware vSphere CLI to do it (see my previous post for why this was coincidental). The next step was to find the HBA drivers. Reviewing the description in the BOM, I found:

StorageWorks 42B - Host bus adapter - PCI Express low profile - 4Gb Fibre Channel (SW) - fiber optic - 2 ports

… which led me to the following page on the HP site:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=3954646&prodTypeId=12169&prodSeriesId=1809835&swLang=13&taskId=135&swEnvOID=4040

[screenshot]

The driver I thought would work was the:

VMware and Host Connectivity Manager (HCM) Installation Kit

…so I went ahead and downloaded this package:

brocade_driver_esx40_v2-1-1-0.tar.gz

…from: http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=12169&prodSeriesId=1809835&prodNameId=3954646&swEnvOID=4040&swLang=13&mode=2&taskId=135&swItem=co-84517-1

[screenshot]

I then tried to install the package with the command:

C:\Program Files\VMware\VMware vSphere CLI\bin>vihostupdate.pl -server 172.20.70.13 -username root -install -bundle c:\brocade_driver_esx40_v2-1-1-0.tar.gz

…but got the following message:

Please wait patch installation is in progress …

Failed to download metadata.Error extracting metadata.zip from /tmp/updatecache/brocade_driver_esx40_v2-1-1-0.tar.gz: File is not a zip

[screenshot]

Seeing that the update doesn’t appear to expect a tar.gz file, I tried extracting the .tar.gz file into a folder and zipping it back up to try again:

[screenshots]

Unfortunately, that didn’t work either:

Please wait patch installation is in progress …

Failed to download metadata.Error extracting metadata.zip from /tmp/updatecache/brocade_driver_esx40_v2-1-1-0.zip: “There is no item named ‘metadata.zip’ in the archive”

[screenshot]

This was when I went back to the package I had downloaded for the ESXi 4.0.0-to-4.1.0 update and found that that package indeed contained a metadata.zip:

[screenshot]

Browsing around the package I downloaded from the HP site and not finding a metadata.zip file made me suspect that I probably downloaded some other package that was not meant for ESXi updates.

Realizing that perhaps I should first confirm that this HBA was supported, I went to look it up on VMware’s HCL:

http://www.vmware.com/resources/compatibility/search_results_ajax.php?action=search&deviceCategory=io&&&&&&&&&&&&partnerId%5B%5D=41&&&&sort=manufacturer~desc&&&ioTypeId=5&&startDisplayRow=50

[screenshot]

http://www.vmware.com/resources/compatibility/detail.php?device_cat=io&device_id=16075&release_id=24

[screenshot]


Based on what I found on the HCL, it does look like it’s supported, so I went back to the VMware downloads section to find drivers offered directly from that site. What I found was the following:

http://downloads.vmware.com/d/info/datacenter_downloads/vmware_vsphere_4/4#drivers_tools

[screenshot]

I was a bit confused as to which one to download, because I was not able to find a 4Gb Fibre Channel HBA option, and the only option close to it was:

VMware ESX/ESXi 4.X Driver CD for Brocade 8Gb Fibre Channel HBA

[screenshot]

Since I wouldn’t know unless I tried, I went ahead and downloaded that package:

http://downloads.vmware.com/d/details/esx_4x_brocade_bfa2111_dt/ZHcqYmRAZCpiZHdlZQ==

[screenshot]

Download: VMware ESX/ESXi 4.X Driver CD for Brocade 8Gb Fibre Channel HBA

Description: This driver CD release includes support for version 2.1.1.1 of the Brocade BFA driver on ESX/ESXi 4.0. This BFA driver supports products based on the Brocade 825, 815, 425, and 415 Fibre Channel host bus adapters (HBA).

Version: 2.1.1.1

Build Number: 285864

Release Date: 2010/08/23

Type: Drivers & Tools

Components: ESX/ESXi 4.X Brocade bfa 2.1.1.1 Driver (file type: iso)

The download gave me the following ISO package:

vmware-esx-drivers-scsi-bfa_400.2.1.1.1-1OEM.285864.iso

[screenshot]

I opened the package with WinRAR and found the following:

..

.rpm

doc

offline-bundle

drivers.xml

TRANS.TBL

[screenshot]

Looking into the offline-bundle folder, I found the following package:

BRCD-bfa-2.1.1.1-00000-offline_bundle-285864.zip

Drilling into that .zip package showed that it contains the following:

metadata.zip

vmware-esx-drivers-scsi-bfa-400.2.1.1.1-1OEM.x86_64.vib

[screenshot]

This was when I was sure this package was the one I wanted, so I went ahead and extracted the BRCD-bfa-2.1.1.1-00000-offline_bundle-285864.zip package, then went back to vSphere CLI to update the drivers with it:

C:\Program Files\VMware\VMware vSphere CLI\bin>vihostupdate.pl -server 172.20.70.13 -username root -install -bundle c:\vmware-esx-drivers-scsi-bfa_400.2.1.1.1-1OEM.285864\offline-bundle\BRCD-bfa-2.1.1.1-00000-offline_bundle-285864.zip

Enter password:

Please wait patch installation is in progress ...

The update completed successfully, but the system needs to be rebooted for the changes to be effective.

C:\Program Files\VMware\VMware vSphere CLI\bin>

image

Success!

Once the update completed, I fired up the VI Client, connected to the host, and navigated to the storage adapters section, and now I can see the vmhba listed!

image
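To double-check from the command line which bulletins actually made it onto the host, the same vihostupdate.pl script also has a --query switch that lists the installed bulletins. A small wrapper sketch (the query_bulletins function is my own, not part of the vSphere CLI; it assumes vihostupdate.pl is on the PATH, i.e. the vSphere CLI bin directory, and you will be prompted for the root password):

```shell
#!/bin/sh
# Sketch: list the bulletins installed on an ESXi host so we can confirm
# the Brocade driver bundle was actually applied. Assumes vihostupdate.pl
# (from the vSphere CLI) is on the PATH; prompts for the root password.
query_bulletins() {
  host="$1"
  vihostupdate.pl --server "$host" --username root --query
}

# Example: query_bulletins 172.20.70.13
```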

Now that the HBA showed up on the host, I knew there was one last step: making sure I had used the right driver, because once this goes into production, we won’t be able to do any more testing with it.

Within the storage adapters section, ESXi lists the adapters as Brocade-425/825:

image

… so I checked the Brocade site to see whether a cross-reference guide was available, and indeed there was:

http://www.brocade.com/downloads/documents/reference_guides/Brocade_HBA_Cross_Reference_01.pdf

image

As indicated in the guide in the above screenshot, the Brocade 425 actually corresponds to the HP StorageWorks 42B model.

--------------------------------------------------------------------------------------------------------------------------------------------------

I’m surprised that the ESXi 4.1.0 install package did not seem to load this driver during the install, so I hope this helps anyone out there who runs into this or a similar problem in the future.

Thursday, September 23, 2010

Updating vSphere ESXi from 4.0 to 4.1 with vSphere CLI

We’re currently in the process of refreshing a client’s VI3 environment to vSphere 4.1 and procured a new server to add to the existing cluster. While performing the ESXi install on the new server, I did not have an ESXi 4.1 CD available, so I installed 4.0 and figured I’d update it with the vSphere Host Update Utility. Those who have read the vSphere 4.1 release notes probably already know that you cannot upgrade a host from 4.0 to 4.1 with that utility, so this post shows how you can update the host with the vSphere CLI instead.

I started off by using the VMware vSphere Host Update Utility installed on my laptop to try to update the ESXi 4.0.0 build-261974 host.

image

image

As shown in the following screenshot, scanning a fully patched ESXi 4.0.0 won’t give you an option to upgrade the host to version 4.1.0.

image

As per the following release notes:

http://www.vmware.com/support/vsphere4/doc/vsp_esxi41_vc41_rel_notes.html

ESXi Upgrades

vSphere 4.1 offers the following tools for upgrading ESXi hosts:

VMware vCenter Update Manager. vSphere module that supports direct upgrades from ESXi 3.5 and ESXi 4.0 to ESXi 4.1. See the vCenter Update Manager Installation and Administration Guide.

vihostupdate. Command-line utility that supports direct upgrades from ESXi 4.0 to ESXi 4.1. This utility requires the vSphere CLI. See the vSphere Upgrade Guide.

image

Reviewing the upgrade guide at: http://www.vmware.com/pdf/vsphere4/r41/vsp_41_upgrade_guide.pdf shows that we need to download and install the vSphere CLI client.

image

So I went ahead to download the client from http://www.vmware.com/downloads:

(The build I downloaded for this upgrade was: VMware-vSphere-CLI-4.1.0-254719.exe)

image

… and began installing it:

image

image

image

image

image

image

image

The following screen took extremely long to finish. I remember not having this issue on my last deployment, when I installed it on a server, so my guess is that some other application on my laptop caused the delay.

image

Once the installation completed, I went ahead and downloaded the upgrade package. Make sure you download the upgrade package as a ZIP file and not the regular installable ISO, as the latter will not let you upgrade the host with the vSphere CLI.

image

While the package downloads, we can use the waiting time to put the host into maintenance mode:

image
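If you prefer the command line over the vSphere Client for this step, the vSphere CLI also ships a vicfg-hostops.pl script that can put a host into maintenance mode. A sketch (enter_maintenance is my own wrapper function; all VMs on the host must be powered off or migrated first):

```shell
#!/bin/sh
# Sketch: enter maintenance mode from the vSphere CLI instead of the GUI.
# vicfg-hostops.pl ships with the vSphere CLI; --operation enter moves the
# host into maintenance mode.
enter_maintenance() {
  host="$1"
  vicfg-hostops.pl --server "$host" --username root --operation enter
}

# Example: enter_maintenance 172.20.70.13
```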

Once you’ve downloaded the zip package, DO NOT uncompress it. Simply place it into a directory of your choice and then open up the VMware vSphere CLI.
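Since vihostupdate.pl expects the bundle as an intact .zip archive, a small hypothetical pre-flight check can catch an accidentally extracted path before you run the update (check_bundle is my own helper, not part of the vSphere CLI):

```shell
#!/bin/sh
# Hypothetical pre-flight check: vihostupdate.pl wants the bundle as a
# .zip archive, not an extracted directory. Verify the path is a regular
# file that begins with the ZIP magic bytes ("PK") before kicking off
# the update.
check_bundle() {
  bundle="$1"
  if [ ! -f "$bundle" ]; then
    echo "not a regular file (did you unzip it?): $bundle"
    return 1
  fi
  if [ "$(head -c 2 "$bundle")" != "PK" ]; then
    echo "not a ZIP archive: $bundle"
    return 1
  fi
  echo "bundle looks OK: $bundle"
}
```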

image

C:\Program Files\VMware\VMware vSphere CLI>vihostupdate

'vihostupdate' is not recognized as an internal or external command,

operable program or batch file.

C:\Program Files\VMware\VMware vSphere CLI>dir

Volume in drive C has no label.

Volume Serial Number is 4802-7E84

Directory of C:\Program Files\VMware\VMware vSphere CLI

09/23/2010 06:50 AM <DIR> .

09/23/2010 06:50 AM <DIR> ..

09/23/2010 06:50 AM <DIR> bin

09/23/2010 07:03 AM <DIR> Perl

09/23/2010 06:49 AM <DIR> PPM

0 File(s) 0 bytes

5 Dir(s) 7,619,219,456 bytes free

C:\Program Files\VMware\VMware vSphere CLI>cd bin

C:\Program Files\VMware\VMware vSphere CLI\bin>

As shown in the output above, the vihostupdate.pl script actually lives in the C:\Program Files\VMware\VMware vSphere CLI\bin directory.

image

In the screenshot above, I actually made two mistakes. The first was running vihostupdate without the .pl extension.

The 2nd mistake is shown in the screenshot below:

image

I originally unzipped the package because I thought vihostupdate.pl was supposed to be run against a directory, when it actually expects a zip package. The following is the output, including the error you get if you specify a directory:


C:\Program Files\VMware\VMware vSphere CLI\bin>vihostupdate -server 172.20.70.13 -i -b c:\upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release\

'vihostupdate' is not recognized as an internal or external command, operable program or batch file.

C:\Program Files\VMware\VMware vSphere CLI\bin>vihostupdate.pl -server 172.20.70.13 -i -b c:\upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release\

Enter username: root

Enter password:

Please wait patch installation is in progress ...

Invalid bundle ZIP archive, or missing metadata.zip inside.Bundle.zip [/tmp/updatecache/upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release]: File /tmp/updatecache/upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release is too small to be a .zip file

C:\Program Files\VMware\VMware vSphere CLI\bin>vihostupdate.pl -server 172.20.70.13 -i -b c:\upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip

Once I specified the zip package instead, the update proceeded and completed properly:

image

C:\Program Files\VMware\VMware vSphere CLI\bin>vihostupdate.pl -server 172.20.70.13 -i -b c:\upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip

Enter username: root

Enter password:

Please wait patch installation is in progress ...

The update completed successfully, but the system needs to be rebooted for the changes to be effective.

C:\Program Files\VMware\VMware vSphere CLI\bin>

Reviewing what is displayed in the vSphere Client, we can see that the version is still 4.0.0, so all we need to do now is reboot the server.

image

Note: Notice the Recent tasks below? Those are from the host update utility’s scans.

image
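The reboot itself can also be issued through the vSphere CLI with the vicfg-hostops.pl script rather than the vSphere Client. A sketch (reboot_host is my own wrapper; the host should still be in maintenance mode at this point):

```shell
#!/bin/sh
# Sketch: reboot the host from the vSphere CLI so the 4.1 update takes
# effect. Assumes vicfg-hostops.pl (from the vSphere CLI) is on the PATH.
reboot_host() {
  host="$1"
  vicfg-hostops.pl --server "$host" --username root --operation reboot
}

# Example: reboot_host 172.20.70.13
```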

Once the host is successfully updated, you’ll see the correct version:

image

----------------------------------------------------------------------------------------------------------------------------------

Thoughts: Coming from a Windows background, I personally don’t like doing upgrades, and a colleague told me that our practice lead recommends simply reinstalling ESXi on the host. The problem I have with that is that you lose all your settings, so if you have a lot of hosts, the upgrade might be the better route to take.

I hope this has been beneficial to other professionals out there and has possibly even saved them some time.