Whitepaper: Achieving High Availability with Exchange Server at Microsoft
I have often wondered how some of the IT companies use their own products, especially ones that have less than stellar reputations with customers. Microsoft has released a whitepaper that outlines how their internal IT department supports Exchange. I thought to myself 'Now, this is something that I have got to read!'. In the end, it is best if you read this publication through twice. The first time, read it through just to see how Microsoft goes about measuring downtime and how they determine SLAs and strive to meet them. Don't pay much attention to the technology, but rather compare their methodology to the one you use in your current position. This is where we Domino administrators can get the most value from this paper. The second time around, feel free to make snide remarks about their ideas surrounding clustering and the limitations they have to work with when implementing Exchange. Laughing out loud is not only encouraged, but truly beneficial to the soul.

The first thing that struck me was their definition of an outage.

Any downtime of e-mail services counts against availability goals, even if not caused by Exchange.
So if the switch a server is connected to blows up, it counts against the Exchange group. If the Active Directory server implodes or the DNS gets hijacked (personal experience on this one), their numbers take a hit. Try hitting the four nines with that type of monkey hanging on your back. Basically, this makes the Exchange group the team that is responsible for having everything working correctly on the network. In my mind, that is a lot of responsibility, and I have never been in an environment where the email group had enough authority to make that a feasible situation. It's usually the networking group that is at the top of that chain of command in most companies. Chris Nelson, Director of Messaging, states it literally -- "We now own what we don't own". Kind of like being a step-parent.
Over the last several months, seven percent of total Exchange downtime has been for planned Exchange upgrades. Six percent of Exchange downtime has been due to other Exchange-specific issues. The rest of the downtime—87 percent—was caused by issues outside Exchange.
I guess that not having any major upgrades in the last 3 years has at least one positive result. It is my experience in the Domino environment that the numbers are the same if not better. Most of the unplanned outages had to do with networking or hardware issues, not Domino software issues.
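To put the "four nines" remark in perspective, here is a rough Python sketch of the arithmetic. It assumes a 365-day year and, purely for illustration, applies the whitepaper's 7/6/87 percent split to a full four-nines downtime budget. The function names and cause labels are mine, not anything taken from the paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (4 -> 99.99%)."""
    return MINUTES_PER_YEAR / 10 ** nines


def split_downtime(total_minutes: float, shares_percent: dict[str, int]) -> dict[str, float]:
    """Divide a downtime total by cause, given percentage shares that sum to 100."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    return {cause: total_minutes * pct / 100 for cause, pct in shares_percent.items()}


# Percentages quoted from the whitepaper; the labels are paraphrased.
SHARES = {
    "planned Exchange upgrades": 7,
    "other Exchange-specific issues": 6,
    "issues outside Exchange": 87,
}

if __name__ == "__main__":
    budget = downtime_budget_minutes(4)
    print(f"Four nines allows about {budget:.2f} minutes of downtime per year")
    for cause, minutes in split_downtime(budget, SHARES).items():
        print(f"  {cause}: {minutes:.2f} min")
```

Under those assumptions, four nines leaves less than an hour of downtime a year, and most of it would be eaten by things the Exchange team does not control.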

I really liked the emphasis they place on writing meaningful and measurable SLAs. The three-pronged dashboard approach of measuring messaging availability (Mailbox availability, Mail delivery time, and Email Client Availability) is straightforward and can be used in any messaging environment. Pay close attention to the guidelines for creating meaningful SLAs and the review progress sections, as they contain good information that you will be able to use immediately. Their insights into what 24x7x365 really means will give you and your management something to think about. I was a little surprised that their SLA for historical mail restores is 2 days, but that's life in a SCOS world. Probably the best idea I saw was that one single person is responsible for each measurement's reporting and must investigate any deviations from the norm. SLAs and metrics always work better when you have defined responsibilities for measuring them.
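The one-owner-per-measurement idea is easy to model in any environment. The following Python sketch is only an illustration of that idea: the three measurement names come from the whitepaper's dashboard, but the owners, targets, and observed values are invented, and the `Measurement` class and `needs_investigation` function are hypothetical names of my own.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    name: str            # one of the three dashboard measurements
    owner: str           # the single person accountable for reporting it
    target: float        # hypothetical SLA target
    observed: float      # hypothetical observed value for the period
    higher_is_better: bool = True

    def meets_target(self) -> bool:
        """True when the observed value is on the right side of the target."""
        if self.higher_is_better:
            return self.observed >= self.target
        return self.observed <= self.target


def needs_investigation(measurements: list[Measurement]) -> list[tuple[str, str]]:
    """Return (measurement, owner) pairs for every missed target."""
    return [(m.name, m.owner) for m in measurements if not m.meets_target()]


# All owners, targets, and observations below are invented for illustration.
DASHBOARD = [
    Measurement("Mailbox availability (%)", "alice", 99.99, 99.995),
    Measurement("Mail delivery time (s)", "bob", 90.0, 140.0, higher_is_better=False),
    Measurement("Email client availability (%)", "carol", 99.9, 99.95),
]

if __name__ == "__main__":
    for name, owner in needs_investigation(DASHBOARD):
        print(f"{owner} should investigate: {name}")
```

The point of the structure is that a missed target always comes back with a name attached, so nobody has to ask who should chase it.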

The architecture of the Exchange 2003 servers was very interesting to me. For most of the mail infrastructure, Microsoft is using a seven-node Windows server cluster to host 20 Exchange mail databases (mail stores) with approximately 200 users per database. The average mailbox size limit is 200 MB, so that works out to about 4,000 users per virtual server (20 x 200), with each mail store topping out at roughly 40 GB (200 x 200 MB). What is not listed is how many virtual Exchange servers are hosted on each server cluster. Without that information, it is unclear how much horsepower is needed per user and, therefore, impossible to make an accurate comparison to what I have worked with in Domino. In addition, I found their reasoning behind implementing clustering to be a little self-serving.

Perhaps the biggest benefit of using Windows Server 2003 clustering on current enterprise-class hardware is not its failover capabilities but rather its management flexibility and the effect that has on planned downtime. Finding ways to reduce planned downtime can contribute even more to increasing availability than increasing your ability to cope with unplanned downtime.
So since Exchange cannot do true Active/Active, shared-nothing clustering, the load balancing and failover capabilities of clustering are downplayed. In my 10+ years of experience with Domino, even with quarterly upgrade releases, version upgrades, and regular Windows and third-party software updates, the amount of unplanned downtime from external sources, whether due to batteries in the UPS failing during a power spike or viruses killing DNS lookups, dwarfs the amount of planned downtime. And this is even more true on non-Windows platforms, where critical fixes are not sent out with the frequency of new AOL CDs.
Clustering contributes significantly to reducing planned downtime, because an Exchange virtual server is not tied to a single Windows host, but can be moved from one cluster node to another when software updates or hardware maintenance is necessary. This does not completely eliminate planned downtime, because a few minutes may be required to move Exchange services between cluster nodes. However, Exchange services do not have to be down for the entire duration of planned installations, reboots, and hardware replacements.
I am at a true loss for words when commenting on this. This is not clustering; this is having a hot spare ready all the time. In Domino, seeing clustering work is as easy as pulling the network cable on one of the servers in a cluster and watching all of the Notes clients AUTOMATICALLY fail over to one of the other servers in the cluster. Then plug the cable back in and watch the databases sync back up.

What's even worse, as far as I am concerned, is the level of expertise needed to get Exchange running on a Windows Server cluster.

Installing a single Exchange server is straightforward, especially with the step-by-step guidance of the new Deployment Tools in Exchange Server 2003. Nonetheless, deploying a supportable Exchange system in a worldwide or enterprise environment requires significant expertise. The interaction of Exchange with Active Directory requires attention and thoughtfulness of design. Comprehensive Microsoft deployment guides and white papers can assist system architects in designing and building an effective Exchange system.
In contrast, setting up a Domino cluster takes no additional expertise. Simply add the desired servers to the same cluster via the Administrator and common databases will begin replicating in real time via the cluster replicator. Sure, there are some things that can be done to optimize the cluster, but getting it up and running takes about 5 minutes, less if you have already decided on a name for your cluster. No wonder it seems that Exchange shops employ so many administrators.

The true difference between the way Lotus does clustering and the way Microsoft sells it is illustrated in their typhoon example. Whereas a Domino administrator would have just put a cluster member in a different data center in case of a similar issue in the future, the MS team decided to move the entire Exchange cluster because the data center wasn't able to meet their needs. Talk about high maintenance!!!

via Peter de Haas

  • 1) Interesting post... - Duffbert
    Created 1/5/2006 8:35:28 AM

    Thanks for posting this, Sean... I still have to download the paper and read it, but it sounds interesting. Stuff like this makes me glad I became a developer instead of an administrator... :)

