Disaster recovery and contingency planning is the biggest task we on the infrastructure side of things are focused on right now.

This initiative got a much-needed push this past summer when a 100-year-old steam pipe burst in front of our office building, forcing us out for two weeks with no access whatsoever to our data center.

DR has always been one of those things that has been emotionally charged, mysterious and frightening to me. It can seem like an unsolvable problem. There are so many challenges. Management and user communities often don’t understand the massive scope of providing a ‘turn-key’ and ‘seamless’ ‘fail-over’ solution. A favorite saying of one former boss was that we could ‘send men to the moon in 1969, so why can’t we (you) do this…?’

The answer to that question is that yes, we can do this - but how much money and how many resources would you like to dedicate?

Chances are that counter-question (always use tact) will put things back in perspective. If you are really lucky, the answer will be that this is a major priority and will be supported and funded as well as or better than any other technology or business initiative. If not, you will have to decide how best to use your budget and resources to accomplish as much of a plan as possible.

I’ve always worked in medium-sized businesses or nonprofits. In these situations the IT staff is typically small, and sysadmin/network admin types tend to become pretty versatile.

You are an IT manager or director, you face high expectations from users and management, your funding and resources are not so high, and you are expected to put together a workable IT disaster plan. Where do you start?

There are essentially two approaches I’ve seen so far. The first involves a great deal of planning, meeting, analysis and consulting, and takes a very long time. In the end there may be more questions than answers. The second approach involves starting where you are, with what you have, and going from there. The advantage of the second approach is that should something actually happen tomorrow, you at least have something in place. It also gives you the opportunity to actually show management what is and is not possible with your given resources. Planning and analysis aren’t bad things - it’s just that few people outside IT are going to understand the basic technical challenges. I think starting simple is a great strategy.

Where you will need to do your analysis and planning is in deciding which services you are going to make available during an emergency. Learning to communicate effectively with non-IT management is a big key to success.

Our emergency back in July answered some of these questions for us. When our data center went dark, it became immediately evident what our priorities were. They weren’t necessarily what we would have thought. Some of the needs came up from the field to IT, while senior management had other needs that were coming down from above. Our role became to present these needs and establish a priority. This can be a delicate position for IT since there may be conflicting demands. Our job was to look at the needs, evaluate what it would take to fulfill them, and present a proposal back to senior management.

We were fortunate that we already had a ‘back-up’ email solution in place. Our email solution is a company that essentially spools our email and then forwards it. In the event we go down, our users can log onto a web site and retrieve their email. It is a little more complex than that, but that is basically how it works.

What we ended up with this summer was a basic, scalable platform to provide and restore services to our users. It took about a week, but in that time we contracted a co-location facility, purchased some servers, and began providing our users with the most essential services.

A lot of immediate issues came to light. One was that we didn’t have the most current backup tapes, since the pick-up had been missed prior to the emergency. We started with what we had.

We provided access via Citrix. We began with the evaluation licenses, working the other details out later.

The resources were on the light side. But the point is that we got something going in an extremely short period of time. Something is always better than nothing in a situation like this. You can always add on, add space, and expand.

Since this summer, we have moved forward and will be adding an ESX environment as well as replication to our co-lo facility. We are already doing daily SQL dumps off-site.

The primary technical challenge will always be bandwidth. Unless you have a massive pipe between your primary and co-location facilities you will have to make some decisions as to what you can effectively replicate on a real-time basis. The other consideration is that your emergency resources need to be powered, cooled, and secured just like your production resources.

Vendors will try to sell you products, many of which work very well, to do data replication and/or fail-over. The issue is that you simply cannot push more changed data over your pipe in a given period than the pipe can carry in that period. If you are using a T1 at your primary and co-lo sites, that T1 must not only provide your normal daily bandwidth requirement; it must now also carry all your daily data deltas (or differences/differentials/incrementals). A T1 is rated at about 1.5 million bits per second (1.544 Mbps). In practice, after TCP/IP overhead and latency, you will be doing well to see 900-1000 Kbps of usable bandwidth. These are BITS, not BYTES. We typically measure our data sizes in BYTES. So in the best case, our 1000 kilobits per second, at eight bits per byte, is 125 kilobytes per second, 450 MB per hour, or about 11 GB per day.
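For anyone who wants to rerun this arithmetic with their own link speed, here is a rough sketch in Python. The function name daily_capacity is my own invention, and it assumes decimal units (1 MB = 1000 KB) to match the round numbers above. Plug in whatever throughput you actually measure.

```python
def daily_capacity(effective_kbps):
    """Return (KB per second, MB per hour, GB per day) for a link.

    effective_kbps is the usable throughput in kilobits per second,
    after protocol overhead. Decimal units are assumed (1 MB = 1000 KB).
    """
    kb_per_sec = effective_kbps / 8          # eight bits per byte
    mb_per_hour = kb_per_sec * 3600 / 1000   # 3600 seconds per hour
    gb_per_day = mb_per_hour * 24 / 1000     # 24 hours per day
    return kb_per_sec, mb_per_hour, gb_per_day


if __name__ == "__main__":
    # 1000 Kbps is the best-case T1 figure used in the text.
    kb_s, mb_h, gb_d = daily_capacity(1000)
    print(f"{kb_s:.0f} KB/s, {mb_h:.0f} MB/hour, {gb_d:.1f} GB/day")
```

At the low end of the range (900 Kbps) the same calculation gives a bit under 10 GB per day, so the 11 GB figure really is the optimistic case.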

If you have 100 users, that is about 110 MB of total data per user per day. This does not account for your existing bandwidth usage such as email, internet surfing, or anything else you use your Internet connection for. Replication software typically provides some compression, or the ability to replicate data blocks instead of entire files, but you can see the challenges. Most of us will find that our connectivity is the ultimate deciding factor in what we can and cannot replicate off site in real time. If the vendor tells you that you can queue your replication for less busy times of day, or ‘drip’ the data, or whatever - just remember this simple math. You cannot put more data over the wire per day than it will accept.
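A quick way to sanity-check a vendor's pitch is to turn that simple math into a small script. The sketch below is only an illustration: the function names, the 10.8 GB/day link figure (the best-case T1 from above), and the sample change rates, traffic and compression ratios are all assumptions of mine, not measurements from our environment.

```python
def per_user_budget_mb(link_gb_per_day, users):
    """Daily replication budget per user in MB (decimal units)."""
    return link_gb_per_day * 1000 / users


def fits_on_link(daily_change_gb, link_gb_per_day,
                 other_traffic_gb=0.0, compression_ratio=1.0):
    """Return True if one day's changed data can cross the link in one day.

    other_traffic_gb  -- what email, web, etc. already consume per day
    compression_ratio -- 2.0 means the replication tool halves the data
    """
    available_gb = link_gb_per_day - other_traffic_gb
    to_send_gb = daily_change_gb / compression_ratio
    return to_send_gb <= available_gb


if __name__ == "__main__":
    link = 10.8  # GB/day, best-case T1 (hypothetical input)
    print(f"{per_user_budget_mb(link, 100):.0f} MB per user per day")
    # 15 GB of daily change does not fit uncompressed...
    print(fits_on_link(15, link))
    # ...but would fit if the tool really achieved 2:1 compression.
    print(fits_on_link(15, link, compression_ratio=2.0))
    # 8 GB fits on an idle link, but not if 4 GB/day is already in use.
    print(fits_on_link(8, link, other_traffic_gb=4))
```

Queuing or 'dripping' the data only changes when during the day it moves. It does not change the daily total that this check compares against.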

One of the things that makes an IP-based SAN so compelling is the ability to mirror data at block level, and to support technologies like virtual computing that make our lives so much better. But it is expensive.

That’s about it. Don’t be overwhelmed. Start with what you’ve got. Don’t forget that your emergency resources need to be powered, cooled, and secured just like your production resources do. Do the best you can with it and present the results with pride and an attitude of how much more could be done if you had that fat pipe and big SAN.

Photo by Charles Socci - Crews repair steam pipe rupture at 41st Street and Lexington Avenue, New York City July 2007