It seems that more and more projects lately have had the mandatory line item “Disaster Recovery” as part of their requirements. Whenever it comes up, I try to pay only as much attention to it as necessary, because to me it is one of those problems that are not solvable within the specified constraints – that is, the budget you are given. It can become incredibly frustrating if you take it too seriously because, although everyone requires a Disaster Recovery plan, no one can quite define it beyond that. Ask a Business Analyst or VP and they’ll spout off some impossible number of 9’s of availability, period. And if you allow us engineers to define it, we can come up with a myriad of opinions, none of which is “achievable”.
That said, I want to put down some thoughts on the subject, and perhaps close with some suggestions and recommendations.
What is Disaster Recovery?
A definition seems like a good place to start. In my book, DR has always been a plan for what to do when something incredibly improbable, and out of your control, happens that affects the ability of the masses to reach your applications/services. These are generally classified as problems with “location”: power, cooling and connectivity. Since most folks no longer run their own data center, they are at the mercy of the hosting provider they select. Unforeseeable acts include things such as an act of nature – a tornado obliterating the data center, someone zapping the power grid causing you to go dark, or even a back-hoe tearing up a bundle of fiber. These are all extraordinary events, but nonetheless it would seem they are something we need to plan for, and that is the basis for DR.
It’s worth noting that Disaster Recovery typically excludes everything within your control in your piece of the data center. So local networking issues, local hardware issues, local storage issues, and local application issues are not part of the plan. After all, these should already be built to be highly available in some fashion anyway.
Disaster Recovery Models
Let’s go into some detail about the different options I’ve encountered when discussing DR. In general, the idea is for you to maintain a presence in more than one geographically dispersed data center, with the thought that any extraordinary event in one data center won’t occur in the other. Because they are geographically dispersed, the weather conditions should be different enough not to cause a simultaneous act of nature. The same goes for power, since they are on different grids, and for the network connectivity into the sites. So what are our options?
Active/Active
This is the most operationally expensive and architecturally complex option, so let’s look at it first. The idea is that you maintain your applications/services in two locations and have them both actively serving production requests. Although this gives you the most return on your investment, since you are using all your assets all the time, it does present a bunch of challenges.
Operationally, your team must maintain the applications on essentially twice as many systems as needed to handle all your production traffic, such that each data center runs at only 50% capacity. This is necessary since you have to deal with the extraordinary case when an “event” occurs and the remaining functioning data center begins to receive 100% of the traffic. And since you are maintaining double the machines, you’ll probably need to increase your operational staff to maintain those additional assets.
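To make the capacity point concrete, here is a back-of-the-envelope sketch. The peak request rate and per-server throughput below are invented for illustration, not from any real deployment:

```python
import math

# Active/active sizing: each data center must be able to absorb 100% of
# peak traffic alone during a DR event, so each site is sized for the
# full load and both run at half utilization in normal operation.

def servers_per_dc(peak_rps: float, rps_per_server: float) -> int:
    """Size one site to carry the full peak load by itself."""
    return math.ceil(peak_rps / rps_per_server)

per_site = servers_per_dc(10_000, 500)       # 20 servers in each data center
total_servers = 2 * per_site                 # 40 servers bought and maintained
normal_utilization = 10_000 / (total_servers * 500)  # 0.5 under normal load
```

In other words: double the hardware of a single-site deployment, idling at half utilization until the day an event occurs.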
Architecturally, you must now deal with a whole slew of data management issues. If your application is at all interesting, it’ll be receiving data, at both data centers. How do you do that and still maintain some coherence? You can decide that one data center is the master if it’s available, always update and query data from it, and sync the data from the master to the slave. When a DR event occurs, the application must have some logic to decide it now must fail over to the secondary data providers. But then you must be careful not to automatically fail back without making sure the original master data source is up to date. An ugly problem.
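A minimal sketch of that master-first logic, using in-memory stand-ins for the two replicated data sources (the class names and methods here are hypothetical, not a real API):

```python
class DataSource:
    """In-memory stand-in for one data center's replicated database."""
    def __init__(self, name: str):
        self.name = name
        self.available = True
        self.rows = {}

    def write(self, key, value):
        if not self.available:
            raise ConnectionError(self.name + " unreachable")
        self.rows[key] = value

    def read(self, key):
        if not self.available:
            raise ConnectionError(self.name + " unreachable")
        return self.rows.get(key)


class MasterFirstStore:
    """Always use the master while it is up; fail over to the slave during
    a DR event. Failing back is deliberately NOT automatic, since the old
    master may be stale until it has been re-synchronized."""
    def __init__(self, master: DataSource, slave: DataSource):
        self.master, self.slave = master, slave
        self.failed_over = False

    def _active(self) -> DataSource:
        return self.slave if self.failed_over else self.master

    def read(self, key):
        try:
            return self._active().read(key)
        except ConnectionError:
            self.failed_over = True  # DR event: switch to the secondary
            return self.slave.read(key)
```

Once replication has copied a row to the slave, losing the master flips `failed_over` and reads continue from the secondary; un-flipping it is the ugly, manual part.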
Another option would be to build a data services tier that updates both data sources with the same data no matter where the tier is, and similarly retrieves data from both sites, combines it intelligently somehow, and returns it to the caller. You still have the same problem of what to do when you recover from the event, and now you have to deal with ensuring connectivity between data centers at all times. Oh, and since they are geographically dispersed, you’ll probably have latency issues to deal with, limiting how quickly you can update and retrieve data, and possibly reducing customer satisfaction.
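A sketch of such a dual-write tier, again with invented in-memory sites; here "combines it intelligently" is reduced to the simplest possible rule, highest version number wins:

```python
class Site:
    """Dict-backed stand-in for one data center's store."""
    def __init__(self):
        self.available = True
        self.data = {}

    def put(self, key, versioned_value):
        if not self.available:
            raise ConnectionError("site unreachable")
        self.data[key] = versioned_value

    def fetch(self, key):
        if not self.available:
            raise ConnectionError("site unreachable")
        return self.data.get(key)


class DualWriteTier:
    """Writes go to every site; reads query every site and the newest
    version wins. A real tier would also need retries, write replay for
    sites that were down, and proper conflict resolution."""
    def __init__(self, *sites):
        self.sites = sites

    def write(self, key, value, version: int):
        for site in self.sites:
            try:
                site.put(key, (version, value))
            except ConnectionError:
                pass  # a real tier would queue and replay this write

    def read(self, key):
        copies = []
        for site in self.sites:
            try:
                row = site.fetch(key)
                if row is not None:
                    copies.append(row)
            except ConnectionError:
                pass
        return max(copies)[1] if copies else None
```

Note how a site that missed a write while down silently serves a stale copy afterwards; version-wins reads paper over that, but the replay problem remains.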
Active/Standby
This option can alleviate much of the architectural complexity of attempting an Active/Active configuration, but it is still pretty operationally expensive. As the name implies, this model has all requests being serviced out of only one active data center at a time. The standby data center is typically configured with a continuous synchronization (or push) of data into it. There are lots of technologies available for synchronizing data, so that shouldn’t incur too much operational overhead.
However, operationally, you still have to maintain two instances of your applications/services in two data centers – and, if you want to be able to handle 100% of the traffic, twice as many systems. This is nominally less expensive than the active/active model since your operations staff really has to deal critically with only one live data center at a time. But that is a double-edged sword. Since they are focused on the live data center, it is likely, without the right set of disciplined folks, that the standby data center will grow mold. The application won’t be upgraded at the same rate as the live instance, and so all hell breaks loose during a DR event.
And since it’s a standby data center 99.9… (ad nauseam) percent of the time, it is costing you quite a bit just to heat and cool air. You could look at recapturing some of that investment by finding other uses for the standby, for example putting non-production instances of the application, like test and staging instances, in it. If you’ve built things around some virtualization technology like VMware vSphere, or Xen, or (gulp) Hyper-V, you could cleverly manage your computing resources so the standby production virtual machines are given the resources they need when an event occurs. Depending on the chosen technology, this could be easier or harder to do.
Active/Backup
This model is the least expensive, and probably the least complex, but as a result there is less possibility of providing those elusive N 9’s of availability.
With this option you do two things. First, as with active/standby, you build full data sources at each data center and configure a synchronization from one to the other. This gets your most critical piece of the picture into two places at all times – or with minimal data loss, at least.
You also periodically back up your systems and push the images to the backup data center. You do this in such a way that you minimize the number of systems you need to keep the backup data. Perhaps all you need is a replica of your shared storage, and one or two systems to manage the incoming backups. Using a virtualization technology can help here too: the right technology can help you snapshot and copy images to your backup data center.
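As a sketch of the "minimize what you keep at the backup site" idea, here is a hypothetical helper that selects only the newest snapshot per virtual machine for replication (the snapshot tuple format is invented for illustration):

```python
def snapshots_to_push(snapshots):
    """Given (vm_name, timestamp, image_path) tuples from the primary site,
    return the newest image per VM -- the minimal set worth pushing to the
    backup data center."""
    newest = {}
    for vm, ts, path in snapshots:
        # keep only the most recent snapshot seen for each VM
        if vm not in newest or ts > newest[vm][0]:
            newest[vm] = (ts, path)
    return {vm: path for vm, (ts, path) in newest.items()}
```

The actual copy step would then be whatever your virtualization platform provides for exporting and shipping images.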
Along with all this, you work on agreements with your hosting provider to have systems available for installation and configuration within a specific time period. With this, you build an activation plan: what to do when a DR event occurs. This would include things like working with your provider to set up and configure the hardware, re-hydrating all those backed-up virtual machines, and testing and go-live procedures.
Clearly, this is not a model for maintaining the 9’s no matter what the reason, but it is a cost-effective option which limits the amount of daily operational complexity a team must deal with.
Event Classes
I think it’s worthwhile to look at the different event classes in a little more detail, as this can provide input on which model to choose.
Connectivity Events
This is the case where data center connectivity goes dark, for whatever reason. A back-hoe takes out some fiber. A backbone router crashes. A system administrator funks up some routing tables. China hijacks Internet traffic.
But here’s the deal: just as you have built your data center space so that its network is highly available, any good hosting provider should do the same through multiple peering arrangements, fiber entering the facility at multiple locations, and so forth. And they should be able to redirect traffic if such an event occurs – maybe so quickly you don’t even notice it, but even if you do, certainly quicker than you can do something about it with any of the models except perhaps active/active.
Power Events
These can occur when a transformer at the municipal utility blows up, and so forth, causing loss of power to your hosting provider. However, same deal: hopefully you have chosen a hosting provider with backup power capabilities. There is still no guarantee that backup power actually comes online appropriately, as happened at 365Main in 2007. As mentioned in the report, power was restored 45 minutes later.
Which raises the question: except for the active/active model, when do you decide to invoke a DR activation plan? Is it time based, governed by the 9’s? Is it by the perceived severity of the event? Something else? And how do you weigh that against the cost of performing a fail-back or re-synchronization after the primary data center is back online?
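One way to make that decision less of a judgment call is to write the rule down ahead of time. A hypothetical time-based rule, with every threshold invented for illustration:

```python
def should_activate_dr(remaining_budget_min: float,
                       estimated_repair_min: float,
                       activation_min: float) -> bool:
    """Invoke the DR plan only when riding out the outage would blow the
    remaining annual downtime budget AND failing over is expected to be
    faster than waiting for the primary data center to recover. The cost
    of the eventual fail-back/re-sync is one reason activation_min is
    never small."""
    would_blow_budget = estimated_repair_min > remaining_budget_min
    failover_is_faster = activation_min < estimated_repair_min
    return would_blow_budget and failover_is_faster
```

The hard inputs here are the estimates, not the rule: mid-incident, nobody knows `estimated_repair_min` with any confidence.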
Cooling Events
Is this really something to worry about? Any facility should have an N+1, N+2 or better set of chillers. If they don’t, you shouldn’t even consider hosting with them.
Environmental Events
These are the act-of-nature type of events: earthquakes and tornadoes, floods, fallen trees, and so forth. These are events where there is very little probability that you will ever be able to get any data back from the facility. This, I think, is the heart of the necessity for Disaster Recovery.
It means, minimally, that you must have all your data, your crown jewels, at another location, whether that’s another data center, your laptop, a thumb drive, or something else. It just needs to exist in two places, minimally, period, no matter how remote the possibility of loss.
There are things you can do to mitigate the possibility of environmental events, too, which can help you decide on an appropriate DR model. Survey the environmental risks of the hosting providers’ locations. For example, Amazon has a huge data center in Oregon which is free from tornado threats, and probably earthquakes too; flooding could be a concern, however. Then there’s Switch, with its SuperNAP in Nevada – although cooling could be a concern there, it’s likely free of the majority of acts of nature. There’s Chicago, San Antonio, Utah, and lots of other locations with providers, too.
Final Thoughts
I started this out by mentioning why, besides all the technical challenges, Disaster Recovery is such a frustrating topic. It’s generally poorly defined, and everyone has their own opinion on what it really means. So I think the first thing to do if you are stuck in this situation is to give it your own very specific, well-defined meaning and make sure everyone agrees, before you even consider any implementation plan.
It is my contention that DR is something quite extraordinary, and so perhaps one should deal with it in an exceptional way. For example, maybe, instead of thinking about one string of 9’s for uptime, you consider two. Under normal operation, you impose a tighter tolerance for downtime, say 99.999% uptime, which would allow only about 5 minutes of downtime over the year. However, if you are ever so unlucky as to encounter an event, perhaps agree on something a little more lenient – even 99.9% uptime gives you almost 9 hours of downtime. It seems like a lot. It is a lot. But considering that, by my definition, DR is about dealing with environmental events, some leeway may be in order. Especially considering the mean time between events, which, depending on the data center location, could be decades.
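The arithmetic behind those two tiers, spelled out:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(uptime_pct: float) -> float:
    """Minutes of downtime a given uptime percentage allows per year."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

normal_tier = downtime_minutes_per_year(99.999)  # ~5.3 minutes/year
dr_tier = downtime_minutes_per_year(99.9)        # ~525.6 minutes, ~8.8 hours
```

So the lenient DR tier allows roughly a hundred times the downtime of the normal tier, which is the leeway being argued for.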
Once you have agreed on availability numbers, this, more than anything else, drives your implementation plan. If you are allowed only a small amount of downtime, clearly you must bite the bullet and go with an active/active solution. But you’ll need to budget for it. On the other hand, if you are given more leeway, perhaps an active/backup strategy makes more sense. You need to do this minimally anyway, as the data must be duplicated geographically.
Don’t forget to run the numbers to compare the lost revenue during downtime against the cost of the solution. You may find that the cost of implementing a sophisticated DR solution is more than the revenue that would have been lost during the downtime.
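That comparison is simple enough to sketch; all figures below are illustrative:

```python
def dr_is_worth_it(revenue_per_hour: float,
                   expected_outage_hours_per_year: float,
                   annual_dr_cost: float) -> bool:
    """Compare the revenue a DR solution protects against what it costs.
    Ignores intangibles like reputation damage, which can dominate."""
    revenue_at_risk = revenue_per_hour * expected_outage_hours_per_year
    return annual_dr_cost < revenue_at_risk
```

For a business earning $10,000/hour facing roughly 8.76 expected outage hours a year, a $50,000/year solution pays for itself; at $1,000/hour it does not.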
There are many reasons for building complex, geographically dispersed, clustered applications. Whether disaster recovery alone is reason enough is questionable – it really depends on the specifics. Hopefully this will help folks begin those discussions.