Tuesday, July 01, 2008
Disaster Recovery
(Note: This is a reprint from our July 2008 Newsletter)
In our last edition, we covered the topic of catastrophic failure. It can happen, does happen, and probably will happen – to you. The longer you rely on IT infrastructure, the more exposed you become to a failure. It may be minor, but in many cases the failure of an essentially cheap piece of equipment can stop a business in its tracks entirely, often for days.
So assuming it will happen eventually, and there's no way to stop it, how do you recover from such an event? To illustrate, we'll start with an analogy most people will be familiar with: car accidents. If you haven't been in one, you've seen one, or know someone who has, and generally understand the process. We often see clients say the equivalent of “I've just been in a car crash, should I put my seatbelt on now?”
Obviously the time to put the seatbelt on was before you started driving, and the same goes for disaster recovery. Preparedness is everything.
The two fields information technology primarily services in small business are information processing and communications infrastructure. Information processing is your word processing, spreadsheets or databases. Communications infrastructure is your e-mail, VoIP telephony, office networks and so on.
We'll tackle information processing failure first. Regardless of the exact nature of the failure, we have one thing working on our side. Essentially, and I'm drawing a fairly long bow here, all computers are exactly the same, at an abstract level at least. That's how I know that if I install Word on 50 desktops in an office, they will all operate in the exact same manner. I can save my Word document, send it via e-mail across the world and back, and a computer in America, Taiwan or Antarctica will be able to read that document in exactly the same manner. So the actual computer becomes almost disposable, and should be viewed as such.
The key component here is the information itself (the Word document), and the application processing it (Microsoft Word, or OpenOffice if you're that way inclined). As long as the information is safe, we can get you up and running again relatively quickly. This means backups. And like wearing a seatbelt every time you drive, backups require consistent, disciplined application if they are to be of any real value. There is a range of backup solutions to choose from depending on how risk averse you are and how much data you need to back up, but they all have one thing in common. You need to plan them, put them into practice, and importantly, review their effectiveness. This means doing what we call a “trial restore”, where a file or selection of files is restored from the backup to ensure it was recorded properly.
This is essential because, like any other device, your backup device could itself be malfunctioning, whether it is tape, CD, DVD, an external hard drive or a network copy. Obviously, the worst time to discover this is when something else has failed. If you effectively back up your data and retain a copy of that backup which can be loaded onto another computer, the failure of a single computer becomes an annoyance more than a disaster.
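For readers comfortable with a little scripting, a trial restore can be checked automatically rather than by eye. The following is only a minimal sketch in Python, not a description of any particular backup product: it assumes you have restored a selection of files into a scratch folder, and it compares each one against the live copy using a checksum. The function names and folder layout are our own invention for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_trial_restore(original_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file under original_dir with its restored copy.

    Returns a list of problems; an empty list means every file matched.
    """
    problems = []
    for original in sorted(original_dir.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            # The backup never captured this file, or the restore skipped it.
            problems.append(f"missing: {relative}")
        elif file_digest(original) != file_digest(restored):
            # The restored copy exists but its contents are different.
            problems.append(f"differs: {relative}")
    return problems
```

Bear in mind that a file edited since the backup was taken will be reported as differing, so a check like this is best run on files you know have not changed, or immediately after the backup completes.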
Now that you have a regular backup scheme in place, and periodically confirm that it's working, how do you actually recover from a failure? Do you need to repair the fault first? Not necessarily. If you have more than one computer in your organisation, chances are you can limp along by restoring the data and applications to another PC and using it. You may end up with staff sharing that computer, which is frustrating, but it will get you up and running again quickly. If you need to replace a failed component, in most cases you can do so in around 12 to 24 hours. The caveat here is that this only applies if the component which failed is still in production. The internal organs of PCs, which users are most often blissfully unfamiliar with, aren't exactly Lego bricks.
While you can swap them around from time to time, the way they plug together changes and manufacturers stop making the older versions. If you have a server which is five or six years old, which is not uncommon, not only is it nearing the end of its useful life and likely to fail, but the chances that you'll be able to find matching replacement components are rapidly diminishing. This rules out the option of ducking down to the local computer hardware store, grabbing a new hard drive or whatever failed, and getting up and running quickly. You may find that a whole new server will be required. Read that again: a whole new server. As we've mentioned numerous times in the past, this is an unfortunate but unavoidable aspect of IT. You need to be prepared to replace hardware periodically. It is obviously far preferable to go about this in a planned and orderly manner, rather than in a rushed panic because a failed component has brought your business to a halt.
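Planned replacement starts with knowing how old your equipment is. As a rough illustration only, the Python sketch below takes a list of machines with their purchase dates and flags those at or beyond a chosen age. The five-year default simply echoes the "five or six years" figure above, and the function name and inventory format are assumptions made for the example.

```python
from datetime import date


def due_for_replacement(purchased: dict, today: date, max_age_years: int = 5) -> list:
    """Return the names of machines at least max_age_years old, oldest first.

    purchased maps a machine name to the date it was bought.
    """
    def age_in_years(bought: date) -> int:
        years = today.year - bought.year
        # Subtract one if this year's purchase anniversary has not arrived yet.
        if (today.month, today.day) < (bought.month, bought.day):
            years -= 1
        return years

    old = [
        (bought, name)
        for name, bought in purchased.items()
        if age_in_years(bought) >= max_age_years
    ]
    return [name for bought, name in sorted(old)]
```

Running something like this once or twice a year turns a surprise purchase into a budgeted one.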
Communications infrastructure failures can be more frustrating, especially when multiple service providers are involved. For example, the failure of an Internet connection may be rats gnawing through cables in your roof cavity, fallen tree branches taking down your phone line and associated ADSL connection, a failure at your ISP, a failure at the phone exchange or any combination of the above. Once again, the concept of a backup comes to the fore, but in a slightly different manner. We're now talking about “redundant architecture”. Redundant, because you don't really need it until your main infrastructure fails.
We'll use an Internet connection as an example. Typically a small business will have a server connected to an ADSL service by a router, the kind of hardware you can buy at the local department store. If the router fails, and you don't have one in the cupboard, you'll be off air for as long as it takes to drive to the shops and buy a new one, then reconfigure it. What if the phone line fails? You can have a second line with a backup ADSL service waiting at the ready, but chances are better than good that a stray tree branch won't discriminate between lines. Also, if both lines are with the same service provider and the ISP fails, you've now got effectively double the useless bandwidth. Not an improvement. Similar arguments exist for many redundant schemes.
Your most effective option, and one we've employed frequently, is a service with as little in common with your regular service as possible. As an example, when a client's wired broadband service fails (ADSL/TransACT or similar), we can deploy a wireless service using a different provider altogether. This helps overcome failures on a local level. In the past we've been able to allow a client to keep operating for several days while their ISP attempted to diagnose and repair a fault. We did this by loaning them a wireless router which connected to the Internet via the same kind of network mobile phones use, but optimised for data transmission instead of voice calls. It meant their e-mail was disrupted for a matter of hours, instead of the week or so it took to get their normal service restored.
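The idea of choosing a fallback with as little in common with your main service as possible can be made concrete by listing what each connection depends on and counting the overlap. The Python sketch below is purely illustrative: the connection names and their dependencies are made-up labels, not a survey of any real provider, and the functions are our own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    depends_on: frozenset  # things whose failure would take this link down


def shared_failure_points(primary: Link, candidate: Link) -> frozenset:
    """Return the dependencies the two links have in common."""
    return primary.depends_on & candidate.depends_on


def best_fallback(primary: Link, candidates: list) -> Link:
    """Pick the candidate sharing the fewest failure points with the primary."""
    return min(candidates, key=lambda c: len(shared_failure_points(primary, c)))


# Hypothetical example connections; the dependency labels are assumptions.
adsl = Link("ADSL", frozenset({"overhead cable run", "exchange", "ISP A", "router"}))
second_adsl = Link(
    "second ADSL line",
    frozenset({"overhead cable run", "exchange", "ISP A", "spare router"}),
)
wireless = Link(
    "mobile wireless",
    frozenset({"mobile network", "ISP B", "wireless router"}),
)

choice = best_fallback(adsl, [second_adsl, wireless])
print(choice.name, "shares", len(shared_failure_points(adsl, choice)), "failure points")
```

In this made-up example the second ADSL line shares three failure points with the main service while the wireless service shares none, which is exactly why the stray tree branch or the ISP outage takes out the former and not the latter.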
There are, however, bona fide show stoppers. The classic example is “What if the building burns down?” While this rarely happens, it is often presented as the worst possible scenario. We get asked how quickly we can get everyone back working. The truth of the matter is that, even with the most regimented backups, your IT needs will be far from the fore in terms of urgency. If your building has literally burnt down, you'll need a new fleet of computers. And desks. And chairs. And phones. And a building. It's also true that once you have these things, if you have followed your backup procedures properly, it will be a relatively routine procedure to get the new equipment loaded with your old data and productive again. By comparison, finding new premises, furniture, hardware etc. will be a much bigger challenge.
So to summarise, disaster recovery has more to do with prevention and preparedness than reaction. In the case of information processing, having a backup of your applications and their data is absolutely vital. We cannot stress enough how important this is. In the case of communications infrastructure, having a distinct fallback service is vital. If you are to rely on these backups, they must be tested periodically to ensure they actually work as expected.

