Foundations of IT Disaster Recovery Planning (Part 1)

Summary

This post outlines the essential building blocks of a disaster recovery plan, emphasizing the importance of defining scope, recovery time objective (RTO), and recovery point objective (RPO) to prepare for various types of disasters.

Introduction

You’ve seen it happen to hundreds of other companies. Entire data centers wiped out by major storms. Ransomware that brings an enterprise system to its knees. All hell breaking loose because of a single typo or misclick. A “disaster” can take many forms, but that doesn’t mean you, your colleagues, and even future employees can’t be prepared for whatever comes your way.

In this three-part series, we’ll go over the building blocks of a solid disaster recovery (“DR”) plan, including elements you may not have considered in your initial draft. Whether you’re conducting your biannual test (you do have biannual tests, right?) or putting out a real fire, you and your team will be confident in your procedures.

Let’s start with the basics: What? and Why?

What?

To create a disaster recovery plan, you must know what you plan to recover. Everyone must be on the same page regarding the scope of your DR plan, especially in a large environment with numerous moving parts.

The scope is one of the most deceptive elements of a DR plan because, on the surface, it can seem so simple and hardly worth the extra time to plan out. However, a poorly defined scope can be devastating in the long run, especially if it becomes too large.

After all, the scope can start with just one application…

but this other system needs to be factored in, since it’s also critical…

and this database is on the same server, so it should be included as well…

and the application associated with that database…

Soon, you may find that your single DR plan documents your entire company’s infrastructure! And depending on the size of your organization, that single document could become very long, extremely disorganized, and definitely not what you’d want to skim through in an emergency, when every minute counts.

This disaster of a recovery plan can be prevented by firmly establishing what applications, databases, servers, etc., are “in scope” and “out of scope”.

When an element is “in scope”, it is either directly involved in or significantly affected by the recovery process in your DR plan. Likewise, if something is “out of scope”, it is irrelevant to the current DR plan and likely has its own procedure. For example, if a database is housed in Microsoft Azure, the Azure platform itself would be “out of scope” (keeping the underlying infrastructure running is Microsoft’s job). Instead, the DR plan’s focus would be on the database, ensuring its users and applications can connect to it and perform their tasks.

To make this work, you must define the application’s “sphere of influence”. This describes the various parts of your environment that the application interacts with or depends on. Of course, there’s a significant overlap between the “sphere of influence” and the scope of your DR plan, but the former is more detailed: It focuses on the relationships between all the moving parts. This is useful for people working in a “silo”, whose job is to focus on only one piece of the puzzle (and that often includes more than just the end users!).

Clearly defining your servers, databases, application elements, and even the devices people use to access the app will reduce the confusion caused by people shouting server and database names in a panic during a war room.
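
As a rough illustration, the scope and “sphere of influence” can be written down as a small dependency map. The sketch below is a minimal example, not a prescribed method: every component name (orders-app, auth-service, sql-server-01, and so on) is hypothetical, and it assumes that anything declared out of scope is covered by its own plan, so its dependencies are not followed any further.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it directly depends on.
DEPENDS_ON = {
    "orders-app": ["orders-db", "auth-service"],
    "orders-db": ["sql-server-01"],
    "auth-service": ["auth-db"],
    "auth-db": ["sql-server-01"],
    "sql-server-01": ["azure-platform"],
    "azure-platform": [],
}

# Hypothetical list of components covered by another plan or by a provider.
OUT_OF_SCOPE = {"azure-platform", "auth-service"}


def split_scope(app, depends_on, out_of_scope):
    """Walk the app's dependencies and split them into in-scope and out-of-scope."""
    in_scope, excluded = set(), set()
    queue = deque([app])
    while queue:
        current = queue.popleft()
        if current in in_scope or current in excluded:
            continue  # already classified
        if current in out_of_scope:
            excluded.add(current)  # record the boundary, but do not follow it further
            continue
        in_scope.add(current)
        queue.extend(depends_on.get(current, []))
    return sorted(in_scope), sorted(excluded)


in_scope, excluded = split_scope("orders-app", DEPENDS_ON, OUT_OF_SCOPE)
print("In scope:    ", in_scope)
print("Out of scope:", excluded)
```

Stopping at each out-of-scope component is what keeps the plan from growing one dependency at a time until it covers the whole company.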

Why?

However, there doesn’t always have to be a war room. Not every application is business critical. Sometimes, an application outage will only affect a single department or even a single individual. Therefore, especially if you have a large environment, you need to determine which applications:

need to be brought back up first
can afford to wait for a few hours (or a few days) to come back online
need to have the least amount of data loss
can tolerate hours (or even days) of data loss

This part of the plan has two pillars: recovery time objective (RTO) and recovery point objective (RPO).

RTO is another way of saying “How long can this system be down?” Or, if you’re one of the folks on the front lines, the question becomes “How long do we have to bring this system back online?” An application with a 6-hour RTO will be treated differently from an application with a 20-minute RTO. If these priorities have been laid out beforehand, the environments will be set up differently as well: The more critical application will most likely use a failover cluster of some kind, while the less critical application might be fine on a single server with daily snapshots that can be restored over the course of a few hours.

RPO, on the other hand, is another way of saying “How much data loss can we tolerate?” Even if a business-critical system comes back online in record time, it would be worthless if an entire week’s worth of data is lost in the process.
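
One simple way to make these two numbers actionable is to record them per application and sort the list. The sketch below assumes a made-up inventory (the application names and targets are purely illustrative) and orders it by shortest RTO first, using the tighter RPO as a tie-breaker. Your own ordering rules may well differ.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    name: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss


# Hypothetical inventory; names and numbers are illustrative only.
INVENTORY = [
    RecoveryTarget("reporting", rto=timedelta(days=1), rpo=timedelta(hours=6)),
    RecoveryTarget("order-entry", rto=timedelta(minutes=20), rpo=timedelta(minutes=15)),
    RecoveryTarget("intranet-wiki", rto=timedelta(hours=6), rpo=timedelta(days=1)),
]


def recovery_order(targets):
    """Shortest RTO first; ties are broken by the tighter RPO."""
    return sorted(targets, key=lambda t: (t.rto, t.rpo))


for target in recovery_order(INVENTORY):
    print(f"{target.name}: RTO {target.rto}, RPO {target.rpo}")
```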

This is where backups and snapshots take center stage. In SQL Server, for example, full and differential database backups provide solid but less granular restore points. For all intents and purposes, the entire database can be restored in just a few clicks, but you don’t have as much control over how much data loss might occur. If a database only has full backups, then you can only restore that database to the last good full backup, regardless of whether it was taken an hour ago, last night, or even last week. On the other hand, if a database has regular transaction log backups, you have much more control over the point to which the database can be restored, even up to just before the disaster occurred. That can mean only a few minutes of data loss, as opposed to an hour or more.
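
To put numbers on that difference, here is a small sketch that measures the gap between a failure and the most recent backup before it. The dates, the 01:00 nightly full backup, and the 15-minute log backup interval are all assumptions chosen for illustration. The model is also deliberately simplified: it treats every backup as a usable restore point and ignores things like tail-log backups or a broken backup chain.

```python
from datetime import datetime, timedelta


def data_loss(backup_times, failure_time):
    """Time between the most recent backup before the failure and the failure itself."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(usable)


# Illustrative failure time (assumption).
failure = datetime(2024, 5, 14, 15, 40)

# Full backups only, taken nightly at 01:00.
full_only = [datetime(2024, 5, 13, 1, 0), datetime(2024, 5, 14, 1, 0)]

# The same full backups, plus a log backup every 15 minutes after the last full.
last_full = datetime(2024, 5, 14, 1, 0)
with_logs = full_only + [last_full + timedelta(minutes=15 * i) for i in range(1, 96)]

print(data_loss(full_only, failure))  # 14:40:00
print(data_loss(with_logs, failure))  # 0:10:00
```

With full backups alone, more than fourteen hours of work would be gone. With log backups in the mix, the same failure costs ten minutes.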

The second part of this “Why?”, as hinted at earlier, relates to the structure of the environment and, more importantly, the cost. Why is there a five-node cluster just for one database, when another database is just fine living on someone’s laptop? Why are we spending thousands of dollars on redundant hardware and disks that can sustain ten days’ worth of hourly backups?

The rule of thumb is simple: The RTO and RPO requirements of your application need to justify the architecture of its environment.* Likewise, the architecture of the application’s environment must be able to fulfill its RPO and RTO requirements.

For example, if you have a database with an RTO of one day and an RPO of six hours, it does not need a dedicated failover cluster and log backups every fifteen minutes kept on disk for three days. In most cases, this would be a huge waste of money and resources that could be used on more critical applications (such as the one in the next example). A database with these requirements would be better served by daily full backups and differential backups every six hours on a single server, perhaps paired with a DR server if one is already available.

On the other hand, if a database had an RTO of one hour and an RPO of 15 minutes, it should not be kept on just one server with daily full backups. Unless the outage occurred immediately after the daily full backup, critical data would inevitably be lost. At a minimum, log backups would need to be taken every fifteen minutes, and a DR server or secondary cluster node should be added, if possible, to fulfill these requirements. This would require a larger budget, especially if another dedicated server is being added, but the additional cost would easily be justified by the application’s recovery requirements.
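
The two examples above boil down to a comparison between what the application requires and what its environment can deliver. The sketch below is one hedged way to express that check. The restore times passed in are assumptions (the examples above don’t specify them), and the `slack` factor used to flag a possibly over-provisioned setup is an arbitrary, hypothetical threshold rather than an industry rule.

```python
from datetime import timedelta


def assess(rto, rpo, restore_time, backup_interval, slack=4):
    """Compare requirements (rto, rpo) with what the environment delivers.

    slack is a hypothetical threshold: if the environment beats both
    requirements by more than this factor, it may be over-provisioned.
    """
    findings = []
    if restore_time > rto:
        findings.append("RTO not met")
    if backup_interval > rpo:
        findings.append("RPO not met")
    if not findings and restore_time * slack < rto and backup_interval * slack < rpo:
        findings.append("possibly over-provisioned")
    return findings or ["fits"]


# RTO of one day, RPO of six hours, on a failover cluster with 15-minute log backups.
# The 5-minute failover time is an assumption.
print(assess(timedelta(days=1), timedelta(hours=6),
             restore_time=timedelta(minutes=5), backup_interval=timedelta(minutes=15)))

# Same requirements, single server with differentials every six hours.
# The 4-hour restore time is an assumption.
print(assess(timedelta(days=1), timedelta(hours=6),
             restore_time=timedelta(hours=4), backup_interval=timedelta(hours=6)))

# RTO of one hour, RPO of 15 minutes, single server with daily full backups only.
print(assess(timedelta(hours=1), timedelta(minutes=15),
             restore_time=timedelta(hours=4), backup_interval=timedelta(days=1)))
```

A check like this won’t design the environment for you, but it does make mismatches in either direction easy to spot before a disaster does it for you.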

Now that the abstract details are out of the way, join us for Part 2, where we discuss some of the more concrete elements of the plan: Who is involved in our DR process, and where are our resources located?

*There is an extremely practical loophole you could utilize in certain situations, but we’ll get to that in Part 3…

Conclusion

DR planning can be complicated, and the examples provided in this series are intentionally broad and generic to account for various scenarios. However, we understand that “your mileage may vary”, so if you need DR guidance tailored to your organization, the experts at XTIVIA will be happy to help. Contact us today and discover how our team can help ensure you’re prepared for whatever comes your way.
