Foundations of IT Disaster Recovery Planning (Part 2)

SUMMARY:

Robust IT disaster recovery planning requires organizations to meticulously define both the essential human stakeholders—including a multi-role disaster recovery team—and the detailed physical and virtual locations of all application infrastructure to ensure swift recovery and meet established RTO and RPO metrics.

The crucial stakeholders involved in IT disaster recovery include the core technical recovery team and end users who are directly affected by an outage.
The dedicated disaster recovery team, responsible for fulfilling RPO and RTO requirements, typically includes roles such as the Database Administrator, Server Administrator, Application Owner/SME, and end user representatives.
It is critical to document the location of infrastructure in detail, even within complex cloud environments (like Azure or AWS), by including specifics on access credentials, failover processes, connection strings, and key vaults to speed up recovery.
For hybrid or fully on-premises environments, the DR plan must note the physical locations of the primary and all off-site data centers to properly prepare for localized physical disasters, such as fires or major storms.

Thorough preparation regarding personnel responsibilities and precise infrastructure documentation ensures the business can confidently respond to any form of disaster, from ransomware to physical catastrophe.

You’ve seen it happen to hundreds of other companies. Entire data centers wiped out by major storms. Ransomware that brings an enterprise system to its knees. All hell breaking loose because of a single typo or misclick. A “disaster” can take many forms, but that doesn’t mean you, your colleagues, and even future employees can’t be prepared for whatever comes your way.

In this three-part series, we’ll go over the building blocks of a solid disaster recovery (“DR”) plan, including elements you may not have considered in your initial draft. Whether you’re conducting your biannual test (you do have biannual tests, right?) or putting out a real fire, you and your team will be confident in your procedures. In Part 1, we established the basic premise of our DR plan: the scope of our application, its RTO (how fast it needs to be recovered), and RPO (how much data can be lost). Now that we have these basic details, let’s proceed to determine who needs to be involved and where our infrastructure is located.

Who?

There’s a very good chance that your workplace isn’t solely composed of robots and AI (not yet, anyway…). There are humans involved: humans who participate in the recovery process, and humans who are directly affected by an outage (all of these humans, by the way, are collectively referred to as “stakeholders”). These stakeholders are a crucial part of your DR plan and should be treated accordingly.

In many cases, especially with smaller applications, the end users are also the people who maintain the application. This is, of course, the ideal situation, since it streamlines communication about the DR strategy and procedure.

Alas, this isn’t always the case. If your end users are less technical, then communication about recovery procedures becomes that much more critical. There will likely be a section of the DR plan that explains what end users should do to ensure your business doesn’t stop in its tracks. We’ll come back to this later; for now, let’s focus on the technical folks.

If you are creating, reviewing, or revising this DR plan (or reading this post), then you are likely part of the disaster recovery team. The disaster recovery team is exactly what it sounds like: the group of people in charge of getting your application back online and fulfilling the established RPO and RTO requirements (see Part 1).

The disaster recovery team will likely include (but may not be limited to):

Server Administrator and/or Cloud Administrator, responsible for steps taken at the server/OS level to ensure the underlying infrastructure is back online.
Database Administrator, responsible for steps taken at the database level to ensure the application can connect to the database.
Application Owner/SME (Subject Matter Expert), responsible for configuring the application to run on the new primary server, or for making sure it runs on the original primary server once that server is back online.
One or more End Users, responsible for testing the application to verify it is back online, as well as facilitating the Business Continuity Plan if needed (more on that in a bit).
For larger and more intricate organizations, a dedicated Disaster Recovery Specialist facilitates the creation, review, testing, and revision of DR plans for different teams and applications.
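
As a rough illustration, the roster above could be kept in a small structured record so that a missing role is easy to spot during a plan review. This is only a sketch: the class names, the "Payroll" application, and the placeholder people are all hypothetical, and the set of required roles is an assumption you would adjust for your own organization.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TeamMember:
    name: str             # placeholder names only, not real people
    role: str             # one of the roles listed in the DR plan
    responsibility: str   # what this person does during a recovery


@dataclass
class RecoveryTeam:
    application: str
    members: List[TeamMember] = field(default_factory=list)

    def missing_roles(self, required: Set[str]) -> Set[str]:
        """Return the required roles that nobody on the team currently fills."""
        return required - {member.role for member in self.members}


# Assumed minimum set of roles, taken from the list above.
REQUIRED_ROLES = {
    "Server/Cloud Administrator",
    "Database Administrator",
    "Application Owner/SME",
    "End User",
}

team = RecoveryTeam(
    application="Payroll",  # hypothetical application
    members=[
        TeamMember("Alex", "Server/Cloud Administrator", "Bring servers/OS back online"),
        TeamMember("Sam", "Database Administrator", "Restore database connectivity"),
        TeamMember("Jordan", "Application Owner/SME", "Reconfigure and start the application"),
    ],
)

print(team.missing_roles(REQUIRED_ROLES))
```

In this example no end user has been assigned yet, so the check reports that role as missing.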

Each member of this team has an essential and specific role in bringing your application back online. In fact, most of the DR plan involves documenting who is responsible for what and when. We will discuss this in much more detail in Part 3; for now, consider this a brief introduction to the idea of a disaster recovery team.

But what about the end users from earlier? Do we expect them to sit and twiddle their thumbs, especially if the outage lasts longer than the established RTO? If you are creating a DR plan for a widely used, high-impact application (or even if you aren’t), it is good practice to add a business continuity plan to a DR procedure.

A business continuity plan is a description of “offline” methods for end users to continue their day-to-day operations even during an extended outage. These can range from custom spreadsheets to pen-and-paper data collection and entry. The important thing is the ability to collect and process data in the interim. Additionally, after the application is brought back online, end users can integrate the data collected during the outage into the application to prevent further data loss.
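
One way to make the "when do end users switch to offline methods?" question concrete is to compare the outage duration against the RTO. The sketch below assumes a simple policy that is not prescribed anywhere in this series: activate the business continuity plan once the time already elapsed, plus the team's estimate of remaining recovery time, exceeds the RTO. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def should_activate_bcp(
    outage_start: datetime,
    now: datetime,
    rto: timedelta,
    expected_remaining: timedelta = timedelta(0),
) -> bool:
    """Return True if the outage has exceeded, or is expected to exceed, the RTO.

    Assumed policy: elapsed time plus the estimated remaining recovery time
    is compared against the RTO established for the application.
    """
    elapsed = now - outage_start
    return elapsed + expected_remaining > rto


start = datetime(2024, 1, 1, 9, 0)   # hypothetical outage start
rto = timedelta(hours=4)             # hypothetical RTO from Part 1

# Two hours in, with no estimate of remaining work: still within the RTO.
print(should_activate_bcp(start, datetime(2024, 1, 1, 11, 0), rto))
# Two hours in, but recovery is expected to take three more hours: activate.
print(should_activate_bcp(start, datetime(2024, 1, 1, 11, 0), rto, timedelta(hours=3)))
```

Your own trigger may be stricter or looser; the point is that the plan states it explicitly rather than leaving it to judgment mid-outage.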

Where?

For many people in today’s world, this question seems almost irrelevant. In a way, this section is much shorter than it would have been ten years ago. However, it is still crucial to document this information in as much detail as possible.

You might be saying to yourself, nonchalantly:

“My entire infrastructure is in Azure/AWS/etc.! I don’t need to worry about this part of the DR plan!”

Even without on-premises servers, you still need to document your environment. Especially if your company’s cloud environment is highly complex or spans multiple platforms, providing as much information as possible about the location of these servers within your cloud environment will significantly speed up the recovery process.

This information can include:

Failover processes (where applicable)
Access and credentials
Connection strings
Key vaults
and so on.
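
To give a feel for what such an entry might look like, here is a minimal sketch of a location record for one application. Every field name and value is hypothetical, and the record deliberately stores pointers to where credentials and connection strings live (for example, a vault name) rather than the secrets themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CloudLocation:
    application: str
    platform: str                  # e.g. "Azure" or "AWS"
    subscription_or_account: str   # where the resources actually live
    servers: List[str]
    databases: List[str]
    failover_process: str          # pointer to the runbook, where applicable
    credential_location: str       # where access is managed, never the secret itself
    connection_string_ref: str     # name of the stored setting, not its value
    key_vault: str


def incomplete_fields(location: CloudLocation) -> List[str]:
    """List the fields that are still empty, in declaration order."""
    return [name for name, value in vars(location).items() if not value]


finance = CloudLocation(
    application="Finance",                       # hypothetical values throughout
    platform="Azure",
    subscription_or_account="finance-locked-subscription",
    servers=["fin-app-01"],
    databases=["fin-db-01"],
    failover_process="See runbook section 'Finance failover'",
    credential_location="Privileged access group 'finance-admins'",
    connection_string_ref="FinanceDb connection setting",
    key_vault="",                                # not documented yet
)

print(incomplete_fields(finance))
```

A check like this can run as part of a plan review so that gaps are found before a disaster rather than during one.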

For example, suppose you have a financial application, along with its respective servers and databases, locked down in its own Azure subscription for security and accounting purposes. That is critical information to include in the DR plan. If it isn’t there, your poor administrators will be frantically combing through the “general” Azure subscription all afternoon looking for these server and database names. Or, if a disaster affects the only application infrastructure you have hosted in AWS, hours or even days could be wasted searching your on-premises or Azure environments for the servers to fail over or bring online, valuable time that could be spent on the recovery process itself.

Finally, we move on to the more traditional definition of this “where”: on-premises data centers. For fully on-prem and hybrid (that is, part on-prem, part cloud) environments, noting which physical location houses the server(s) in question is critical due to the possibility of physical disasters: events that affect a particular physical location, such as a fire or a large storm. Note the location of your primary data center and of all off-site data centers, along with which servers/nodes (if any) reside in each. This matters most when some servers/databases have their primary replicas in physical data centers that are considered secondary for others.

For example, let’s say a large company in California’s Inland Empire has its primary data center in Barstow, its secondary data center in Victorville (about 32 miles southwest of Barstow), and its tertiary data center in Palm Desert (about 122 miles south-southeast of Barstow). If a fire or an extended power outage occurred at the Barstow data center, any servers, applications, or databases housed there would have to fail over to the Victorville data center, if possible. However, if a significant natural disaster, such as a massive wildfire, were to affect both Barstow and Victorville, then the infrastructure would have to fail over to the Palm Desert data center — again, if possible. In this scenario, whether servers and databases are replicated from Barstow to Victorville (or vice versa) or even to Palm Desert is governed by their respective applications’ RPO and RTO, which were established in Part 1.
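
The failover logic in this scenario can be sketched in a few lines. The server names and replica placements below are made up for illustration, and the single fixed site priority is a simplifying assumption: as noted earlier, some servers may treat a different data center as their primary, in which case each server would need its own ordering.

```python
from typing import Dict, List, Optional, Set

# Site order from the example: primary, secondary, tertiary.
SITE_PRIORITY: List[str] = ["Barstow", "Victorville", "Palm Desert"]

# Hypothetical servers and the sites that hold a replica of each.
# Which sites appear here would be driven by each application's RPO and RTO.
REPLICAS: Dict[str, Set[str]] = {
    "payroll-db": {"Barstow", "Victorville", "Palm Desert"},
    "intranet-web": {"Barstow", "Victorville"},
    "archive-db": {"Barstow"},
}


def failover_site(server: str, affected_sites: Set[str]) -> Optional[str]:
    """Return the highest-priority unaffected site holding a replica, or None."""
    for site in SITE_PRIORITY:
        if site in REPLICAS.get(server, set()) and site not in affected_sites:
            return site
    return None  # "if possible" turned out not to be possible


# A fire at Barstow only.
print(failover_site("payroll-db", {"Barstow"}))
# A wildfire affecting both Barstow and Victorville.
print(failover_site("payroll-db", {"Barstow", "Victorville"}))
print(failover_site("intranet-web", {"Barstow", "Victorville"}))
```

The last call returns nothing usable, which is exactly the kind of gap the DR plan should surface ahead of time: a server with no replica outside the affected area cannot fail over.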

Now your DR plan is starting to come together, but there are still some huge parts missing! Stay tuned for Part 3, where we discuss how the DR process should be completed and, finally, when it should be performed.

Need Help with Your DR Plan?

DR planning can be complicated, and the examples provided in this series are intentionally broad and generic in order to account for many different scenarios. However, we understand that “your mileage may vary”, so if you need DR guidance tailored to your organization, the experts at XTIVIA will be happy to help. Contact us today and discover how our team can help ensure you’re prepared for whatever comes your way.

Check out Foundations of IT Disaster Recovery Planning (Part 1) and Foundations of IT Disaster Recovery Planning (Part 3).