Summary
This blog outlines the essential steps for creating a robust disaster recovery plan, including how to perform recovery, when to conduct tests, and what constitutes a disaster.
You’ve seen it happen to hundreds of other companies. Entire data centers wiped out by major storms. Ransomware that brings an enterprise system to its knees. All hell breaking loose because of a single typo or misclick. A “disaster” can take many forms, but that doesn’t mean you, your colleagues, and even future employees can’t be prepared for whatever comes your way.
In this three-part series, we go over the building blocks of a solid disaster recovery (“DR”) plan, including elements you may not have considered in your initial draft. Whether you’re conducting your twice-yearly test (you do test twice a year, right?) or putting out a real fire, you and your team will be confident in your procedures.
In Part 1, we established the basic premise of our DR plan by going over the scope of our application as well as its RTO and RPO.
In Part 2, we determined who is involved in our DR plan and where our infrastructure is located.
In this final installment, we conclude by outlining exactly how our recovery process will unfold, as well as when it should be implemented.
How?
In Part 2, we introduced the concept of the disaster recovery team and the various roles its members could assume. Now, as promised, we will expand on this concept by exploring how this team will perform the recovery process. This is the “meat” of the DR plan and what separates it from mere documentation (not that documentation in general isn’t important!).
First, it’s important to note that no two DR plans are exactly the same. Different applications have different RTO or RPO requirements, infrastructure complexity, security requirements, and so on. Even multiple application databases sharing a single server may all have different requirements!
That said, remember that loophole I teased in Part 1?
Consider a database with a 24-hour RTO. If it sits in an Always-On Availability Group designed for a database with a 1-hour RTO, that is fine in most cases, provided there is no major impact on budget, available resources, or the rest of that database’s recovery plan. The database with the more relaxed requirements simply “hitchhikes” on the stricter database’s recoverability measures and is brought back online at the same time when the AG fails over to its secondary node.
This is an ideal and extremely practical DR solution: if you can easily extend strong protection to less critical systems, do so. Otherwise, dedicate your resources to protecting the most critical applications first.
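To make the “hitchhiking” idea concrete, here is a minimal T-SQL sketch of adding a less critical database to an availability group that already protects a stricter one. The group and database names ([AG_Critical], [ReportingDB]) are hypothetical, and this assumes the database uses the FULL recovery model and has a current full backup:

```sql
-- On the current primary replica: add the 24-hour-RTO database to the
-- existing availability group that already protects the 1-hour-RTO database.
ALTER AVAILABILITY GROUP [AG_Critical] ADD DATABASE [ReportingDB];

-- On the secondary replica: after restoring the database there WITH NORECOVERY
-- (or using automatic seeding), join it to the group so it fails over together
-- with everything else in the AG.
ALTER DATABASE [ReportingDB] SET HADR AVAILABILITY GROUP = [AG_Critical];
```

From that point on, a failover of [AG_Critical] brings both databases online on the secondary node at the same time.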
Remember that RPO and RTO are worst-case thresholds, not schedules to follow. In other words, resolving an issue in an hour when your RTO allows a full day is a win, not a deviation from the plan.
Now, on to the individual DR process: this should be a set of specific, step-by-step instructions for each member of the disaster recovery team to perform. If you’ve ever been involved in a sudden war room, you know how chaotic it can be. That makes it all the more important to have a clear, established procedure that everyone can understand and follow to resolve the situation as quickly and efficiently as possible.
An important thing to note is that the steps should be sequential, especially if your environment is particularly sensitive, as a single misstep can send the team back to square one. If databases need to be restored in a specific order, if services must be started or stopped in a particular sequence, or if an app’s configuration settings are difficult for the average user to locate without navigating through multiple menus, then these sequences should be highlighted in your procedure.
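For example, a restore sequence might be written directly into the runbook with the required order called out in comments. The database names, backup paths, and dependency below are purely illustrative:

```sql
-- Illustrative only: database names and backup paths are placeholders.
-- Order matters: in this hypothetical app, the configuration database must be
-- online before the main database, or the application's startup checks fail.

-- Step 1: restore the configuration database and bring it online immediately.
RESTORE DATABASE [AppConfigDB]
    FROM DISK = N'\\dr-backups\AppConfigDB_FULL.bak'
    WITH REPLACE, RECOVERY;

-- Step 2: restore the main application database, then roll its log forward.
RESTORE DATABASE [AppMainDB]
    FROM DISK = N'\\dr-backups\AppMainDB_FULL.bak'
    WITH REPLACE, NORECOVERY;

RESTORE LOG [AppMainDB]
    FROM DISK = N'\\dr-backups\AppMainDB_LOG.trn'
    WITH RECOVERY;
```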
Conversely, if there are steps that can be performed simultaneously by different team members, then this should also be mentioned in the procedure to prevent unnecessary delays.
Also, keep in mind that the DR process is for everyone on the disaster recovery team to follow. Therefore, do not be afraid to include technical details, such as scripts (and/or the file locations of those scripts) and connection strings, for quick reference. It may read like jargon to the less technical members of the team, but if it helps your application come back online that much faster, it is invaluable to have on hand.
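As a rough illustration, such a quick-reference block might look like the following; every server name, path, and connection string here is made up for the example:

```sql
/* DR quick reference (illustrative values only)
   Recovery scripts:       \\dr-fileshare\runbooks\app-recovery\
   App connection string:  Server=sql-dr-01.contoso.local;Database=AppMainDB;Integrated Security=SSPI;
   DR SQL instance:        sql-dr-01.contoso.local (secondary AG replica)
*/

-- Post-recovery sanity check: confirm the in-scope databases are back ONLINE.
SELECT name, state_desc
FROM sys.databases
WHERE name IN (N'AppMainDB', N'AppConfigDB');
```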
When?
The final topic of this blog series is when the DR process should be carried out, both in real-world scenarios and as a test. After all, it’s not just written to be referenced; it’s written to be used.
Think of your DR test as a fire drill: everyone needs to know what to do when the alarm sounds, and everyone needs to go through it at least once to understand this life-saving process. Just as different workplaces carry different levels of fire risk, the frequency of your DR tests should depend on how critical your application and infrastructure are, as well as how likely a disaster is to happen (more on what actually counts as a “disaster” in a bit). As a best practice, DR plans should be reviewed and tested at least twice a year; however, you are welcome to schedule more frequent tests if the need arises (or if you’re simply that cautious).
Although there is no substitute for a full DR test, at the bare minimum, the DR plan should be reviewed by the entire DR team every six months, even if no test is actually conducted. This ensures that:
- The list of team members and their respective roles is up to date, factoring in employee turnover where applicable
- New team members are familiar with the DR process and are easily able to follow it on short notice
- Existing team members are reminded of the process and have the opportunity to make changes if needed
- The process itself can still be followed as written, with updates for elements such as server/database names, connection strings, and user interface changes, among others (a quick check like the one sketched after this list can help catch stale names)
- The scope of the DR plan is still relevant to the app’s infrastructure (for example, updating the scope after a migration from on-prem to Azure/AWS)
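As one hedged example of a review-time check, the query below (assuming an Always-On Availability Group is in scope; the group name is a placeholder) confirms that the replicas your runbook references still exist and that data movement is healthy:

```sql
-- Placeholder AG name; adjust to whatever your runbook actually references.
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
WHERE ag.name = N'AG_Critical';
```

If the replica names or health states in the output no longer match what the plan says, that is exactly the kind of drift a semiannual review is meant to catch.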
After all, the last thing you want to have happen during an actual disaster is for your DR plan to be months or even years out of date.
Now that your DR plan is not only complete but also frequently reviewed and tested, the last thing to establish is: What constitutes a “disaster”?
There are many types of disasters with varying scales. For these blogs, the main focus has been on sitewide disasters that require failing over to a secondary (or tertiary) physical location. However, a disaster can be as minor as corruption on a single disk or a hardware failure that renders an entire server inoperable until a replacement can be built. Some “disasters” can even be resolved by just restoring the database(s) from a previous backup. Ideally, the DR plan should account for all of these scenarios, no matter how minor, especially if a single server failure can cause a massive ripple effect across your entire organization.
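For those smaller-scale “disasters,” the relevant runbook step may be nothing more than a point-in-time restore. A minimal sketch, with placeholder names, paths, and timestamp:

```sql
-- Restore the affected database to just before the bad change.
-- All names, paths, and the STOPAT timestamp are placeholders.
RESTORE DATABASE [AppMainDB]
    FROM DISK = N'\\dr-backups\AppMainDB_FULL.bak'
    WITH REPLACE, NORECOVERY;

RESTORE LOG [AppMainDB]
    FROM DISK = N'\\dr-backups\AppMainDB_LOG.trn'
    WITH STOPAT = N'2024-01-15T13:55:00', RECOVERY;
```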
A final note: if some degree of high availability is needed at an on-prem location, two cluster nodes in an Always-On Availability Group often reside in the same physical data center (with little to no data loss, thanks to synchronous data transfer) for quick, convenient failovers during planned maintenance and hardware-level failures. If one of these failovers is needed for a small-scale disaster, the plan should also include instructions for failing over to the paired synchronous node within the same data center, as well as for ensuring that the rest of the application infrastructure can connect to the new primary node, regardless of its physical location.
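A minimal sketch of that local failover, assuming a hypothetical availability group [AG_Critical] with a synchronous-commit secondary in the same data center. Run the first statement on the secondary replica that should become the new primary; the forced variant is a last resort when the primary is unrecoverable and some data loss is acceptable:

```sql
-- Planned/manual failover to the synchronous secondary (no data loss expected).
ALTER AVAILABILITY GROUP [AG_Critical] FAILOVER;

-- Last resort only: forced failover when the primary cannot be reached.
-- ALTER AVAILABILITY GROUP [AG_Critical] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

If the group has an availability group listener configured, pointing application connection strings at the listener name rather than at a specific node makes the “can everything still connect?” step much simpler, since clients follow the listener to whichever node is currently primary.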
I hope you enjoyed this three-part series about the foundations of DR planning, as well as the importance of disaster recovery in any environment, whether small-scale or enterprise-level. Hopefully, this information will help you flesh out your DR plans and even save your organization when disaster strikes. If it does, feel free to let us know!
DR planning can be complicated, and the examples provided in this series are intentionally broad and generic to account for various scenarios. However, we understand that “your mileage may vary”, so if you need DR guidance tailored to your organization, the experts at XTIVIA will be happy to help.
Contact us today and discover how our team can help ensure you’re prepared for whatever comes your way.
Check out Foundations of IT Disaster Recovery Planning (Part 1) and Foundations of IT Disaster Recovery Planning (Part 2).