High Availability vs. Fault Tolerance vs. Disaster Recovery
You need IT infrastructure that you can count on even during the rare network outage, equipment failure, or power issue. When your systems run into trouble, one or more of the three primary availability strategies comes into play: high availability, fault tolerance, and disaster recovery.
While each of these infrastructure design strategies has a role in keeping your critical applications and data up and running, they do not serve the same purpose. Simply because you operate a High Availability infrastructure does not mean you shouldn’t implement a disaster recovery site — and assuming otherwise risks disaster indeed.
What’s the difference between HA, FT, and DR anyway? Do you really need DR if you have HA set up?
A High Availability system is one that is designed to be available 99.999% of the time, or as close to it as possible. Usually this means configuring a failover system that can handle the same workloads as the primary system.
In VMware, HA works by creating a pool of virtual machines and associated resources within a cluster. When a given host fails, its virtual machines are restarted on another host within the cluster. In Azure, admins use the Resiliency feature to build HA, as well as backup and DR, by grouping single VMs into Availability Sets and spreading them across Availability Zones. In either case, the platform is able to detect when a machine is failing and needs to be restarted elsewhere.
For physical infrastructure, HA is achieved by designing the system with no single point of failure; in other words, redundant components are required for all critical power, cooling, compute, network, and storage infrastructure.
One example of a simple HA strategy is hosting two identical web servers with a load balancer splitting traffic between them and an additional load balancer on standby. If one server goes down, the balancer can direct all traffic to the second server (as long as it is configured with enough resources to handle the additional traffic). If the active load balancer goes down, the standby can take over.
The load balancer in this situation is key. HA only works if you have systems in place to detect failures and redirect workloads, whether at the server level or the physical component level. Otherwise you may have resiliency and redundancy in place but no true HA strategy.
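The detect-and-redirect behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular load balancer is implemented; the `Backend` and `FailoverBalancer` names, the `is_healthy` probe, and the `web-1`/`web-2` server names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Backend:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


class FailoverBalancer:
    """Round-robin across backends, skipping any that fail their health check."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends
        self._next = 0  # round-robin cursor

    def pick(self) -> Optional[Backend]:
        # Try each backend at most once, starting from the cursor.
        for offset in range(len(self.backends)):
            index = (self._next + offset) % len(self.backends)
            candidate = self.backends[index]
            if candidate.is_healthy():
                self._next = (index + 1) % len(self.backends)
                return candidate
        return None  # every backend is down: redundancy alone cannot help


# Simulated server state, standing in for real health checks.
status = {"web-1": True, "web-2": True}
balancer = FailoverBalancer([
    Backend("web-1", lambda: status["web-1"]),
    Backend("web-2", lambda: status["web-2"]),
])

first = balancer.pick().name
second = balancer.pick().name
status["web-1"] = False  # simulate a server failure
after_failure = [balancer.pick().name for _ in range(3)]
print(first, second, after_failure)
```

Without the health check inside `pick`, the two servers would still be redundant, but traffic would keep landing on the failed one, which is the gap between redundancy and a true HA strategy.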
A Fault Tolerant system is extremely similar to HA, but goes one step further by aiming for zero downtime. HA still comes with a small amount of downtime, hence the ideal of a perfect HA strategy reaching “five nines” rather than 100% uptime. The time it takes for the intermediary layer, like the load balancer or hypervisor, to detect a problem and restart the VM can add up to minutes or even hours over the course of a year.
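To put the “nines” in concrete terms, the short Python sketch below converts an availability percentage into the downtime it allows per year. The function name is made up for this example, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.2f} minutes of downtime per year")
```

Five nines works out to a little over five minutes per year, while three nines (99.9%) allows nearly nine hours.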
Within VMware, FT ensures availability by keeping a live copy of the VM running on a separate host machine. With only HA configured, the hypervisor attempts to restart the VM on another host in the same cluster, which takes time and may not work if the physical infrastructure behind that cluster is having problems. With FT, the workload fails over to the copy on the separate host. Similarly, in Azure that workload could move to a different Availability Zone. In the recent Azure outage, an entire zone encountered significant issues. If users did not have a Fault Tolerant strategy in place using Resiliency across multiple Availability Zones, they likely encountered downtime.
It might seem as though you don’t need a disaster recovery infrastructure if your systems are configured with HA or FT. After all, if your servers can survive downtime with 99.999% or better availability, why set up a separate DR site?
DR goes beyond FT or HA and consists of a complete plan to recover critical business systems and normal operations in the event of a catastrophic disaster like a major weather event (hurricane, flood, tornado, etc.), a cyberattack, or any other cause of significant downtime. HA is often a major component of DR, which can also consist of an entirely separate physical infrastructure site with a 1:1 replacement for every critical infrastructure component, or at least as many as required to restore the most essential business functions.
DR is configured with a designated Recovery Time Objective (RTO) and Recovery Point Objective (RPO). The RTO is the target time to restore essential systems, and the RPO is the point in time before the disaster to which data is restored (you probably don’t need to restore your backup data from 5 years ago in order to get back to work during a disaster, for example).
A DR platform replicates your chosen systems and data to a separate cluster, where they lie in storage. When downtime is detected, this system is turned on and your network paths are redirected. DR is generally a replacement for your entire data center, whether physical or virtual. HA, by contrast, typically deals with faults in a single component like a CPU or a single server, rather than the complete failure of all IT infrastructure that would occur in a catastrophe.
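As a rough illustration of how these two targets are checked after an incident, the Python sketch below compares one recovery event against an RPO and an RTO. The function name, targets, and timestamps are all hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta


def meets_objectives(last_replica, disaster, restored, rpo, rto):
    """Return (data_loss, downtime, ok) for one recovery event."""
    data_loss = disaster - last_replica  # work done since the last replicated copy
    downtime = restored - disaster       # how long systems were unavailable
    return data_loss, downtime, data_loss <= rpo and downtime <= rto


# Hypothetical targets and timestamps, for illustration only.
rpo = timedelta(minutes=15)
rto = timedelta(hours=1)
last_replica = datetime(2024, 1, 1, 11, 50)
disaster = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 40)

data_loss, downtime, ok = meets_objectives(last_replica, disaster, restored, rpo, rto)
print(f"data loss window: {data_loss}, downtime: {downtime}, objectives met: {ok}")
```

In this made-up scenario, ten minutes of data is lost and systems are down for forty minutes, so both targets are met. A tighter RPO means replicating more often; a tighter RTO means keeping the recovery systems closer to ready.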
Which availability strategy is the best fit?
A disaster recovery plan remains vital to business continuity strategy for many reasons. The most apparent is that the “heartbeat” system, like the load balancer above or a system that monitors the redundant load balancers themselves, must work perfectly in order for HA to succeed. If your hypervisor system goes down with your servers, its HA functions may not work properly.
You must configure your data storage to function with both the primary and redundant HA systems. Data mirroring must work perfectly, or you may end up with newly spun-up servers and missing or outdated data to populate their apps. The data storage systems themselves must be set up as highly available, with no single point of failure. The interconnections between your HA systems must also function perfectly, so diverse connection paths and redundant network infrastructure are also required. All of this should occur without any human intervention: the HA platform should know to seek new local storage or the different arrays/SANs to be used at the failover site.
While the majority of the time HA will work without a hitch, the chances of a problem rearing its head at some point in this complex stack become ever more likely the more elements are at play. HA can be expensive and difficult to configure and administer, even when using a ready-made cloud solution.
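The effect of stacking components can be shown with a little arithmetic. When every element in a chain must work for the service to stay up, and failures are assumed to be independent, the overall availability is the product of the individual availabilities. The Python sketch below uses made-up figures for a hypothetical five-layer stack.

```python
from math import prod


def series_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of one another.
    """
    return prod(component_availabilities)


# Hypothetical stack: load balancer, hypervisor, storage, network, data mirroring,
# each assumed to be available 99.99% of the time.
stack = [0.9999] * 5
overall = series_availability(stack)
print(f"overall availability: {overall:.6f}")
```

Even with each layer at four nines, the combined figure drops below four nines, and it keeps falling as more dependencies are added.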
DR, on the other hand, is relatively inexpensive, as your stored systems can be configured to your desired RPO and RTO and you only pay for the storage rather than the running workloads. In HA and especially FT, your backup servers must be ready to turn on at a moment’s notice, so you are likely to incur charges for those resources on a constant basis. With DR you only pay for the servers when they are spun up from a (presumably geographically separate) pool of compute resources.
HA is a great fit for the most critical applications and systems, the very backbone of your organization: anything that must be running 24/7/365. For the rest, DR is a more cost-effective solution that can still deliver recoveries within minutes.