High Availability vs. Fault Tolerance vs. Disaster Recovery

May 18, 2023

You need IT infrastructure that you can count on even during the rare network outage, equipment failure, or power issue. When your systems run into trouble, one or more of the three primary availability strategies come into play: high availability, fault tolerance, and disaster recovery.

While each of these infrastructure design strategies has a role in keeping your critical applications and data up and running, they do not serve the same purpose. Operating a High Availability infrastructure does not mean you can skip a disaster recovery site, and assuming otherwise invites exactly the disaster you are trying to avoid.

What’s the difference between HA, FT, and DR anyway? Do you really need DR if you have HA set up?

High Availability

A High Availability system is one that is designed to be available 99.999% of the time, or as close to it as possible. Usually this means configuring a failover system that can handle the same workloads as the primary system.
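To make the "five nines" target concrete, the short Python sketch below converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration, not part of any vendor's SLA definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that an availability target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare a few common targets, from "two nines" up to "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Under these assumptions, 99.999% leaves roughly 5.26 minutes of downtime per year, while 99.9% leaves almost nine hours.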

In VMware, HA works by creating a pool of virtual machines and associated resources within a cluster. When a given host fails, its virtual machines are restarted on another host within the cluster. In Azure, admins use the Resiliency feature to build HA, as well as backup and DR, by grouping single VMs into Availability Sets and spreading them across Availability Zones. In either case, the platform is able to detect when a machine is failing and needs to be restarted elsewhere.

For physical infrastructure, HA is achieved by designing the system with no single point of failure; in other words, redundant components are required for all critical power, cooling, compute, network, and storage infrastructure.
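A rough way to see why redundancy matters is to estimate combined availability. The sketch below uses the standard series and parallel formulas and assumes component failures are independent, which shared power feeds or shared switches can violate in practice. The function names and the sample figures are hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies where any single working copy is enough."""
    if copies < 1:
        raise ValueError("need at least one copy")
    # The group is down only when every copy is down at the same time.
    return 1 - (1 - component_availability) ** copies


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every stage must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# One 99%-available server versus a redundant pair of the same server.
single = parallel_availability(0.99, 1)
pair = parallel_availability(0.99, 2)
print(f"single: {single:.4%}, redundant pair: {pair:.4%}")

# A chain of power, network, and compute is weaker than any one of its stages.
chain = series_availability(0.999, 0.999, pair)
print(f"power + network + compute pair: {chain:.4%}")
```

The chain result shows why every critical layer needs its own redundancy: a single non-redundant stage caps the availability of everything behind it.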

One example of a simple HA strategy is hosting two identical web servers with a load balancer splitting traffic between them and an additional load balancer on standby. If one server goes down, the balancer can direct all traffic to the remaining server (as long as that server has enough resources to handle the additional traffic). If the active load balancer goes down, the standby can take over.

The load balancer in this situation is key. HA only works if you have systems in place to detect failures and redirect workloads, whether at the server level or the physical component level. Otherwise you may have resiliency and redundancy in place but no true HA strategy.
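The detect-and-redirect behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular load balancer product works: the `FailoverBalancer` class, the backend names, and the dictionary standing in for real health checks are all hypothetical.

```python
from typing import Callable, List


class FailoverBalancer:
    """Round-robin balancer that skips backends failing a health check."""

    def __init__(self, backends: List[str], is_healthy: Callable[[str], bool]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._next = 0

    def pick(self) -> str:
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if self._is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy backends available")


# Simulated health state; a real deployment would probe each server instead.
health = {"web-1": True, "web-2": True}
balancer = FailoverBalancer(["web-1", "web-2"], lambda name: health[name])

first_two = [balancer.pick() for _ in range(2)]      # traffic is split
health["web-1"] = False                              # web-1 goes down
after_failure = [balancer.pick() for _ in range(3)]  # all traffic goes to web-2
print(first_two, after_failure)
```

Without the health check, the balancer would keep sending half of all requests to the failed server, which is the "redundancy but no true HA" situation.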

Fault Tolerance

A Fault Tolerant system is extremely similar to HA, but goes one step further by guaranteeing zero downtime. HA still comes with a small amount of downtime, which is why even a perfect HA strategy aims for “five nines” rather than 100% uptime. The time it takes for the intermediary layer, like the load balancer or hypervisor, to detect a problem and restart the VM can add up to minutes or even hours over the course of a year.
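As a back-of-the-envelope illustration of how failover time accumulates, the sketch below turns a yearly incident count and a per-incident detect-and-restart time into an achieved availability figure. The function name and the sample numbers (four incidents of two minutes each) are assumptions chosen for the example, not measurements.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # leap years are ignored


def achieved_availability(incidents_per_year: int, seconds_per_failover: float) -> float:
    """Return percent availability when each incident costs a fixed failover delay."""
    downtime_seconds = incidents_per_year * seconds_per_failover
    return 100 * (1 - downtime_seconds / SECONDS_PER_YEAR)


# Hypothetical year: four host failures, each taking two minutes to detect and restart.
result = achieved_availability(incidents_per_year=4, seconds_per_failover=120)
print(f"{result:.5f}% available, five nines met: {result >= 99.999}")
```

With these sample numbers, eight minutes of cumulative failover time is already enough to miss a 99.999% target, which is the gap a fault tolerant design aims to close.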

Within VMware, FT ensures availability by keeping a live copy of the VM on a separate host machine. With only HA configured, the hypervisor attempts to restart the VM on the same host cluster. If the physical infrastructure powering that host is having problems, HA may not work. With FT, the copy on the separate host takes over the workload without a restart. Similarly, in Azure that workload could move to a different Availability Zone. In the recent Azure outage, an entire zone encountered significant issues. If users did not have a Fault Tolerant strategy in place using Resiliency across multiple Availability Zones, they likely encountered downtime.
