The Four Pillars of High Availability

Ali Kanso
5 min read · Sep 27, 2021

Imagine you are a Site Reliability Engineer (SRE) who is tasked with examining whether an application deployment in your cluster is enabled for high availability. Where would you start? What are the main aspects you should analyze?

In this article, we argue that most transient failures can be mitigated by leveraging a high availability (HA) solution that encompasses four main pillars.

We consider a simple multi-tiered application, and examine what it would take to enable it for high availability.

Simple multi-tiered application

Defining HA:

In information technology (IT), HA is attained when availability reaches five-nines, which means the system is available 99.999% of the time. This leaves room for roughly 5 minutes of downtime per year.
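The five-nines budget is easy to verify with a quick back-of-the-envelope calculation:

```python
# Quick check of the five-nines downtime budget: the unavailable fraction
# (1 - 0.99999) of the minutes in an average year.
minutes_per_year = 365.25 * 24 * 60
downtime = minutes_per_year * (1 - 0.99999)
print(round(downtime, 2))  # ≈ 5.26 minutes per year
```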

The Four Pillars:

Going back to our original question of how to make an application highly available, the SRE should ask themselves two fundamental questions:

1- Is there any single point of failure in the application?

2- How do I eliminate those single points of failure?

If the answer to the first question is yes, then the answer to the second question requires implementing an HA solution that encompasses the following four pillars:

Redundancy

Failures are bound to happen; it is only a matter of when. A single point of failure is mitigated by redundancy. In IT systems, regardless of how reliable a software component is, it still relies on many other components (hardware, OS, hypervisor, network elements, etc.) that may fail to perform their function. Luckily, the cost of redundancy for software applications is small compared to systems such as aircraft or spacecraft, where mechanical components must be physically replicated.

Replication across tiers

Redundancy is needed across all the tiers of the application. A component of a given tier is replicated with one or more “identical” replicas.

Typically, load balancing is needed to split the workload across the redundant replicas of the application. For instance, at the business-logic tier we might have front-end web servers serving static and dynamic web content, backed by machine-learning-based recommendation and content-management systems. The workload at this layer is typically split among the active replicas through load balancers. It is worth noting that although dedicated load balancers are the norm, client-side HTTP or gRPC load balancing based on DNS entries can also be used in some situations.
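A minimal sketch of splitting work across redundant replicas via round-robin selection. The replica addresses are hypothetical; a real deployment would resolve them from DNS or a service-discovery system rather than a hard-coded list.

```python
import itertools

class RoundRobinBalancer:
    """Rotates through a fixed list of replica addresses."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def next_replica(self):
        # Each call returns the next replica in rotation,
        # spreading requests evenly across the tier.
        return next(self._cycle)

# Hypothetical web-tier replicas behind the balancer.
balancer = RoundRobinBalancer(["web-1:8080", "web-2:8080", "web-3:8080"])
targets = [balancer.next_replica() for _ in range(6)]
print(targets)  # each replica receives two of the six requests
```

The same rotation logic is what a client-side gRPC or HTTP balancer applies to the address list it gets back from DNS.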

While redundancy is commonly associated with load balancing, this is not always the case; it depends on the redundancy model being used. For instance, an active/standby redundancy model does not require a load balancer; instead, it requires load redirection once a failure is detected and the standby replica assumes the active role. But how do we detect failures? This brings us to monitoring.

Monitoring

Without monitoring, redundancy is impotent. A constant monitoring mechanism needs to be in place to ensure that the workload is reaching a healthy replica. Load balancers use basic monitoring mechanisms that scan your application components' ports and specific HTTP/gRPC endpoints, and stop sending traffic to non-responsive backends. However, this might not be enough: being alive and being ready are different things. For instance, a business-logic-tier replica might be alive, but if the data layer is down, the business layer is not ready to handle traffic.

A typical approach is to have each application component assess its own health/readiness and expose this information on a specific endpoint. However, once we detect that a component is faulty or unhealthy, what do we do about it? This brings us to recovery.
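The alive-versus-ready distinction can be sketched as two separate checks on a component. The class and its fields here are illustrative, not from the article; in Kubernetes, for example, the same split appears as liveness and readiness probes.

```python
class Component:
    """A toy business-logic component with separate health checks."""

    def __init__(self):
        self.process_running = True   # the process itself is up
        self.data_layer_up = True     # hypothetical downstream dependency

    def healthz(self):
        # Liveness: is the process itself functioning?
        return self.process_running

    def readyz(self):
        # Readiness: can we actually serve traffic? That also requires
        # the downstream data layer to be reachable.
        return self.process_running and self.data_layer_up

c = Component()
c.data_layer_up = False        # simulate a data-layer outage
print(c.healthz())             # True  - still alive...
print(c.readyz())              # False - ...but not ready for traffic
```

A monitoring system that only checked `healthz` would keep routing traffic to a replica that cannot actually serve it.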

Recovery

The main purpose of monitoring is to trigger recovery. Recovery is composed of a series of steps whose order matters to ensure consistency.

Recovery steps

The first step in recovery is to isolate the faulty component. Note that faulty does not always mean failed: if the monitoring system flags a machine as a security risk for missing the latest operating-system patch, you would want to isolate all the application components running on that machine as soon as possible. Isolation consists of preventing traffic from flowing in or out of the faulty component and then terminating it.

The second step is to make sure the workload that was once assigned to the faulty component is redirected to another healthy replica. In an active/active redundancy model, the other replica is already able to receive requests. However, in an active/standby model (mainly used at the data layer, e.g. strongly consistent databases), the standby replica might need to become active before receiving traffic. A typical case is when a leader-election process must take place for the standby to become active.
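Standby promotion can be sketched with a deliberately naive election rule: when the active replica is marked faulty, the lowest-numbered healthy replica becomes the new active. Production systems use consensus protocols (e.g. Raft or Paxos) for this, not a simple sort; the rule below only illustrates the handover.

```python
def elect_leader(replicas):
    """Pick the lowest-named healthy replica as leader (toy rule)."""
    healthy = sorted(name for name, ok in replicas.items() if ok)
    return healthy[0] if healthy else None

# Hypothetical data-layer pair: db-0 active, db-1 standby.
replicas = {"db-0": True, "db-1": True}
print(elect_leader(replicas))   # db-0 is the active replica

replicas["db-0"] = False        # the active replica fails
print(elect_leader(replicas))   # standby db-1 is promoted to active
```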

The third step is repair. Repair may consist of a simple in-place restart, or stopping the application's components on one machine and starting them on another (failing over). For most transient errors that might be enough. Other repair actions include re-provisioning (e.g. recreating a VM and re-installing the software stack with a fresh configuration). If automated repair is not successful, human intervention might be needed to bring the system back to normal.
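The escalation from cheap to expensive repair actions can be sketched as a simple loop; the action names and the `page-oncall` fallback are illustrative, not part of any real API.

```python
def repair(component, actions=("restart", "failover", "reprovision")):
    """Try repair actions from cheapest to most disruptive."""
    for action in actions:
        if component.try_action(action):
            return action          # repair succeeded at this level
    return "page-oncall"           # automation exhausted; escalate to a human

class FlakyComponent:
    """Toy component that is only fixed by one specific action."""
    def __init__(self, fixed_by):
        self.fixed_by = fixed_by
    def try_action(self, action):
        return action == self.fixed_by

print(repair(FlakyComponent("restart")))      # transient error: restart suffices
print(repair(FlakyComponent("reprovision")))  # needed a fresh VM
print(repair(FlakyComponent("none")))         # automation failed: page-oncall
```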

The final step is making sure the recovered component rejoins the system, i.e., making it known to the DNS server, load balancer, monitoring system, service discovery, etc.

Checkpointing

While we are seeing more stateless components in applications nowadays, fully stateless applications are rare. Most applications push their state down to the data layer, and to consistently replicate the data-layer components, checkpointing is needed. A checkpoint is simply a snapshot of the system state that we can fall back to in case of failure.
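The snapshot-and-fall-back idea can be sketched with an in-memory store; a real data layer would persist checkpoints durably (e.g. to disk or replicated storage), but the mechanics are the same.

```python
import copy

class StatefulStore:
    """Toy stateful component with snapshot-based recovery."""

    def __init__(self):
        self.state = {}
        self._checkpoint = {}

    def checkpoint(self):
        # Take a consistent snapshot of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def restore(self):
        # Fall back to the last snapshot after a failure.
        self.state = copy.deepcopy(self._checkpoint)

store = StatefulStore()
store.state["balance"] = 100
store.checkpoint()
store.state["balance"] = -999   # state corrupted by a fault
store.restore()
print(store.state["balance"])   # back to the checkpointed value, 100
```

Anything written after the last checkpoint is lost on restore, which is why checkpoint frequency is a trade-off between overhead and recovery-point objective.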

Failure recovery in a stateful component

Conclusion:

High availability is based on well established principles:

  • Redundancy
  • Monitoring
  • Recovery
  • Checkpointing

The four pillars presented in this article are timeless; they appear in virtually every highly available system in one way or another, regardless of whether the system is a database management system or a large application like a social-networking service with billions of users. The difference lies in the scope (cluster/zone/region) where the solution is applied. The scope also determines the difference between high availability and disaster recovery.

The mechanisms used to implement each pillar are fundamentally different, yet, when they work together in a well choreographed manner, the magic of high availability occurs.


Ali Kanso

Ali is a Principal Software Engineer at Microsoft. He previously worked for IBM Research and Ericsson Research. He holds a Ph.D. in computer engineering.