Architecture Case Study: Complexity within High Availability Architectures


Today we will be going over High Availability. This is an essential aspect of architecture that one must understand before designing any technical solution.

  • High Availability refers to eliminating any single points of failure within systems, giving them the ability to stay alive and continue business operations without failing. The operational performance businesses promise customers is typically a guaranteed amount of up-time per year. This can apply to network devices, databases, and even applications. When calculating and evaluating an end-to-end system architecture's availability, we count the nines: 99%, 99.9%, 99.99%, and so on. It is important to calculate a business's up-time metrics because, depending on the industry, lives can be at stake, or millions upon millions of dollars could be lost.

Load Balancers & Servers (Health checks) - As an example, load balancers are high-availability network devices. They eliminate single points of failure among servers and increase performance. Conducting health checks to make sure each server is alive improves availability drastically: instead of a dead server incurring excess downtime, the load balancer redirects its traffic to another server that is alive. Without these devices, businesses could not auto-scale or increase their performance. Load balancers let them group hundreds of servers together to scale, increase productivity, and ensure maximum performance by eliminating any single point of failure.
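To make the health-check idea concrete, here is a minimal Python sketch. The backend addresses and the /health endpoint are hypothetical assumptions for illustration; real load balancers such as HAProxy, NGINX, or a cloud provider's ELB do this natively and far more robustly.

```python
import urllib.request

# Hypothetical backend servers behind the load balancer.
BACKENDS = ["http://10.0.1.10:8080", "http://10.0.1.11:8080"]

def is_alive(base_url: str, timeout: float = 2.0) -> bool:
    """Health check: probe a hypothetical /health endpoint on the backend.
    Any network error or non-200 response marks the server as dead."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def healthy_backends() -> list[str]:
    """Route traffic only to servers that passed their health check."""
    return [b for b in BACKENDS if is_alive(b)]
```

In a real deployment the checks run continuously in the background, and a failed server is re-admitted once it passes again; this sketch only shows the routing decision.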

[Diagram: a load balancer health-checking a group of servers]

The 99% High Availability Theory

The 99% high availability theory is a probability calculation for determining how one can achieve the highest availability possible. Organizations rely heavily on it to increase their availability because it is a very straightforward math equation. For independent redundant components, the combined availability is 1 - (1 - A₁) × (1 - A₂) × ...; in other words, redundancy multiplies the components' unavailabilities, so every redundant component you add stacks more 9's onto your final answer. (Note that multiplying the availabilities themselves, e.g. 99.9% × 99.9%, models components in series and actually lowers the result, to 99.8%.) In practice, failover is never instant and failures can be correlated, so architects often budget conservatively and treat each added independent layer as buying roughly one extra 9; that is the convention the levels below follow. The calculation is simple, but acting on it can easily cost an organization millions or even billions per year to implement. Many cloud providers say their services are 99.9999% or more available, and with all due respect, most of that isn't accurate information.
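Here is the idealized math as a minimal Python sketch, assuming fully independent failures; `combined_availability` and `yearly_downtime` are hypothetical helper names used throughout the examples below.

```python
def combined_availability(availabilities: list[float]) -> float:
    """Idealized parallel redundancy: multiply the unavailabilities.
    Assumes the components fail independently, which real systems rarely do."""
    failure = 1.0
    for a in availabilities:
        failure *= (1.0 - a)
    return 1.0 - failure

def yearly_downtime(availability: float) -> str:
    """Convert an availability fraction into downtime per 365-day year."""
    seconds = (1.0 - availability) * 365 * 24 * 3600
    hours, rem = divmod(round(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"
```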

Now, let's walk through the different levels of availability, one 9 at a time!

99.9% Architecture & Downtime Calculation -

The first level is an architecture that maintains 99.9% availability in a given year. This is not very hard or expensive to implement. One must understand that one is none, two is one, and three is greater than two. Having one of anything means a single point of failure, and that brings down our availability. To prevent this, we provision two data centers in two separate availability zones in two different regions. An availability zone is always within a region, so if we provision the two data centers in two availability zones within the same region, we still have a single point of failure: if the whole region goes down, we go down with it. This is how architects must think when designing highly available systems. For simplicity's sake, a data center typically maintains about 99% availability; the idealized math for two independent 99% data centers gives 99.99%, but since failover between them is never instant or perfect, 99.9% is the safer budget for the pair. An architecture with this availability level will have a downtime of 8 hours and 46 minutes a year.
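Plugging this level into the sketch above (using the text's hypothetical 99% figure per data center):

```python
print(combined_availability([0.99, 0.99]))  # 0.9999 idealized; budgeted as 99.9%
print(yearly_downtime(0.999))               # 8h 45m 36s, i.e. about 8 hours 46 minutes
```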

[Diagram: the 99.9% architecture, two data centers in two availability zones across two regions]

99.99% Architecture & Downtime Calculation -

The second level is an architecture that maintains 99.99% availability in a given year. This is a little harder to implement and can be more expensive than the previous availability level. In the previous example we went with a single cloud, because that availability level can tolerate some downtime. If we want to bring it up higher, we're going to have to provision a second, separate cloud. The reason is that if we use a single cloud and the entire cloud goes down, we're down too, which decreases our availability and increases downtime. Keeping the same model from the previous example, we can make an exact copy and implement it in a separate cloud such as Azure. With the first cloud being AWS, each cloud brings about 99.9% availability, and combining them, one extra 9 per our conservative rule, gives us an availability level of 99.99%. An architecture with this availability level will have a downtime of about 52 minutes and 34 seconds a year.
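Under the idealized independence assumption, two 99.9% clouds in parallel would score far higher than 99.99%; the one-extra-9 budget is the conservative, real-world figure:

```python
print(combined_availability([0.999, 0.999]))  # 0.999999 idealized; budgeted as 99.99%
print(yearly_downtime(0.9999))                # 0h 52m 34s a year
```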

[Diagram: the 99.99% architecture, the same design replicated across AWS and Azure]

99.999% Architecture & Downtime Calculation -

The third level is an architecture that maintains 99.999% availability in a given year. This is much harder to implement and will definitely be more expensive than the previous method. There are a few ways to approach this, but the most elegant solution is adding another cloud, because bringing the level higher means coming up with another independent 99.9%. Organizations like hospitals really benefit from availability levels this high because they run lots of IoT devices that require the internet to give the correct medication to patients; in some cases, patients may die if cloud outages prevent doctors or IoT devices from using the internet. By contrast, a rural book store would just be wasting its resources, since downtime is not as critical for it. Keeping the same model from the previous example, we can make an exact copy and implement it across three separate clouds: Google Cloud Platform, Microsoft Azure, and Amazon Web Services. With each cloud bringing in about 99.9% availability, combining them gives us an availability level of about 99.999%. An architecture with this availability level will have a downtime of about 5 minutes and 15 seconds a year.

[Diagram: the 99.999% architecture, the same design replicated across AWS, Azure, and GCP]

99.9999% Architecture & Downtime Calculation -

The fourth level is an architecture that maintains 99.9999% availability in a given year. This is one of the hardest to implement and is the most expensive option of any of the previous availability levels. The same logic we've been using thus far follows us into this architecture: we need to add more 99.9%'s to push our availability level higher. If three clouds have been shown to give us a 99.999% availability level, then adding another cloud will be necessary. Financial institutions such as banks, whose trading algorithms can swing drastically on a single nanosecond, are the businesses interested in this level. Maintaining four clouds at once is expensive, but in business the price doesn't matter as long as there is a return on the investment. This is why some companies will spend billions of dollars on an end-to-end system architecture that almost never goes down. I will add the Oracle cloud to the group of cloud providers we already have and implement the same data center design from the first availability level. With each cloud bringing in about 99.9% availability, combining all of them gives us an availability level of about 99.9999%. An architecture with this availability level will have a downtime of just 31.5 seconds a year.
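The downtime figures quoted across all four levels can be reproduced with the sketch above:

```python
for a in (0.999, 0.9999, 0.99999, 0.999999):
    print(f"{a * 100:.4f}% -> {yearly_downtime(a)}")
# 99.9000% -> 8h 45m 36s
# 99.9900% -> 0h 52m 34s
# 99.9990% -> 0h 5m 15s
# 99.9999% -> 0h 0m 32s
```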

[Diagram: the 99.9999% architecture, the same design replicated across AWS, Azure, GCP, and Oracle Cloud]

NB: These architectures are high-level representations; the full designs would be more detailed and much more complex. The intended audience is the general public.

Thank you for your time, and I hope you enjoyed this architecture case study!

Dan, The Architect

  • May the cloud be with you.