High availability
Kamil Porembiński
12.10.2019

What is high availability of a server or website?

Everyone who runs an online business should be familiar with this subject, and they certainly remember it the moment their website or shop stops working. The business is then, colloquially speaking, down and not earning money. And yet breakdowns happen. Can they be avoided, and how much does that cost?

When running an offline business, you have to guard against many unexpected problems. If you do wedding photography, you carry several camera bodies. If you run a transport company, you arrange a spare car, spare parts or fast service from a mechanic. Some risks you accept and others you counteract. The digital world works exactly the same way, which people unfortunately forget or simply do not know.

What's worse, marketing slogans like “the cloud always works” or “upload it to the server and it won't disappear” do not help anyone understand availability.

It's time to face the topic of Internet service availability because, as you can probably guess, online services can stop working too.

What is availability?

Availability is one of the basic measures of how resistant a system is to failure. It is the share of time during which a system or service operates without failure.

If we take 1 day as the unit of time and the availability of the system was 75%, the system provided services for 18 hours and was down for the remaining 6 hours. For IT systems, availability is also commonly expressed as an availability class.

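Unavailability per year, month, week and day:

| Availability | Class | Type | Yearly | Monthly | Weekly | Daily |
|---|---|---|---|---|---|---|
| 90% | 1 | Unmanaged | 36d 12h 34m 55.2s | 3d 1h 2m 54.6s | 16h 48m 0.0s | 2h 24m 0.0s |
| 99% | 2 | Managed | 3d 15h 39m 29.5s | 7h 18m 17.5s | 1h 40m 48.0s | 14m 24.0s |
| 99.9% | 3 | Well Managed | 8h 45m 57.0s | 43m 49.7s | 10m 4.8s | 1m 26.4s |
| 99.99% | 4 | Fault Tolerant | 52m 35.7s | 4m 23.0s | 1m 0.5s | 8.6s |
| 99.999% | 5 | High-Availability | 5m 15.6s | 26.3s | 6.0s | 0.9s |
| 99.9999% | 6 | Very-High-Availability | 31.6s | 2.6s | 0.6s | 0.1s |
| 99.99999% | 7 | Ultra-Availability | 3.2s | 0.3s | 0.1s | 0.0s |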

The availability class simply tells you how many nines the figure contains. The higher the class, the more available the system is. Popular network services and hosting providers offer availability ranging from 90% to 99.9%. The scale is continuous, so nothing prevents a provider from offering a service with, for example, 95% availability.
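The downtime figures for each availability level can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the year length of 365.2425 days (with a month being one twelfth of it) is an assumption inferred from the published figures.

```python
# Assumption: a year is 365.2425 days and a month is one twelfth of a year.
SECONDS_PER_PERIOD = {
    "yearly": 365.2425 * 86400,
    "monthly": 365.2425 * 86400 / 12,
    "weekly": 7 * 86400,
    "daily": 86400,
}


def downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the downtime (in seconds) implied by an availability level."""
    unavailable_fraction = 1 - availability_percent / 100
    return SECONDS_PER_PERIOD[period] * unavailable_fraction


def format_duration(seconds: float) -> str:
    """Format seconds as e.g. '7h 18m 17.5s', skipping leading zero units."""
    tenths = round(seconds * 10)  # work in integer tenths of a second
    days, tenths = divmod(tenths, 864000)
    hours, tenths = divmod(tenths, 36000)
    minutes, tenths = divmod(tenths, 600)
    parts = []
    for suffix, value in (("d", days), ("h", hours), ("m", minutes)):
        if value or parts:
            parts.append(f"{value}{suffix}")
    parts.append(f"{tenths / 10:.1f}s")
    return " ".join(parts)


print(format_duration(downtime_seconds(99.9, "daily")))   # 1m 26.4s
print(format_duration(downtime_seconds(99, "monthly")))   # 7h 18m 17.5s
```

Working in integer tenths of a second avoids floating-point results such as "60.0s" appearing in the seconds field.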

What is Service Level Agreement?

An SLA, or Service Level Agreement, is an agreement that specifies, among other things, the level of availability of the services provided. Such an agreement may also contain clauses on contractual penalties for failing to meet the required availability. AWS cloud services are one example.

| AWS service | SLA | When compensation applies | Amount of compensation |
|---|---|---|---|
| EC2 | 99.99% | Less than 99.99% but equal to or greater than 99.0% | 10% |
| EC2 | 99.99% | Less than 99.0% but equal to or greater than 95.0% | 30% |
| EC2 | 99.99% | Less than 95.0% | 100% |
| RDS | 99.95% | Less than 99.95% but equal to or greater than 99.0% | 10% |
| RDS | 99.95% | Less than 99.0% but equal to or greater than 95.0% | 25% |
| RDS | 99.95% | Less than 95.0% | 100% |
| S3 | 99.9% | Less than 99.9% but equal to or greater than 99.0% | 10% |
| S3 | 99.9% | Less than 99.0% but equal to or greater than 95.0% | 25% |
| S3 | 99.9% | Less than 95.0% | 100% |

When reading the Amazon Web Services SLA, remember that it covers the services they offer, not your virtual machine. The SLA for the EC2 service does not guarantee the availability of a particular server, but of the service that lets you create and manage servers. This means that if the machine itself fails, you still have a service available on which to start a backup server.
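To see how such compensation thresholds work in practice, here is a minimal sketch that looks up the compensation for a measured monthly availability. The thresholds are copied from the AWS table in this article and may not match the current AWS terms; the names `CREDIT_TIERS` and `compensation_percent` are hypothetical.

```python
# Thresholds as listed in this article (not fetched from AWS).
# Each tier is (lower bound of measured availability in %, compensation in %).
CREDIT_TIERS = {
    "EC2": {"sla": 99.99, "tiers": [(99.0, 10), (95.0, 30), (0.0, 100)]},
    "RDS": {"sla": 99.95, "tiers": [(99.0, 10), (95.0, 25), (0.0, 100)]},
    "S3": {"sla": 99.9, "tiers": [(99.0, 10), (95.0, 25), (0.0, 100)]},
}


def compensation_percent(service: str, measured_availability: float) -> int:
    """Return the compensation (in % of the bill) for a measured availability."""
    entry = CREDIT_TIERS[service]
    if measured_availability >= entry["sla"]:
        return 0  # the SLA was met, so no compensation is due
    for lower_bound, credit in entry["tiers"]:
        if measured_availability >= lower_bound:
            return credit
    return 100  # only reached for negative input; kept as a safe default


print(compensation_percent("EC2", 99.5))  # 10
print(compensation_percent("RDS", 97.0))  # 25
```

Note that compensation is a share of what you pay for the service, not a payment covering the business you lost during the outage.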

What availability is right for me?

“My website must always work” and “I lost millions during 20 minutes of failure!” are phrases we have all heard or read on the web many times. After all, each of us wants the highest possible availability of services. Remember, though, that systems with 99.99% availability can cost a lot. So how do you assess which availability is appropriate?

First of all, start by analysing the risks associated with the unavailability of a particular service. Breakdowns like to appear at the least expected moment, e.g. during increased sales on Black Friday. The fact that a website or server has worked perfectly for the last few months does not mean it will still be working tomorrow.

Every business owner must ask the same question: how long can a given service be down before it becomes dangerous? If you run an online store selling one size of nails, you probably won't lose much when it is offline for a few hours. It is different when you run, for example, a system that monitors a patient's vital signs. There, downtime is out of the question.

Once you have determined the maximum unavailability you can accept, you can choose a service that meets this requirement. High availability is not free and can cost a lot of money, so sometimes it is worth simply accepting the risk yourself.
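The step from "how long can I be down" to "which availability do I need" is simple arithmetic. The sketch below assumes an average month of 365.2425 / 12 days, and its function names are illustrative only.

```python
import math

# Assumption: an average month is one twelfth of a 365.2425-day year.
HOURS_PER_MONTH = 365.2425 * 24 / 12


def required_availability(max_downtime_hours_per_month: float) -> float:
    """Return the minimum availability (in %) that fits the downtime budget."""
    return 100 * (1 - max_downtime_hours_per_month / HOURS_PER_MONTH)


def availability_class(availability_percent: float) -> int:
    """Return the number of leading nines, e.g. 99.9 -> 3 and 99.5 -> 2."""
    unavailable = 1 - availability_percent / 100
    if unavailable <= 0:
        raise ValueError("100% availability has no finite class")
    # The small epsilon compensates for floating-point error at exact nines.
    return math.floor(-math.log10(unavailable) + 1e-9)


needed = required_availability(4)  # we accept at most 4 hours down per month
print(f"{needed:.2f}% -> class {availability_class(needed)}")  # 99.45% -> class 2
```

A budget of four hours a month lands between class 2 and class 3, which is a useful hint that a 99% offer is not enough and a 99.9% offer leaves some margin.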

How much can it cost me?

Definitely more than a single server or hosting plan. Guaranteed high availability, e.g. class 4 (about 4 minutes of unavailability per month), requires a properly built infrastructure, including:

  • more servers and more network devices: hardware likes to play tricks, so the machines should at least be duplicated,
  • administrators working in shifts: failures do not take time off and can occur even on holidays,
  • appropriate software: the website itself must be prepared to run on multiple servers at the same time.

On top of that come backup Internet connections, server components that can be replaced while the machine is running, backup power supply, monitoring, infrastructure testing and so on. The list is long.

It will not be a great discovery to say that a website based on one server works as long as it doesn’t break down.

What can break down? For example a hard drive, the power supply or the cooling. We can buy a more expensive server with redundant power supplies and fans, and with drives configured in a redundant array. This increases the availability of the server, but it is still a single server that can suffer other kinds of failure.

Let’s calculate the SLA and the costs

To illustrate how the price of a service varies with the availability of its servers, let's work through the example of an online store built on a popular CMS.

For simplicity, we will leave out things like software, administrator response time, network availability and how the servers are set up in the data center. Keep in mind, though, that adding a second server will not increase availability if it is plugged into the same power source as the first one. A power failure will simply take both machines down together.

The online shop runs on two servers: a web server and a database server. The SLA for each of these machines is 99.5% per month, which allows 3h 39m 8.7s of unavailability. And what is the SLA for the whole system?

Architecture of the online store

With two servers arranged in series, the whole system fails when:

  • the web server does not work,
  • the database server does not work,
  • both servers do not work.

The mathematics here is unforgiving: the SLA of such an infrastructure is the product of the SLAs of the individual servers.

SLA = 99.5% × 99.5% ≈ 99%

| Server | Cost | SLA | Unavailability |
|---|---|---|---|
| WWW | $200 | 99.5% | 3h 39m 8.7s |
| DB | $200 | 99.5% | 3h 39m 8.7s |
| Total | $400 | 99% | 7h 18m 17.5s |
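The series calculation is easy to check in code. This sketch uses a hypothetical helper, `series_availability`, and takes availabilities as fractions (0.995 rather than 99.5%).

```python
from math import prod


def series_availability(*components: float) -> float:
    """Availability of a chain in which every component must work.

    Each argument is a fraction, e.g. 0.995 for 99.5%.
    """
    return prod(components)


web_server = 0.995
database_server = 0.995
total = series_availability(web_server, database_server)
print(f"{total:.4%}")  # 99.0025%
```

Every component added in series can only lower the result, because each factor is smaller than 1.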

Over 7 hours of unavailability is practically one working day. Many online shops cannot afford that much downtime, so some redundancy would be useful.

To increase the availability of the entire store, we add a second web server and a load balancer that directs traffic to whichever web server is working. The load balancer can be a much weaker and therefore cheaper machine.

Infrastructure with additional servers and traffic separation.

Let us now calculate the SLA for this infrastructure. The two web servers are connected in parallel, so that tier fails only when both servers fail at the same time. The formula looks like this:

SLA = 0.999 × (1 - (1 - 0.995)²) × 0.995 ≈ 99.4%

| Server | Cost | SLA | Unavailability |
|---|---|---|---|
| Load balancer | $100 | 99.9% | 43m 49.7s |
| WWW | $200 | 99.5% | 3h 39m 8.7s |
| WWW | $200 | 99.5% | 3h 39m 8.7s |
| DB | $200 | 99.5% | 3h 39m 8.7s |
| Total | $700 | 99.4% | 4h 22m 58.5s |
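The parallel part of the formula can be sketched the same way. The helpers `series` and `parallel` are illustrative names, and the model assumes that the two web servers fail independently of each other, which real hardware sharing a rack or power feed may not do.

```python
from math import prod


def series(*parts: float) -> float:
    """All parts must work: multiply their availabilities."""
    return prod(parts)


def parallel(*parts: float) -> float:
    """At least one part must work: 1 minus the chance that all fail.

    Assumes the parts fail independently of each other.
    """
    return 1 - prod(1 - p for p in parts)


load_balancer = 0.999
web_tier = parallel(0.995, 0.995)  # two web servers side by side
database = 0.995

total = series(load_balancer, web_tier, database)
print(f"{total:.2%}")  # 99.40%
```

The web tier on its own reaches about 99.9975%, so the overall figure is now dominated by the single load balancer and the single database.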

Over 4 hours of unavailability is still a lot, but the result is better than in the previous example. Looking at the architecture, we still have elements whose failure causes a big problem: a single database and a single load balancer. If either one goes down, the remaining servers become unreachable or useless.

So let’s introduce redundancy at each level and see the SLA and the price of the infrastructure.

Internet shop based on redundant servers in every layer.
SLA = (1 - (1 - 0.999)²) × (1 - (1 - 0.995)²) × (1 - (1 - 0.995)²) ≈ 99.99%

| Server | Cost | SLA | Unavailability |
|---|---|---|---|
| Load balancer | $100 | 99.9% | 43m 49.7s |
| Load balancer | $100 | 99.9% | 43m 49.7s |
| WWW | $200 | 99.5% | 3h 39m 8.7s |
| WWW | $200 | 99.5% | 3h 39m 8.7s |
| DB | $200 | 99.5% | 3h 39m 8.7s |
| DB | $200 | 99.5% | 3h 39m 8.7s |
| Total | $1200 | 99.99% | 4m 23.0s |

In this way, a satisfactorily high SLA was achieved. Unfortunately, the price is also high: three times that of the original two-server setup. And remember that this is only a simulation, which omits many elements such as administrative support or the availability of the network itself.
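Putting the three architectures side by side makes the trade-off between price and availability easier to see. This is a rough sketch using the costs and per-server SLAs from the tables above; the helper names are hypothetical and independent failures are assumed throughout.

```python
from math import prod


def series(*parts: float) -> float:
    """All parts must work."""
    return prod(parts)


def parallel(*parts: float) -> float:
    """At least one part must work (independent failures assumed)."""
    return 1 - prod(1 - p for p in parts)


LB, WEB, DB = 0.999, 0.995, 0.995  # per-server SLAs from the tables

# Each entry: (monthly cost in dollars, availability of the whole setup)
SETUPS = {
    "two servers": (400, series(WEB, DB)),
    "redundant web tier": (700, series(LB, parallel(WEB, WEB), DB)),
    "redundant everywhere": (
        1200,
        series(parallel(LB, LB), parallel(WEB, WEB), parallel(DB, DB)),
    ),
}

baseline_cost = SETUPS["two servers"][0]
for name, (cost, availability) in SETUPS.items():
    ratio = cost / baseline_cost
    print(f"{name}: {availability:.2%} for ${cost} ({ratio:.2f}x baseline)")
# two servers: 99.00% for $400 (1.00x baseline)
# redundant web tier: 99.40% for $700 (1.75x baseline)
# redundant everywhere: 99.99% for $1200 (3.00x baseline)
```

Swapping in your own costs and per-server SLAs lets you test other layouts, for example three web servers with a single database.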

Summary

When deciding on a solution with a specific availability, consider the real impact of every hour of downtime on your business. For a small blog, simple hosting is enough. For a large web application serving thousands of customers, redundant solutions are the better choice.

We can also count on luck and hope that a single server will always work. But failures are like falling onto a hard sidewalk from a great height: they hurt.