Perspective on multi-cloud (part 3 of 3) — Availability and multi-cloud

Craig McLuckie
Heptio
May 16, 2017


In the first post in this series I talked about the cloud providers and investing in multi-public cloud strategies. The second post explored topics around on-premises and public cloud. In this post I dig into the topic of achieving availability in the cloud.

By way of background, my perspective on this topic is colored by years spent working at Google. Google has developed a pretty nuanced view on service availability. The dominant concern of the Google SRE (site reliability engineer) is achieving a negotiated SLA (service level agreement). They balance the practicalities of building and delivering an evolving technology with achieving an acceptable level of availability.

As companies large and small explore the topic of multi-cloud, the impact of single provider failures and the need to be resilient to provider outages come up. I hope to provide a useful framework for thinking about this.

Obviously there is a lot more to this topic than I can cover in one post, so we will focus here on the availability of the compute resources that make up a service. Availability considerations for storage systems are a topic for a future post.

Observation 1: It is really important to understand what availability actually means.

Given a top line availability goal for a service, there are two things that dominate the equation: MTTF (mean time to failure) and MTTR (mean time to recovery). The overall availability as observed by users of a service is approximated by the formula A = MTTF / (MTTF+MTTR).

If you are down for ten seconds a day for ten years, there is a chance no one will notice. A customer refreshes the page ten seconds after receiving an HTTP 404 or 503 (or whatever), picks up where they left off, and likely thinks nothing of it (assuming you don’t lose a bunch of state in the process: there is no forgiveness for a service outage that damages underlying data or state). If, however, you are down for a whole business day during office hours, chances are people are going to notice and talk about it. You may well lose business. Interestingly, over a 10 year timeframe both are examples of the same fundamental SLA: 99.99% availability (“four 9's”).
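To make the arithmetic concrete, here is a minimal sketch in Go that plugs both outage patterns into A = MTTF / (MTTF + MTTR). The downtime figures are the ones assumed in the two scenarios above (each daily blip is treated as one failure with a ten-second recovery); both land at roughly four 9's:

```go
package main

import "fmt"

// availability applies A = MTTF / (MTTF + MTTR), with both durations in seconds.
func availability(mttf, mttr float64) float64 {
	return mttf / (mttf + mttr)
}

func main() {
	const (
		day      = 24 * 60 * 60.0 // seconds in a day
		tenYears = 10 * 365 * day // seconds in ten years (ignoring leap days)
	)

	// Scenario 1: a ten-second blip every day.
	blip := availability(day-10, 10)

	// Scenario 2: one eight-hour business day of downtime in ten years.
	badDay := availability(tenYears-8*3600, 8*3600)

	fmt.Printf("ten-second daily blip:   %.4f%%\n", blip*100)   // ~99.9884%
	fmt.Printf("one business day in 10y: %.4f%%\n", badDay*100) // ~99.9909%
}
```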

It worries me that people building real world systems often fail to separate the regularity of service interruptions from the impact of a single interruption event. A relentless focus on driving up MTTF introduces complexities that can ironically increase MTTR considerably.

Friendly piece of advice 1: Don’t conflate MTTF and MTTR. In most environments you don’t have enough real world experience with your service to truly understand your MTTF. Unless you are particularly unlucky, you probably have not yet seen a five standard deviation outage event. MTTR is something that you can perfectly understand (because you can test it). Managing it down is smart.

For a lot of folks running non-business-critical workloads, a dead simple (even non-HA) clustering technology that can quickly be reconstituted from first principles is a sensible alternative to taking on the operational complexity of more highly available systems. It is smart to step back and figure out what you want to accomplish in terms of top line availability, and what you are willing to pay to go beyond that level. Reflexively focusing on the most theoretically available solution may introduce more operational complexity than you should be taking on.

Observation 2: Clustering technologies provide a good starting point for driving availability, but are still subject to the vagaries of zonal outages.

One of the lovely things about modern orchestration frameworks (like Kubernetes) is that they significantly reduce the MTTR for most application level outages. MTTF looks pretty much the same as it would running on a VM, but MTTR goes down from potentially hours (if operators are involved in recovery operations) to seconds in most cases. You can of course accomplish this in a traditional VM based flow, but even then the difference is pretty stark, since VM provisioning and configuration can take minutes versus the seconds it takes to bounce a container to a new node.

Many customers first experiencing Kubernetes are giddy because what would traditionally be a pageable “application component down” event just goes away (the cluster detects the failure and restarts the container). You can see a significant availability boost for a traditional application by containerizing it, making sure the health checking model is correct, and deploying it in a clustering environment.
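A lot of that boost hinges on the health checking model actually reflecting whether the app can serve traffic. As a minimal sketch (the dependency check here is a hypothetical placeholder), a liveness-style endpoint might look like this:

```go
package main

import "net/http"

// dependenciesHealthy is a placeholder: substitute a real check of whatever your
// application needs in order to serve traffic (database connection, cache, etc.).
func dependenciesHealthy() bool {
	return true
}

func main() {
	// An orchestrator probe (e.g. a Kubernetes liveness or readiness check) polls
	// this endpoint; a non-200 response tells the cluster to restart the container
	// or stop routing traffic to it, which is what turns an application failure
	// into a seconds-long MTTR instead of a page.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if !dependenciesHealthy() {
			http.Error(w, "dependency check failed", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})

	http.ListenAndServe(":8080", nil)
}
```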

Friendly piece of advice 2: for most common deployments, application level outages will dominate the MTTF equation. Think about using modern orchestration technologies to smooth over application outages, and isolate your deployment from individual infrastructure (node) failures.

You are however still subject to the realities of zone based outages, and have to develop a plan accordingly. There is a temptation to create dynamic systems that spread work across multiple zones and autonomously adapt to outages. For sophisticated users this represents a lot of promise, but adds risk.

Observation 3: It is important to understand the failure domains of your system, and recognize that any technology that federates across those domains introduces a potential for correlated outages.

Almost every significant outage event I have seen has been the result of operator error. As often as not, an operator has fat-fingered a command and pushed broken configuration to a large number of nodes. While federation technologies that stitch deployments across multiple failure domains solve one problem, they are a very sharp tool indeed. When a single control plane stretches across zones, the “blast radius” of such a mistake can now also stretch across zones.

Friendly piece of advice 3: A modest substitution of human toil for fancy automatic federation will often insulate you from operator driven correlated outages. Simpler systems in the real world will often achieve better overall availability, particularly when you factor in the MTTR part of the availability equation.

In the case of Kubernetes, the pattern I favor is to run two (or more) independent clusters in different failure domains (zones), with a professionally run (cloud provider) load balancer in front of them spreading load between them. Within each zone you implement an isolated scaling mechanism that allows each cluster to grow should all the load be delivered to a single zone. This introduces toil (yes, you are doubling the number of clusters you have to run and operate), but it also gives you the flexibility to run blue/green deployments, and the work being done in the community to manage down the operational complexity of a cluster will continue to make it easier to live with. More importantly, if an operator breaks one zone, a simple configuration update lets you recover with a very low MTTR.
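To make the shape of that pattern concrete, here is a toy sketch of the decision the provider-run load balancer is making for you (the cluster endpoints are hypothetical). The point is that each cluster is independent, and traffic simply shifts to whichever zones pass their health checks:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical ingress health endpoints for two independent clusters, each
// running in its own failure domain (zone).
var clusters = map[string]string{
	"zone-a": "https://a.example.com/healthz",
	"zone-b": "https://b.example.com/healthz",
}

// healthy reports whether a cluster's ingress answers its health check in time.
func healthy(url string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// The provider-run load balancer does essentially this on your behalf:
	// keep only the failure domains that pass their health checks in rotation.
	var inRotation []string
	for zone, url := range clusters {
		if healthy(url) {
			inRotation = append(inRotation, zone)
		}
	}
	fmt.Println("zones receiving traffic:", inRotation)
}
```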

Friendly piece of advice 4: When you use a technology that federates or spreads load across failure domains to achieve very low MTTRs, consider using something built and run by the pros (cloud providers). Build in controls (technical or procedural) to avoid pushing configurations that impact multiple failure domains simultaneously.
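One such procedural control is simply to never push a change to more than one failure domain at a time: roll it out zone by zone, with a bake period and a health check in between. A minimal sketch, where applyConfig and zoneHealthy are hypothetical stand-ins for your own tooling and monitoring:

```go
package main

import (
	"fmt"
	"time"
)

// applyConfig and zoneHealthy are hypothetical stand-ins for your real
// deployment tooling and monitoring.
func applyConfig(zone, config string) error {
	fmt.Println("applying", config, "to", zone)
	return nil
}

func zoneHealthy(zone string) bool {
	return true
}

// rollOut pushes a configuration change one failure domain at a time, letting it
// bake and checking health before moving on, so a bad change never reaches every
// zone at once.
func rollOut(zones []string, config string, bake time.Duration) error {
	for _, zone := range zones {
		if err := applyConfig(zone, config); err != nil {
			return fmt.Errorf("apply failed in %s: %w", zone, err)
		}
		time.Sleep(bake)
		if !zoneHealthy(zone) {
			return fmt.Errorf("halting rollout: %s unhealthy after change", zone)
		}
	}
	return nil
}

func main() {
	zones := []string{"zone-a", "zone-b"}
	if err := rollOut(zones, "frontend-config-v2", 10*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```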

Observation 4: Given the current state of the art, most users will achieve the best day-to-day top line availability by just picking a single public cloud provider and running their app on one infrastructure.

At the end of the day, for the vast majority of normal users, running in normal situations, cloud provider outages are not going to dominate the availability equation for applications. While multi-failure domain deployment (within a single cloud provider) is essential in many real world applications, the cost and complexity of running a single application across multiple public clouds won’t benefit the average user today. Some technically sophisticated organizations could achieve better general availability by running across clouds, but in reality the systems necessary to do this will introduce more problems than they solve.

Friendly piece of advice 5: Unless you are on the absolute bleeding edge of distributed systems design, and have deep operations DNA, you are not going to get better top line availability for a service by actively running it across multiple public cloud providers.

Observation 5: The world is a strange place and there are worse things that can happen than a cloud service going down for a day.

The cloud providers do a better job of managing security and availability and establishing disaster recovery protocols than any but the most wealthy, paranoid, and operationally mature organizations. There are two things that I do think about a fair bit, though (but then I tend to be a bit paranoid):

(1) Most of the outages we see are simple operator error. Services go down because someone makes a mistake. The providers are very good about understanding this and creating better controls each time it happens. Every service outage improves the cloud service. But what would happen if a smart bad actor managed to infiltrate the control plane of a cloud provider? To be clear, this is far less likely than the same happening to someone running their own on-premises software, but with systems this large and intricate it isn’t clear how quickly they could be restarted (the TTR could be very high indeed).

(2) We haven’t seen the final state of the cloud giants. Microsoft and Google are charging hard and each has unique advantages over the incumbent, if for no other reason than that they started later with public cloud and have the advantage of building from a newer technology base.

While I am hardly unbiased, I will close this series with one last piece of advice.

Last piece of friendly advice: The landscape is still evolving, and it seems likely that we haven’t seen the worst of all outage events, or even settled on which cloud provider yields the best overall availability. I don’t advocate actively running apps across multiple clouds today, since the technology needed isn’t built and battle tested; it likely creates more problems than it solves. It would, however, not hurt to have a playbook that allows you to get your critical services up and running in another cloud if you really, really needed to.

Consider: (1) betting on a framework that allows you to relatively quickly turn up a critical service somewhere else to create a business-survivable MTTR, (2) thinking about technology that supports the propagation of critical data to another cloud for safekeeping (sketched below), and (3) actually trying it from time to time.
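As a sketch of what (2) might look like, the core of the idea is just a scheduled, verified copy of your critical backups to somewhere outside your primary provider. The stores here are hypothetical in-memory stand-ins for real buckets at two providers:

```go
package main

import "fmt"

// Store is a minimal stand-in for a cloud bucket; real implementations would
// wrap whichever provider SDKs or CLIs you actually use.
type Store map[string][]byte

// replicate copies any backup that is missing from the second provider's bucket,
// so a restore over there is possible if you ever really need it.
func replicate(primary, secondary Store) int {
	copied := 0
	for key, blob := range primary {
		if _, ok := secondary[key]; !ok {
			secondary[key] = blob
			copied++
		}
	}
	return copied
}

func main() {
	primary := Store{
		"db-backup-2017-05-15": []byte("..."),
		"db-backup-2017-05-16": []byte("..."),
	}
	secondary := Store{
		"db-backup-2017-05-15": []byte("..."),
	}

	n := replicate(primary, secondary)
	fmt.Printf("copied %d backup(s) to the second provider\n", n)

	// Point (3) of the playbook: periodically restore from the secondary copy for
	// real, and time it; that number is your cross-cloud MTTR.
}
```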

Hope this was helpful. Do reach out to us at Heptio (inquiries@heptio.com) if you want to talk with us about this. Also follow @heptio on Twitter.
