Imagine you are a Site Reliability Engineer (SRE) tasked with checking whether an application deployment in your cluster is configured for high availability. Where would you start? What are the main aspects you should analyze?

In this article, we argue that most transient failures can be mitigated by…

Ali Kanso

Ali is a Senior Software Engineer at Microsoft. He previously worked for IBM Research and Ericsson Research. He holds a Ph.D. in computer engineering.
