outage

Failure is an option

As Software Engineers, we often tend to be overly optimistic about software. In particular, it often happens that we underestimate the probability of systems and components failures and the impact this kind of events can have on our applications. We usually tend to dismiss failure events as random, unlikely and sporadic. And, often, we are proven wrong. Systems do fail indeed. Moreover, when something goes wrong, either it’s barely noticeable, or it leads to extreme consequences. Take the example of the recent AWS outage: everything was caused by a mistake during a routine network change. Right now, some days after the event, post-mortem analyses and survival stories count in the dozens. There is one recurring lesson that can be learned from what happened.