Outages are no fun for an SRE. They directly impact your company’s brand value and market capitalization. For example, Facebook reportedly lost about $7 billion over the six-hour outage it suffered.
In this case, FB lost access to the servers it needed to apply a fix, and engineers had to visit the data centers to make changes physically. Facebook has also released an update here. I expect more to come in the coming months, and probably an OSS project as well 😅
How do you Detect, Debug & root cause an outage like this?
For starters, it is hard to predict what could go wrong, given the increasing complexity of the products and designs we use.
You could use several monitoring tools to detect changes happening in your network. Some of them are proactive in reporting a change.
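As a rough illustration of what "detecting a change" can mean in practice, here is a minimal sketch that compares two snapshots of what a set of hostnames resolve to and reports any difference. The names `resolve`, `snapshot`, and `detect_changes`, as well as the example hostname, are made up for this post. A real monitoring tool watches far more than DNS, so treat this as a toy under those assumptions.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def resolve(hostname: str) -> Optional[str]:
    """Return one IPv4 address for hostname, or None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def snapshot(
    hostnames: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Dict[str, Optional[str]]:
    """Record what each hostname currently resolves to."""
    return {host: resolver(host) for host in hostnames}


def detect_changes(
    previous: Dict[str, Optional[str]],
    current: Dict[str, Optional[str]],
) -> List[str]:
    """Compare two snapshots and describe every difference."""
    alerts = []
    for host in sorted(set(previous) | set(current)):
        old, new = previous.get(host), current.get(host)
        if old == new:
            continue
        if new is None:
            alerts.append(f"{host}: no longer resolves (was {old})")
        elif old is None:
            alerts.append(f"{host}: now resolves to {new}")
        else:
            alerts.append(f"{host}: changed from {old} to {new}")
    return alerts


if __name__ == "__main__":
    # Hard-coded snapshots (hypothetical host) so the demo needs no network.
    before = {"app.example.internal": "10.0.0.5"}
    after = {"app.example.internal": None}
    for alert in detect_changes(before, after):
        print(alert)
```

In a real setup you would take a snapshot on a schedule, keep the previous one around, and send the alerts to whatever paging system you already use. The `resolver` parameter exists so the comparison logic can be tested without touching the network.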
Apart from these, proper Chaos Engineering practices help in detecting probable process-, design-, and product-level bugs. FB calls these “storm” drills.
How can you do these “storm” drills at your org? Are these available in a templatized form to detect common issues in architecture?
You can refer to ChaosMonkey. LitmusChaos, in particular, is designed with Kubernetes in mind and has a fantastic community.
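To make the shape of a drill concrete, here is a toy, in-process sketch of the usual loop: check the steady state, inject a fault, check whether the system still serves requests, then roll back. It does not use the API of ChaosMonkey or LitmusChaos. `Replica`, `Service`, and `run_drill` are hypothetical names invented for this post, and the "fault" is just a flag flipped in memory.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    healthy: bool = True


@dataclass
class Service:
    replicas: List[Replica]

    def handle_request(self) -> bool:
        """A request succeeds if at least one replica is healthy."""
        return any(replica.healthy for replica in self.replicas)


def run_drill(service: Service, rng: random.Random) -> dict:
    """Take one replica down, check the service still answers, roll back."""
    # 1. Verify the steady state before injecting anything.
    if not service.handle_request():
        raise RuntimeError("service unhealthy before the drill; aborting")

    # 2. Inject the fault: mark one randomly chosen healthy replica as down.
    victim = rng.choice([r for r in service.replicas if r.healthy])
    victim.healthy = False
    try:
        # 3. Check the hypothesis: the service survives losing one replica.
        survived = service.handle_request()
    finally:
        # 4. Always roll back, even if the check itself raised.
        victim.healthy = True

    return {"victim": victim.name, "survived": survived}


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a drill can be replayed
    redundant = Service([Replica("a"), Replica("b"), Replica("c")])
    single = Service([Replica("only")])
    print(run_drill(redundant, rng))
    print(run_drill(single, rng))
```

The second service has a single replica, so the drill reports that it did not survive. That is the kind of design-level finding a drill is meant to surface before a real outage does.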