Christian has been a software engineer for over 10 years and a tech person for more than 20. He is an occasional triathlete and avid cyclist. He lives and works in Chicago with his wife, son, and 3 cats. He enjoys reading, long walks on the beach, and other activities that help to get over the minimum character threshold.
Distributed systems have an emergent property of hidden instability, which typically requires a confluence of triggers and manifests as black-swan crashes.
What does it mean for a system (your app) to be metastable? How do you manage metastability, how do you diagnose it, and how do you develop strategies to mitigate the (often) catastrophic failures that result from metastable crashes?
As systems become more distributed and grow in complexity, there is often a tradeoff between speed and stability. Sometimes this tradeoff is explicit, as with tech debt, but sometimes you aren’t even sure it’s there. And sometimes that tradeoff can spectacularly blow up in your face.
By talking through a long-smoldering incident that manifested as several outages of varying severity, this talk will cover what we’re even talking about when we talk about metastability, what steps we took to mitigate the damaging effects, and how we are working to make it less likely to happen in the future.