System failures happen. Hardware dies, software crashes, capacity gets exceeded, and any of these things can cause unexpected effects in even the most carefully architected systems.
At Heroku, we deal with complex systems failures. We’re running a platform as a service: our whole business model requires us to handle operations for our customers so they don’t have to. We run over a million Postgres databases, tens of thousands of Redis instances, and hundreds of thousands of dynos on thousands of AWS instances.
What do we get out of these incidents? Pain and suffering? Yes, sometimes. We also get data about how our systems are actually working. We get ideas for making them work better. And sometimes we get ideas for whole new products.
In this talk, I’ll discuss how to take the bad of a system failure and turn it into good: better products, more reliable platforms, and less stressed engineers.