Tuesday, July 12 • 11:40 - 12:20
Alerting for Distributed Systems—A Tale of Symptoms and Causes, Signals and Noise


Noisy alerts are the deadly sin of monitoring. They obfuscate real issues and cause pager fatigue. Instead of reacting with the due sense of urgency, the person on call will start to skim or even ignore alerts, not to mention the damage to their sanity and work-life balance. Unfortunately, there are many monitoring pitfalls on the road to complex production systems, and most of them result in noisier alerts. In distributed systems, and in particular in a microservice architecture, local failure modes are usually well understood, while the behavior of the system as a whole is difficult to reason about. Thus, it is tempting to alert on the many possible causes; after all, finding the root cause of a problem is important. However, a distributed system is designed to tolerate local failures, and a human should only be paged for real or imminent problems of a service, ideally aggregated into one meaningful alert per problem. The definition of a problem should be clear and explicit rather than relying on some kind of automatic “anomaly detection”. Detecting imminent problems does, however, require taking historical trends into account. Those predictions should be simple rather than “magic”. Alerting because “something seems weird” is almost never the right thing to do.
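
As an illustration of what a simple, explicit prediction might look like, here is a minimal sketch that is not taken from the talk. It fits a least-squares line through recent samples of a gauge (for example, free disk space) and pages only if the extrapolated value crosses an explicit threshold within a given horizon. The function names, the sample data, and the thresholds are all hypothetical; Prometheus offers a `predict_linear` query function in the same spirit.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, gauge value)


def extrapolate_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Fit a least-squares line through the samples and return the value
    it predicts horizon_s seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_s)


def should_page(samples: Sequence[Sample], horizon_s: float,
                threshold: float = 0.0) -> bool:
    """Page only if the trend predicts the gauge at or below the
    threshold within the horizon (hypothetical alert condition)."""
    return extrapolate_linear(samples, horizon_s) <= threshold


# Hypothetical data: free space shrinking by 100 units per minute.
free_space = [(0.0, 1000.0), (60.0, 900.0), (120.0, 800.0), (180.0, 700.0)]
page_soon = should_page(free_space, horizon_s=300.0)   # 5 minutes ahead
page_later = should_page(free_space, horizon_s=600.0)  # 10 minutes ahead
```

The condition is spelled out in one line and can be checked by hand, which is the property the abstract asks for: the person on call can see exactly why the alert fired.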

SoundCloud's long way from noisy pagers to much saner on-call rotations will serve as a case study, demonstrating how different monitoring technologies, among them most notably Prometheus, have affected alerting.


Björn Rabenstein

Production Engineer, SoundCloud Ltd.
Björn is a Production Engineer at SoundCloud and a Prometheus developer. Previously, he was a Site Reliability Engineer at Google and a number cruncher for science.

Tuesday July 12, 2016 11:40 - 12:20
Pembroke Room
