Tuesday, July 12 • 09:40 - 10:20
Production Improvement Review (PIR)

Azure SRE works with services of widely varying maturity, ranging from fully federated DevOps teams to fully tiered IT/Ops teams, and everything in between. The one thing all of these services have in common is that they have outages. While they all respond and recover in different ways, SRE has to collect and leverage data in a common manner across all services to prevent outages and drive reliability up consistently. In this talk we’ll discuss how SRE leverages diverse data sets to drive improvements across this heterogeneous set of services. SRE ensures that teams are rigorously completing post-incident reviews and addressing their live site debt. We look not only at the actual repair debt; we have also introduced a new concept called “virtual debt”, which shows where a service’s incident response faltered but no appropriate repair was logged. Virtual debt is affectionately referred to as “PacMan debt” due to the appearance of the chart. The greater the virtual debt, the bigger the bite.

We’ll also discuss how we expose the data in near-real-time dashboards that allow team members, from the director all the way down to the IC, to see relevant views and take the appropriate action. ICs can find incomplete postmortems they need to work on, a service director can view the service’s accumulated debt to prioritize resources, and a dev manager can review virtual debt to ensure the team is conducting rigorous postmortems. By analyzing historical outages, we’ve found that missed detection leads to an exponential increase in mitigation times. We’ve collected a myriad of other insights by mining historical outage data and surfacing them through charts and creative visualizations, including surprising proxy metrics that influence uptime. We’ll also share some specific examples of actions we’ve taken to improve service quality based on the data.

Martin Check

Principal Engineering Manager, Microsoft
I work on problem management, incident management, and delivering services for SRE. I'm particularly interested in uplifting services from traditional operating models into DevOps/SRE operating models, collating and analyzing data on service health to identify themes, and developing...

Tuesday July 12, 2016 09:40 - 10:20 IST