In the early days of Heroku Postgres, low pager volume was a sign of broken monitoring, not a healthy fleet. Running customer databases on AWS, we needed an automated way to resolve routine failures, so we slowly grew a state-machine-based framework to handle issues with server and service availability, full disks, failed backups, failed server boots, stuck EBS volumes, and other incidents that would otherwise wake an engineer. That framework became the basis for writing flexible and robust incident resolution automation, and it now powers High Availability and Disaster Recovery for the Heroku Postgres and Heroku Redis services.
This talk will cover:
- An overview of how to run Postgres and Redis in the cloud
- An incomplete survey of what can go wrong in the cloud and how we fix it
- An introduction to state machines
- How to convert playbooks into state machines
- Proper panics and circuit breakers for your automation
- Limits of automation
- Limits of people
- What we would change if we could do it again
- How to continue scaling up and out
Greg Burek is a senior engineer with the Heroku Data team, which runs Heroku Postgres and Heroku Redis. He is a core contributor to, and on the pager rotation for, the automated incident resolution system described in this talk.