
Monday, July 11
 

08:00

Morning Coffee & Tea
Monday July 11, 2016 08:00 - 09:00
Pre-Function Area

09:00

Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
The principles and constructs of DevOps are pervading the industry and have lit the path toward executing with speed and quality in balance while managing hockey-stick growth. Most companies and organizations are nowhere near the goal state. As this audience knows, rubbing SRE and DevOps principles into existing companies and service code bases involves cultural engineering as well as deep technical investment. There is no larger enterprise cloud and consumer company in the world than Microsoft, and we are now on the journey of investing heavily in the shift to SRE. The core of Microsoft was not born in the cloud: it was born in the first generation of consumer computing (the PC) and then rose to dominate the enterprise. A handful of services at Microsoft were born in the cloud and have scaled massively, with Bing Search as the granddaddy, carrying 12 years of experience from which we have learned much. Microsoft Azure is the enterprise cloud. We are making an enormous investment to run at galactic scale, to be the infrastructure for the world's infrastructure, and to compete relentlessly with top competitors. In this talk we will compare and contrast the journey within Bing with the current state of execution, and show how we are taking the lessons from that experience, inspiration from the industry, and learnings to date as we build SRE within the Microsoft enterprise services (Azure) that were not born in the cloud but have had enormous market success and are the future of our company.

Speakers

Greg Veith

Greg is the Director of the Azure Site Reliability Engineering team in Azure, Microsoft’s cloud infrastructure, a multi-billion-dollar business that is the foundation of the company’s service offerings. Azure is deployed across all geographies and used worldwide by millions of customers. In addition to first-party workloads such as Office 365, Bing, and Dynamics, Azure is leveraged by multi-billion-dollar Fortune 500 companies, small and medium-size…


Monday July 11, 2016 09:00 - 09:40
Lansdowne+Pembroke

09:40

Doorman: Global Distributed Client Side Rate Limiting
Doorman is a Google-developed system for global distributed client-side rate limiting. We are in the process of open-sourcing it. With Doorman, an arbitrary number of globally distributed clients can coordinate their usage of a shared resource so that global usage does not exceed global capacity.

This presentation:
  • Describes the fundamentals of the Doorman system
  • Explains the concepts of the RPC protocol between Doorman components
  • Shows code examples of Doorman configurations and clients
  • Shows graphs of how Doorman clients ask for and get capacity, and how this sums up globally
  • Explains how Doorman deals with spikes, clients going away, servers going away
  • Explains Doorman's system reliability features
  • Points to the Doorman open source repository
  • Explains the Doorman simulation (in Python) which can be used to quickly verify Doorman's behaviour in a specific scenario
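The core idea can be illustrated with a toy sketch (this is not Doorman's actual API or protocol; the class and method names below are hypothetical): a central server tracks what each client wants, and when total demand exceeds the global capacity, every client's grant is scaled down proportionally so the sum of grants never exceeds capacity.

```python
class CapacityServer:
    """Toy model of global client-side rate limiting: apportions a shared
    global capacity among clients by proportional fair share."""

    def __init__(self, global_capacity):
        self.global_capacity = global_capacity
        self.wants = {}  # client_id -> requested capacity

    def request(self, client_id, wants):
        """Record a client's demand and return its current grant."""
        self.wants[client_id] = wants
        return self._grant(client_id)

    def _grant(self, client_id):
        total_wanted = sum(self.wants.values())
        if total_wanted <= self.global_capacity:
            return self.wants[client_id]  # everyone gets what they asked for
        # Oversubscribed: scale each client's grant down proportionally.
        share = self.global_capacity / total_wanted
        return self.wants[client_id] * share


server = CapacityServer(global_capacity=100)
server.request("client-a", 50)
server.request("client-b", 150)
print(server._grant("client-a"))  # 25.0 (50 of 200 wanted -> 1/4 of 100)
print(server._grant("client-b"))  # 75.0
```

The real system additionally handles leases expiring, clients and servers going away, and spike smoothing, which this sketch deliberately omits.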

Speakers

Jos Visser

Jos Visser has been working in the field of reliable and highly available systems since 1988. Starting as a systems programmer (MVS) at a bank, Jos's >25-year career has seen him working with a variety of mission-critical systems technologies, including Stratus fault-tolerant systems, HP MC/ServiceGuard, Sun Enterprise Cluster, and Linux LifeKeeper. Jos joined Google in 2006 as an engineer in the Maps SRE team. Since then he has worked in a…


Monday July 11, 2016 09:40 - 10:20
Lansdowne+Pembroke

10:20

Break with Refreshments
Monday July 11, 2016 10:20 - 11:00
Pre-Function Area

11:00

Panel: What is SRE?
Moderators

John Looney

SREconEU Program Chair

Monday July 11, 2016 11:00 - 11:40
Lansdowne+Pembroke

11:00

Data Center Networks: The Rip van Winkle Edition
Limited Capacity seats available

If Rip Van Winkle had gone to sleep around 2006 and woken up 10 years later, he'd find the world a strange brew of the new and the old. He'd be amazed that phones had grown a brain, and dismayed that a most excellent rendition of the Dark Knight had wandered back to the wasteland, as most Dark Knight capers do. People had warmed up to electric cars, but not to climate change. And if Ol' Rip were a network operations guy at one of the large webscale companies, he might think he'd died and woken up in heaven. Networks were no longer slow as molasses to deploy, manage, and upgrade. He'd find some things had stayed the same (IPv4 still ruled the roost), and some others not so much. He would be puzzled by the terminology and the discussions as he wandered the hallways: SDN, open networking, OpenFlow, microservices, Ansible, Puppet, Kubernetes, and so on.

This tutorial is an attempt to bring folks up to speed on what's happened with networking in the past 10 years or so, especially in the data center, concluding with some thoughts on why exciting times lie ahead. The talk will be roughly divided into the following sections:

  1. Who Moved My Network? What's causing all this turmoil in networking
  2. Solutions: Requirements, Terminology, Pros and Cons
  3. Changing Landscape: Network Topologies
  4. Changing Foundation: Network Protocols
  5. Changing Operations: Modern Operations
  6. Changing Residents: Modern applications and their implications on networks
  7. Reading Tea Leaves

The tutorial will include demos and hands on work with some modern tools.

The audience is expected to be aware of basic networking (bridging, routing, broadcast, multicast etc.).

The key takeaways from this talk will be:

  • An understanding of the forces behind the changes in data center networking
  • The morphology and physiology of modern DC networks
  • What these changes presage for the future

Some preliminary ideas for hands on work:

  • Build a multi-host container network
  • Build and configure an n×m Clos topology with BGP
  • Design a Clos topology for x servers, given certain box specifications

Speakers

Monday July 11, 2016 11:00 - 17:00
Ulster

11:00

Staring into the eBPF Abyss
Limited Capacity seats available

eBPF (extended Berkeley Packet Filters) is a modern kernel technology that can be used to introduce dynamic tracing into a system that wasn't prepared or instrumented in any way. The tracing programs run in the kernel, are guaranteed to never crash or hang your system, and can probe every module and function -- from the kernel to user-space frameworks such as Node and Ruby.

In this workshop, you will experiment with Linux dynamic tracing first-hand. First, you will explore BCC, the BPF Compiler Collection, which is a set of tools and libraries for dynamic tracing. Many of your tracing needs will be answered by BCC, and you will experiment with memory leak analysis, generic function tracing, kernel tracepoints, static tracepoints in user-space programs, and the "baked" tools for file I/O, network, and CPU analysis. You'll be able to choose between working on a set of hands-on labs prepared by the instructors, or trying the tools out on your own test system.

Next, you will hack on some of the bleeding edge tools in the BCC toolkit, and build a couple of simple tools of your own. You'll be able to pick from a curated list of GitHub issues for the BCC project, a set of hands-on labs with known "school solutions", and an open-ended list of problems that need tools for effective analysis. At the end of this workshop, you will be equipped with a toolbox for diagnosing issues in the field, as well as a framework for building your own tools when the generic ones do not suffice.


Speakers

Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft C# MVP and Azure MRS, a Pluralsight author, and an international consultant and trainer. Sasha is a book author, a prolific blogger and open source contributor, and author of numerous training courses including .NET Debugging, .NET Performance, Android Application Development, and Modern C++. His consulting work revolves mainly around distributed architecture, production debugging and…


Monday July 11, 2016 11:00 - 17:00
Munster

11:40

Building and Running SRE Teams
General Stanley McChrystal led the Joint Special Operations Task Force in Iraq in the mid-to-late 2000s. While in command of the Task Force, he was responsible for transforming an organization dominated by Taylorist reductionism into an agile, responsive network that could dynamically adapt and win in the threat landscape around it. In his book Team of Teams: New Rules of Engagement for a Complex World, he outlines the key learnings that emerged from that process. The same issues and challenges face site reliability engineers and managers of SRE teams as we cope with the complexity of our own and partner ecosystems. In this talk, I will highlight the key points from Team of Teams and show how the solutions that helped make the Task Force successful can be applied to make SRE teams succeed too.

Outline:
Taylorism: Efficiency and Command Structure
Teams: Purpose Over Procedure
Shared Awareness: Democratizing Information - Everyone "Needs to Know"
Empowered Execution: "Eyes on - Hands Off"
Lead Like a Gardener: Fostering and Cultivating Organizational Culture

Speakers

Kurt Andersen (LinkedIn)

Program Committee, LinkedIn
Kurt Andersen has been active in the anti-abuse community for over 15 years and is currently the senior IC for the Consumer Services SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M³AAWG.org). He has spoken at M³AAWG, Velocity, SREcon (US & EU), and SANOG on various aspects of reliability, authentication, and security. You can also blame me for the…


Monday July 11, 2016 11:40 - 12:20
Lansdowne+Pembroke

12:20

Conference Luncheon
Monday July 11, 2016 12:20 - 13:40
Sussex Restaurant

13:40

The Production Engineering Lifecycle: How We Build, Run, and Disband Great Reliability-focused Teams
Engineers focused on reliability and scalability under real-world conditions are a scarce resource in any organization. How do we know where to deploy them, and how do we use them in the best possible way? In Facebook's Production Engineering team, we have this problem all the time, and we've dealt with it a variety of ways throughout the years. Some of these ways have worked better than others, and we'd like to share what works and what hasn't.

In this talk, we will share our approaches to when to start a production engineering team, how to integrate that team into the existing development team, how to prioritize and divide work between engineers, and even when to disband or merge the team. We will also discuss practical matters such as how we divide on call responsibilities and roadmap items, and how we integrate engineers in multiple locations and time zones.

Speakers

Andrew Ryan

Andrew Ryan is a founding member of the Production Engineering team at Facebook, where he has worked since 2009. In between stints of being on call and fixing his broken unit tests, he has worked on and helped bootstrap a number of teams. I'm a frequent speaker at internal events at Facebook, but haven't done any speaking outside in the past few years. I spoke at the Hadoop Summit conference in 2011, but there doesn't seem to be any…


Monday July 11, 2016 13:40 - 14:20
Pembroke Room

14:20

How to Improve Your Service by Roasting It

In many companies, including Microsoft, SRE is not yet an integrated part of the operational landscape; instead, it is being actively adopted by mature companies. Our team has been working to develop new and interesting ways to introduce SRE and its tenets to an organization with many different operational approaches, ranging from IT Ops to DevOps.

The process of introducing SRE has proven to be quite complex and socially delicate: you can't just go to a team and tell them they are doing things wrong. You need to find the right way to show developers all the warts on their baby and motivate them to work with you on addressing them. Furthermore, you have to deal with their earnest desire to treat you as "just another ops team" that is only there to take the pager from them.
One of the tools we've used to enable the right conversations is to hold what we call a Service Roast. Named after the famous Friars Club roasts, the goal is to establish a safe environment to dig into and expose those warts, wrinkles, design flaws, shortcomings, and problems everyone knows a service has but doesn't want to talk about. We can't help you if you won't tell us where it hurts.

To perform the Service Roasts, we've developed a process, ground rules, a new role of impartial moderator, and some useful structure for hosting this kind of meeting. Thus far we've gained great insight into some of our services and, more importantly, sparked some very interesting (and lively) conversations.

To be sure, this is a high-risk activity and shouldn't be done without careful consideration of the teams participating. We'll present what we've learned about holding these roasts, the guidance teams need for successful participation, and (importantly) why we don't use this approach everywhere.


Speakers

Jake Welch

Principal Software Engineer, Microsoft
Jake Welch is a Site Reliability Engineer on the Microsoft Azure team in NYC. He has worked on large-scale services at Microsoft for eight years, primarily in Azure infrastructure and Storage, in software engineering/operational/managerial roles and on the major disaster on-call team. In 2014, he started the first SRE pilot within Azure. Prior to Microsoft, Jake worked as a developer building websites and automating backend business workflows…


Monday July 11, 2016 14:20 - 14:40
Pembroke Room

14:20

Flash Sale Engineering
From stores with ads in the Super Bowl to selling Kanye’s latest album, Shopify has built a name for itself handling some of the world’s largest flash sales. These high-profile events generate write-heavy traffic that can be four times our platform’s baseline throughput and don’t lend themselves to off-the-shelf solutions.

This talk is the story of how we engineered our platform to survive large bursts of traffic. Since it’s not financially sound for Shopify to have the required capacity always running, we built queueing and page caching layers into our Nginx load balancers with Lua. To guarantee these solutions worked, we tested them with a purpose-built load testing service.

Although flash sales are unique to commerce platforms, the lessons we learn from them are applicable to any services that experience bursts of traffic.
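The queueing layer described above (implemented at Shopify in Lua inside Nginx) can be sketched in plain Python; the class and names here are illustrative, not Shopify's code. The idea: admit up to a fixed number of concurrent requests, and make overflow wait in FIFO order instead of overwhelming the application servers.

```python
from collections import deque

class AdmissionQueue:
    """Toy model of a flash-sale queueing layer in front of an app tier."""

    def __init__(self, capacity):
        self.capacity = capacity  # max concurrent requests to admit
        self.active = 0
        self.waiting = deque()

    def arrive(self, request_id):
        """A new request hits the load balancer."""
        if self.active < self.capacity:
            self.active += 1
            return "admitted"
        self.waiting.append(request_id)  # overflow waits in FIFO order
        return "queued"

    def finish(self):
        """An admitted request completed; admit the next waiter, if any."""
        if self.waiting:
            return self.waiting.popleft()  # admitted; active count unchanged
        self.active -= 1
        return None

q = AdmissionQueue(capacity=2)
print(q.arrive("r1"), q.arrive("r2"), q.arrive("r3"))  # admitted admitted queued
print(q.finish())  # r3 is admitted next
```

A production version would add timeouts, a page-cache fast path for reads, and per-store fairness, but the admission-control core is this small.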

Speakers

Emil Stolarsky

Production Engineer, Shopify
Emil is a production engineer at Shopify where he works on performance, the production pipeline, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's shivering over a spiked cup of coffee in the great Canadian north.


Monday July 11, 2016 14:20 - 14:40
Lansdowne

14:40

What SRE Means in a Start-up
Speakers

Brian Scanlan

Engineering Manager, Intercom


Monday July 11, 2016 14:40 - 15:00
Pembroke Room

14:40

Managing Up and Sideways as an SRE
Ever have a bad manager? Or have a project go off the rails but feel powerless to stop the trainwreck? I'll talk about why knowing a little bit about management can help you as an individual contributor or tech lead, and about a few ways you can help yourself and your SRE team without ever formally becoming a manager yourself.

Speakers

Liz Fong-Jones

Liz is a Senior Site Reliability Engineer at Google and manages a team of SREs responsible for Google's storage systems. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.


Monday July 11, 2016 14:40 - 15:00
Lansdowne

15:00

Break with Refreshments
Monday July 11, 2016 15:00 - 15:40
Pre-Function Area

15:40

Tier1 Metamorphoses
One of LinkedIn’s key cultural values is career transformation: helping the people you manage build new abilities and skills, working with them to define their career goals, and supporting their efforts to accomplish them. Applying this to a Tier 1 support team is challenging.

A Tier 1 support team manages the day-to-day operations of your business and engages higher tiers when needed. They end up with a very wide field of view but very little depth of knowledge. They are always the bearers of bad news and are only noticed when something is broken. The morale of such teams is notoriously low. Furthermore, capitalizing on this experience for the business is a challenge because of retention issues stemming from low morale. This was LinkedIn in 2013.

Today, we have transformed our Tier 1 into the foundation of our SRE organization, an incubator for our SREs. Our objective was to add depth to their breadth: they are part of the resolution instead of just passing on bad news, their work is more valued, and they have gained the trust of higher tiers. As a result, team morale is at an all-time high. Investing in automation, training, and mentorship was the key to their transformation. This is LinkedIn today.

This session will discuss our roadblocks, learnings and achievements.

Speakers

Nina Mushiana

As the NOC manager, I envisioned and executed this transformation, and I would like to share my experience from the journey. I have given multiple talks within the organisation, and this proposal was also accepted for SREcon in the US.


Monday July 11, 2016 15:40 - 16:20
Pembroke Room

15:40

Capacity Planning at Scale
Have you ever bought machines? What if you need to build entire datacenters? How can you predict how many you are going to need two years from now? How can you make efficient use of all the resources you suddenly have? What if you are missing some resources? Can we automate all of this and integrate it with continuous delivery?

These are just a few of the questions anyone planning a large computer fleet asks. This talk will cover some of the approaches and tooling that can be used to effectively plan for the demand of services and to cover it in the most efficient manner.

Speakers

Monday July 11, 2016 15:40 - 16:20
Lansdowne

16:20

Panel: Brownfield SRE
Moderators
Monday July 11, 2016 16:20 - 17:00
Pembroke Room

16:20

Load Shedding—Approaches, Principles, Experiences, and Impact in Service Management
This talk covers the experience gained developing load-shedding solutions at large scale, and their impact on service management.

Speakers

Acacio Cruz

Director - Frameworks & Production Platforms, Google
Google SRE Manager since 2007, Frameworks Eng. Director mid-2016


Monday July 11, 2016 16:20 - 17:00
Lansdowne

17:30

Conference Reception, Sponsored by Google
Sponsors

Google

Gold Sponsor
Google is a global technology leader focused on improving the ways people connect with information. Google's innovations in web search and advertising have made its website a top internet property and its brand one of the most recognized in the world. For more information, visit www.google.com/about.html.


Monday July 11, 2016 17:30 - 19:00
Herbert

20:00

Open Source Distributed Load Balancing
Presented by Stefan Safar, Seznam.cz

Monday July 11, 2016 20:00 - 21:00
Ulster
 
Tuesday, July 12
 

08:00

Morning Coffee & Tea
Tuesday July 12, 2016 08:00 - 09:00
Pre-Function Area

09:00

Incident Response @ FB, Facebook's SEV process
Facebook is famous for its MOVE FAST AND BREAK THINGS motto. An important part of moving fast while sustaining reliable systems is to fail fast. This talk presents Facebook's strategy for incident response and root cause analysis, called the *Site Event (SEV) Process*. We'll describe everything from incident triage to remediation, paying special attention to our desire to fix things quickly and to avoid having the same outage twice.

Speakers

Gareth Eason

Technical Program Manager, Facebook
Gareth works as a Technical Program Manager with Facebook, focusing on designing and building their growing global network and content delivery infrastructure. Combining experience of systems architecture with networking and telecoms, Gareth has worked with Nokia, Cable & Wireless, HEAnet and Google. Come ask about Linux systems, care and feeding of large systems, CDN infrastructure or the successful use of Raspberry Pis for things they were…


Tuesday July 12, 2016 09:00 - 09:40
Lansdowne

09:00

The Many Ways Your Monitoring is Lying To You

Monitoring and dashboarding systems are crucial to understanding the behavior of large distributed systems. But monitoring systems can lead you on wild goose chases, or hide issues. In this talk, I will look at some examples of how a monitoring system can lie to you – in order to sensitize the audience to these failure modes and encourage them to look for similar examples in their own systems.


Speakers

Sebastian Kirsch

Sebastian Kirsch joined Google as a Site Reliability Engineer in 2006. He has worked on Google's web crawler, the click log processing systems, and Google Maps, as well as supporting product and feature launches across most of Google's product areas, from web search to YouTube. He currently manages one half of the team that runs Google Calendar and Google Sites.


Tuesday July 12, 2016 09:00 - 09:40
Pembroke Room

09:00

Accident Models in Post Mortems
Limited Capacity seats available

Many organizations want to learn from failures, and postmortem debriefings and documents are part of that learning process. In this two-part session, we will cover the theory and fundamentals of complex systems failure and “human error,” as well as techniques for facilitating an adverse-event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.

Speakers

Will Gallego

Engineer, Etsy
I'm an engineer in Infrastructure at Etsy with focuses on database scaling, distributed workloads (typically with Gearman), and delivering optimized photos to our end users. I'm a proponent of a free and open web, blame awareness in incident investigations, and pronouncing gif with a soft "g".

Miriam Lauter

Software Engineer, Etsy
I'm a software engineer on Etsy's payments team and a summer 2014 Recurse Center alum. Outside work, I'm an avid rock climber and 99pi podcast listener.


Tuesday July 12, 2016 09:00 - 10:20
Munster

09:00

Statistics for Engineers
Limited Capacity seats available

Gathering telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes to make sense of it and extract valuable information, like:
  • Is the system down?
  • Is user experience degraded for some percentage of our customers?
  • How did our query response times change with the last update?
Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you in your daily work as an SRE. We will cover probabilistic models and summarizing distributions with mean values, quantiles, and histograms, and their relations.

The tutorial focuses on practical aspects, and will give you hands-on knowledge of how to handle, import, analyze, and visualize telemetry data with UNIX tools and the IPython toolkit.
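As a taste of the hands-on material, mean values, quantiles, and a crude histogram can be computed from raw telemetry with nothing but the Python standard library (the latency numbers below are made up for illustration):

```python
import statistics

# Hypothetical request latencies in milliseconds.
latencies = [12, 15, 14, 13, 250, 16, 14, 15, 13, 17]

mean = statistics.mean(latencies)
p50 = statistics.median(latencies)
# quantiles(n=100) yields the 1st..99th percentile cut points.
p95 = statistics.quantiles(latencies, n=100)[94]

print(f"mean={mean:.1f}ms median={p50}ms p95={p95:.1f}ms")
# The single 250 ms outlier drags the mean far above the median, which is
# one reason quantiles and histograms summarize latency better than means.

# A crude fixed-bucket histogram:
buckets = {"<20ms": 0, "20-100ms": 0, ">=100ms": 0}
for v in latencies:
    if v < 20:
        buckets["<20ms"] += 1
    elif v < 100:
        buckets["20-100ms"] += 1
    else:
        buckets[">=100ms"] += 1
print(buckets)  # {'<20ms': 9, '20-100ms': 0, '>=100ms': 1}
```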

This tutorial has been given on several occasions over the last year and has been refined and extended since; cf. Twitter #StatsForEngineers.

Speakers

Heinrich Hartmann

Analytics Lead, Circonus
As Analytics Lead at Circonus, Heinrich is driving the development of analytic methods that transform monitoring data into actionable information. He has talked about Statistics for Engineers at various tech conferences in the last year: SRECon15, Velocity Amsterdam 2015, LISA15, Monitorama16. Before joining Circonus, Heinrich worked as a researcher and independent consultant. He earned his PhD in Mathematics from the University of Bonn…


Tuesday July 12, 2016 09:00 - 12:20
Ulster

09:40

Production Improvement Review (PIR)
Azure SRE works with services of widely variable maturity, ranging from fully federated DevOps teams to fully tiered IT/Ops teams, and everything in between. The one thing all of these services have in common is that they have outages. While they all recover and respond in different ways, SRE has to collect and leverage data in a common manner across all services to prevent outages and drive reliability up consistently. In this talk we’ll discuss how SRE leverages diverse data sets to drive improvements across this heterogeneous set of services. SRE ensures that teams are rigorously completing post-incident reviews and addressing their live-site debt. We not only look at the actual repair debt, but have also introduced a new concept called “virtual debt,” which shows where a service’s incident response faltered but no appropriate repair was logged. Virtual debt is affectionately referred to as “Pac-Man debt” due to the appearance of the chart: the greater the virtual debt, the bigger the bite.

We’ll also discuss how we expose the data in near-real-time dashboards that allow team members, from the director all the way down to the IC, to see relevant views and take the appropriate action. ICs can find incomplete postmortems they need to work on, a service director can view their accumulated debt to prioritize resources, and a dev manager can review virtual debt to ensure the team is conducting rigorous postmortems. By analyzing historical outages, we’ve found that missed detection leads to an exponential increase in mitigation times. We’ve collected a myriad of other insights by mining historical outage data and using charts and creative visualizations to surface them, including the surprising proxy metrics we’ve discovered that influence uptime, and some specific examples of actions we’ve taken to improve service quality based on the data.

Speakers

Martin Check

Principal Engineering Manager, Microsoft
I work on problem management, incident management, and delivering services for SRE. I'm particularly interested in uplifting services from traditional operating models into Devops/SRE operating models, collating and analyzing data on service health to identify themes, and developing chatbots, dashboards, and monitoring and correlation.


Tuesday July 12, 2016 09:40 - 10:20
Lansdowne

09:40

Practical Anomaly Detection and Alerting
This talk will debunk the common belief that in order to solve more advanced monitoring use cases and get more complete alerting coverage, we need complex, often math-oriented solutions such as machine learning and stream processing.

Instead we will set a clear context, weigh the pros and cons of such approaches, and zoom in on how we can get dramatically better alerting, and make our lives a lot easier, by using familiar concepts understandable to everyone, such as basic logic, basic math, and metric metadata, even for solving complicated alerting problems.

We will also see how we can optimize the overall experience of adjusting and maintaining alerting rules over time by focusing on the concept of an alerting IDE, exemplified by Bosun. The talk will present techniques and concrete examples of how to implement advanced alerting scenarios using these principles.

Speakers

Dieter Plaetinck

My ops experience is mostly from working at Netlog and Vimeo. For the last few years I've been working 100% on open-source monitoring. For past talks (including LISA in Seattle), see http://dieter.plaetinck.be/talks/


Tuesday July 12, 2016 09:40 - 10:20
Pembroke Room

10:20

Break with Refreshments
Tuesday July 12, 2016 10:20 - 11:00
Pre-Function Area

11:00

The Next Linux Superpower: eBPF Primer

Imagine you're tackling one of these evasive performance issues in the field, and your go-to monitoring checklist doesn't seem to cut it. There are plenty of suspects, but they are moving around rapidly and you need more logs, more data, more in-depth information to make a diagnosis. Maybe you've heard about DTrace, or even used it, and are yearning for a similar toolkit, which can plug dynamic tracing into a system that wasn't prepared or instrumented in any way.

Hopefully, you won't have to yearn for a lot longer. eBPF (extended Berkeley Packet Filters) is a kernel technology that enables a plethora of diagnostic scenarios by introducing dynamic, safe, low-overhead, efficient programs that run in the context of your live kernel. Sure, BPF programs can attach to sockets; but more interestingly, they can attach to kprobes and uprobes, static kernel tracepoints, and even user-mode static probes. And modern BPF programs have access to a wide set of instructions and data structures, which means you can collect valuable information and analyze it on-the-fly, without spilling it to huge files and reading them from user space.

In this talk, we will introduce BCC, the BPF Compiler Collection, which is an open set of tools and libraries for dynamic tracing on Linux. Some tools are easy and ready to use, such as execsnoop, fileslower, and memleak. Other tools such as trace and argdist require more sophistication and can be used as a Swiss Army knife for a variety of scenarios. We will spend most of the time demonstrating the power of modern dynamic tracing -- from memory leaks to static probes in Ruby, Node, and Java programs, from slow file I/O to monitoring network traffic. Finally, we will discuss building our own tools using the Python and Lua bindings to BCC, and its LLVM backend.


Speakers

Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft C# MVP and Azure MRS, a Pluralsight author, and an international consultant and trainer. Sasha is a book author, a prolific blogger and open source contributor, and author of numerous training courses including .NET Debugging, .NET Performance, Android Application Development, and Modern C++. His consulting work revolves mainly around distributed architecture, production debugging and…


Tuesday July 12, 2016 11:00 - 11:40
Lansdowne

11:00

The Structure and Interpretation of Graphs
Limited Capacity full

Speakers

Niall Murphy

Instigator/editor/author/etc. of the Google SRE book


Tuesday July 12, 2016 11:00 - 11:40
Pembroke Room

11:00

Post Mortem Facilitation
Limited Capacity seats available

Speakers

Will Gallego

Engineer, Etsy
I'm an engineer in Infrastructure at Etsy with focuses on database scaling, distributed workloads (typically with Gearman), and delivering optimized photos to our end users. I'm a proponent of a free and open web, blame awareness in incident investigations, and pronouncing gif with a soft "g".

Miriam Lauter

Software Engineer, Etsy
I'm a software engineer on Etsy's payments team and a summer 2014 Recurse Center alum. Outside work, I'm an avid rock climber and 99pi podcast listener.


Tuesday July 12, 2016 11:00 - 12:20
Munster

11:40

The Virtuous Cycle: Getting Good Things out of Bad Failures
System failures happen. Hardware dies, software crashes, capacity gets exceeded, and any of these things can cause unexpected effects in the most carefully-architected systems.

At Heroku, we deal with complex systems failures. We’re running a platform as a service: our whole business model requires us to provide operations for our customers so they don’t have to do it themselves. We run over a million Postgres databases, tens of thousands of Redis instances, and hundreds of thousands of dynos on thousands of AWS instances.

What do we get out of these incidents? Pain and suffering? Yes, sometimes. We also get data about how our systems are actually working. We get ideas for making it work better. And sometimes we get ideas for whole new products.

In this talk, I’ll discuss how to take the bad of a system failure and turn it into good: better products, more reliable platforms, and less stressed engineers.

Speakers

Joy Schamen

SRE Director, Heroku


Tuesday July 12, 2016 11:40 - 12:20
Lansdowne

11:40

Alerting for Distributed Systems—A Tale of Symptoms and Causes, Signals and Noise
Noisy alerts are the deadly sin of monitoring. They obfuscate real issues and cause pager fatigue. Instead of reacting with the due sense of urgency, the person on-call will start to skim or even ignore alerts, to say nothing of the destruction of their sanity and work-life balance. Unfortunately, there are many monitoring pitfalls on the road to complex production systems, and most of them result in noisier alerts. In distributed systems, and in particular in a microservice architecture, there is usually a good understanding of local failure modes, while the behavior of the system as a whole is difficult to reason about. Thus, it is tempting to alert on the many possible causes – after all, finding the root cause of a problem is important. However, a distributed system is designed to tolerate local failures, and a human should only be paged on real or imminent problems of a service, ideally aggregated into one meaningful alert per problem. The definition of a problem should be clear and explicit rather than relying on some kind of automatic “anomaly detection”. Taking historical trends into account is needed, though, to detect imminent problems. Those predictions should be simple rather than “magic”. Alerting because “something seems weird” is almost never the right thing to do.

SoundCloud's long way from noisy pagers to much saner on-call rotations will serve as a case study, demonstrating how different monitoring technologies, among them most notably Prometheus, have affected alerting.
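
The "simple rather than magic" prediction advocated above can be as small as a least-squares extrapolation, in the spirit of Prometheus's `predict_linear()` function. A minimal sketch, with made-up sample numbers, of paging on the symptom (imminent disk full) rather than on individual causes:

```python
def predict_linear(samples, horizon):
    """Least-squares linear extrapolation of (t, value) samples to
    `horizon` seconds past the last sample (cf. Prometheus predict_linear)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    last_t = samples[-1][0]
    return mean_v + slope * (last_t + horizon - mean_t)

# Page only if the disk is predicted to be full within 4 hours,
# rather than alerting on every blip in the write rate.
samples = [(0, 100e9), (600, 99e9), (1200, 98e9)]  # (seconds, bytes free)
predicted_free = predict_linear(samples, horizon=4 * 3600)
page = predicted_free <= 0
```

At this decay rate (about 1 GB per 10 minutes from a 98 GB reserve), the extrapolation stays well above zero four hours out, so no one gets paged.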

Speakers

Björn Rabenstein

Production Engineer, SoundCloud
Björn is a production engineer at SoundCloud and one of the Prometheus core developers. Previously, he was a Site Reliability Engineer at Google and a number cruncher for science.



Tuesday July 12, 2016 11:40 - 12:20
Pembroke Room

12:20

13:40

Challenges of Machine Learning at Scale
Motivated by the problem of predicting whether any given ad would be clicked in response to a query, in this introductory talk we outline the requirements and large-system design challenges that arise when designing a machine learning system that makes millions of predictions per second with low latency, learns quickly from the responses to those predictions, and maintains a consistent level of model quality over time. We present alternatives for meeting those challenges using diagrams of machine learning pipelines.

Concepts used in this talk: machine learning (classification), software pipelines, sharding and replication, map-reduce
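
The sharding-and-replication concept can be sketched in a few lines: parameters are assigned to shards by a stable hash, and each shard is served by several replicas so a lookup can go to any one of them. This is an illustrative sketch only (the `ps-*` server names are hypothetical, not Google's):

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Stable assignment of a feature/parameter key to a shard."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_shards

# With replication, a lookup fans out to one replica of its shard:
N_SHARDS, N_REPLICAS = 4, 3
replicas = {s: [f"ps-{s}-{r}" for r in range(N_REPLICAS)]
            for s in range(N_SHARDS)}
shard = shard_for("query_term=shoes", N_SHARDS)
candidates = replicas[shard]  # any of these can answer the lookup
```

The stable hash keeps a given parameter on the same shard across requests, which is what lets the model be updated and served consistently.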

Speakers

Graham Poulter

SRE, Google
I work at Google Dublin as an SRE on machine learning pipelines used in Ads & Commerce, helping make them reliable and efficient, including not wasting human time on things like updating config and software. Originally from South Africa, I also facilitate technical training and enjoy philosophical dialogue.


Tuesday July 12, 2016 13:40 - 14:20
Lansdowne

13:40

Lightning Talks

API Management—Why Speed Matters
Arianna Aondio, Varnish Software

Reverse Engineering the “Human API” for Automation and Profit
Nati Cohen, SimilarWeb

What a 17th Century Samurai Taught Me about Being an SRE
Caskey L. Dickson, Microsoft

Chatops/Automation: How to get there while everything's on fire
Fran Garcia, Hosted Graphite

Sysdig Love
Alejandro Brito Monedero, Alea Solutions

Automations with Saltstack
Effie Mouzeli, Logicea, LLC

Myths of Network Automation
David Rothera, Facebook

DNS @ Shopify
Emil Stolarsky, Shopify

Hashing Infrastructures
Jimmy Tang, Rapid7


Speakers

Arianna Aondio

Field Engineer, Varnish Software
I'm Italian, living in Norway. Field engineer for Varnish Software, working on websites performances. I love cooking, travelling and skiing.

Fran Garcia

SRE, Hosted Graphite
Currently the SRE team lead at Hosted Graphite, Fran has previously been mostly responsible for causing (and occasionally preventing) outages in varied fields such as advertising, online gaming and sports betting. Do not ask him about chatops.

Effie Mouzeli

Systems Engineer, Logicea, LLC

David Rothera

Production Engineer, Facebook

Emil Stolarsky

Production Engineer, Shopify
Emil is a production engineer at Shopify where he works on performance, the production pipeline, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's shivering over a spiked cup of coffee in the great Canadian north.

Jimmy Tang

Rapid7



Tuesday July 12, 2016 13:40 - 15:00
Pembroke Room

13:40

Effective Design Review Participation
Limited Capacity seats available

This workshop is a part of the "full lifecycle" workshop track which includes Post-Mortems, Incident Response, and Effective Design Review Participation. Using several example cases, participants in this session will learn to apply a variety of different points of view to analyze a design for issues which could affect its reliability and operability.

The sample designs and playlist can be found at https://goo.gl/VIiN6i - now updated with the comments and suggestions that came in during the workshop.


Speakers

Kurt Andersen (LinkedIn)

Program Committee, LinkedIn
Kurt Andersen has been active in the anti-abuse community for over 15 years and is currently the senior IC for the Consumer Services SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M³AAWG.org). He has spoken at M³AAWG, Velocity, SREcon (US & EU), and SANOG on various aspects of reliability, authentication and security. You can also blame me for the... Read More →



Tuesday July 12, 2016 13:40 - 15:00
Munster

13:40

DivOps, Continuous Diversity at Scale
Limited Capacity seats available

This tutorial/workshop is aimed at management and individual contributors alike. We will work together on how to encourage and nurture a diversity culture in day-to-day ops teams. First we will discuss the concepts of 2- and 3-dimensional diversity, and the statistics on diverse teams' performance. Then we will map out how to design, build, deploy, and operate a diversity plan in our teams. This will include diversity goal setting and explicit cultural evolution, hiring processes, day-to-day communications, review processes, and team collaboration. Where possible, we will encourage groups to break out and evaluate their own cultures and processes.

Speakers

Tuesday July 12, 2016 13:40 - 17:00
Ulster

14:20

Panel: Oncall
Moderators

Laura Nolan

I am a SRE and tech lead at Google, working in our ads data infrastructure. I presented a workshop and a talk at SRECon Europe 2015, and have presented in workshops at other USENIX conferences (LISA, federated conferences) and FLOSS UK.

Tuesday July 12, 2016 14:20 - 15:00
Lansdowne

15:00

Break with Refreshments
Tuesday July 12, 2016 15:00 - 15:40
Pre-Function Area

15:40

Lessons from Automatic Incident Resolution for a Million Databases
In the beginning of Heroku Postgres, low pager volume was a sign of broken monitoring, not a healthy fleet. Running customer databases on AWS, we needed an automated way to resolve routine failures, so we slowly grew a state-machine-based framework to handle issues with server and service availability, filled disks, failed backups, failed server boots, stuck EBS volumes, and other incidents that would otherwise wake an engineer. That framework became the basis for writing flexible and robust incident resolution automation, and it has grown to now power High Availability and Disaster Recovery for the Heroku Postgres and Heroku Redis services.

This talk will cover:
- An overview of how to run Postgres and Redis in the cloud
- An incomplete survey of what can go wrong in the cloud and how we fix it
- An introduction to state machines
- How to convert playbooks into state machines
- Proper panics and circuit breakers for your automation
- Limits of automation
- Limits of people
- What we would change if we could do it again
- How to continue scaling up and out
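
The playbook-to-state-machine conversion can be sketched very compactly: each state names a check, and each check result names the next state, so a runbook step like "if the server is unreachable, reboot it; if it is still unreachable, replace it" becomes a small transition table. This is an illustrative toy, not Heroku's actual framework:

```python
from typing import Callable, Dict, Tuple

# state -> (check function, {check result: next state})
StateTable = Dict[str, Tuple[Callable[[dict], str], Dict[str, str]]]

def run(machine: StateTable, start: str, ctx: dict, max_steps: int = 10) -> list:
    """Walk the machine until a terminal state, recording the path taken."""
    path, state = [start], start
    for _ in range(max_steps):
        if state not in machine:          # terminal state, e.g. "resolved"
            return path
        check, transitions = machine[state]
        state = transitions[check(ctx)]
        path.append(state)
    # A proper panic: automation that loops forever should page a human.
    raise RuntimeError("state machine did not terminate; escalating")

machine: StateTable = {
    "triage": (lambda c: "up" if c["reachable"] else "down",
               {"up": "resolved", "down": "reboot"}),
    "reboot": (lambda c: c.setdefault("rebooted", "done"),
               {"done": "verify"}),
    "verify": (lambda c: "up" if c["reachable_after_reboot"] else "down",
               {"up": "resolved", "down": "replace"}),
}

path = run(machine, "triage",
           {"reachable": False, "reachable_after_reboot": True})
```

The `max_steps` bound acts as a crude circuit breaker: automation that cycles without converging hands the incident back to a person.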

Speakers

Greg Burek

Engineer, Heroku
Greg Burek is a senior engineer with the Heroku Data team, which runs Heroku Postgres and Heroku Redis. He is a core contributor to, and on the pager rotation for, the automated incident resolution system referenced in this talk.


Tuesday July 12, 2016 15:40 - 16:00
Lansdowne

15:40

My Service Runs at 99.999%...All Those Tweets about Outages Are Not Real: It's Our Competition Trying to Malign Us!

Do you have services whose owners claim they run at five 9s, yet you often run into errors? It's very easy and convenient to build metrics at the service level, but these often hide a wide array of issues that users might face. Having the right metrics is a key component of building a sustainable SRE culture. This talk goes into the design of these metrics, with real-world examples to illustrate good and bad designs.


Speakers

Kumar Srinivasamurthy

Kumar is a 15 year veteran at Microsoft and has been in the online services world for several years. He currently runs the Bing Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale.


Tuesday July 12, 2016 15:40 - 16:00
Pembroke Room

15:40

Practical Incident Response
Limited Capacity seats available

This workshop is structured as a fast-moving but fun game (think fluxx crossed with a hectic oncall shift) but the subject matter is entirely serious: we will use it to explore best practices and pitfalls for managing incidents as a team. You will work as part of a team managing a production outage: we'll go through the entire process from detection of the incident, problem diagnosis, mitigation, and resolution, finishing with the first draft of the postmortem.

Speakers

Laura Nolan

I am a SRE and tech lead at Google, working in our ads data infrastructure. I presented a workshop and a talk at SRECon Europe 2015, and have presented in workshops at other USENIX conferences (LISA, federated conferences) and FLOSS UK.


Tuesday July 12, 2016 15:40 - 17:00
Munster

16:00

Moving a Large Workload from a Public Cloud to an OpenStack Private Cloud: Is It Really Worth It?
Speakers

Nicolas Brousse

TubeMogul
Nicolas Brousse is Senior Director of Operations Engineering at TubeMogul (NASDAQ: TUBE). The company's sixth employee and first operations hire, Nicolas has grown TubeMogul's infrastructure over the past seven years from several machines to over two thousand servers that handle billions of requests per day for clients like Allstate, Chrysler, Heineken and Hotels.com. Adept at adapting quickly to ongoing business needs and constraints... Read More →


Tuesday July 12, 2016 16:00 - 16:20
Lansdowne

16:00

Availability Objectives of SoundCloud’s Microservices
In a microservices architecture, different services usually have different availabilities. It is often hard to see how the availability of a single service affects the availability of the overall system. Without a clear idea about the availability requirements of individual services, even a seemingly subtle degradation of a service can cause a critical outage. Unfortunately these are discovered only after thorough post-mortems. At SoundCloud we kicked off a project called “Availability Objectives”. An availability objective is the minimum availability a service is allowed to have. These objectives are calculated based on the requirements of the clients of those services. We started by visiting all of our services and setting an availability objective for each of them. We built tools to expose the availability of these services and to flag the ones that drop below their objectives. As a result, we can now make informed decisions about the integration points we need to improve first. This talk will share the insights we gained via this project and how it affected our overall availability and engineering productivity.
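
Calculating an objective from the requirements of a service's clients comes down to simple probability arithmetic: a request that must traverse several dependencies in series is only as available as their product. A toy sketch with illustrative numbers (not SoundCloud's actual figures or tooling):

```python
import math

def serial_availability(deps):
    """Availability seen by a request that must traverse every dependency."""
    avail = 1.0
    for a in deps:
        avail *= a
    return avail

def nines(avail):
    """Express an availability as a 'number of nines'."""
    return -math.log10(1.0 - avail)

# Three services at 99.9% each, called in series:
overall = serial_availability([0.999, 0.999, 0.999])
```

Three "three-nines" dependencies in series yield only about 99.7% overall, which is exactly why a composite service with a 99.9% objective forces stricter objectives onto its dependencies.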

Speakers

Bora Tunca

Bora is a software developer at SoundCloud. He started his journey there three years ago. As a generalist, he has worked on various parts of their architecture. Nowadays he is part of the Core Engineering, where he helps to build and integrate the core business services of SoundCloud. When he's not juggling various languages, he's playing basketball - as long as someone on the team covers his on-call shifts...


Tuesday July 12, 2016 16:00 - 16:20
Pembroke Room

16:20

My Scariest Day: When Things Go All Wrong

Lightning Talks session


Moderators

Gareth Eason

Technical Program Manager, Facebook
Gareth works as a Technical Program Manager with Facebook, focusing on designing and building their growing global network and content delivery infrastructure. Combining experience of systems architecture with networking and telecoms, Gareth has worked with Nokia, Cable & Wireless, HEAnet and Google. Come ask about Linux systems, care and feeding of large systems, CDN infrastructure or the successful use of Raspberry Pis for things they were... Read More →

John Looney

SREconEU Program Chair

Tuesday July 12, 2016 16:20 - 17:00
Lansdowne

16:20

Downtime Budgets
The concept of the error budget is a great way to hack SLAs and make them into a positive tool for system engineers. But how can you take the same idea from a world that handles millions of transactions in a day to one that handles hundreds, but on the same hardware scale? High Performance Computing jobs run for hours, days, or weeks at a time, resulting in unique challenges related to system availability, maintenance, and experimentation. In this talk I will explore how we plan to modify the error budget concept to fit in an HPC environment by applying the same idea to cluster outages. With that specific example as the foundation, I will conclude with some thoughts on how the ideas generated in large scale web environments can be used in similarly sized environments running very different workloads.
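
The underlying arithmetic carries over directly: an availability target for a period implies a budget of outage hours that maintenance windows and experiments can then spend. A sketch of that calculation, with illustrative numbers:

```python
def downtime_budget(slo: float, period_hours: float) -> float:
    """Hours of allowed downtime per period under an availability SLO."""
    return (1.0 - slo) * period_hours

# A 99.9% quarterly target on an HPC cluster leaves roughly 2.16 hours
# of outage per quarter to spend on maintenance and experimentation.
quarter_hours = 90 * 24
budget = downtime_budget(0.999, quarter_hours)
```

Once the budget is explicit, a planned cluster outage stops being a failure and becomes a spend decision against the remaining balance.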

Speakers

Cory Lueninghoener

Cory Lueninghoener leads the HPC Design Group at Los Alamos National Laboratory. He has helped design, build, and manage some of the largest scientific computing resources in the world, including systems ranging in size from 100,000 to 900,000 processors. He is especially interested in turning large-scale system research into practice, and has worked on configuration management and system management tools in the past. Cory was co-chair of LISA... Read More →


Tuesday July 12, 2016 16:20 - 17:00
Pembroke Room

17:30

Happy Hour, Sponsored by Facebook
Tuesday July 12, 2016 17:30 - 18:30
Herbert

19:00

Tuesday BoFs
Ad hoc topics of interest - for current details, please see: https://www.usenix.org/conference/srecon16europe/bofs

BoFs may be scheduled in advance by contacting bofs@usenix.org with "SREcon16 Europe BoF" in the subject line and the following information in the body of the email:

  1. BoF title
  2. Organizer name and affiliation
  3. Date and time preference
  4. Brief description of BoF (optional)


Tuesday July 12, 2016 19:00 - 22:00
TBA
 
Wednesday, July 13
 

08:00

Morning Coffee & Tea
Wednesday July 13, 2016 08:00 - 09:00
Pre-Function Area

09:00

Past, Present, and Future of Network Operations

Historically, network operations has lacked the skills, the tools, and even the means to fully embrace automation or build abstractions for the rest of the organization to consume. However, the tide is changing, and most modern equipment now provides standard Linux tools or open APIs to interact with it.

In this talk, we will explore how to build network abstractions and leverage the experience gathered by the devops community over the years to expose the network to the organization, increase agility, and provide situational awareness.


Speakers

David Barroso

Network Systems Engineer, Fastly
David is a Network Systems Engineer at Fastly, where he spends his time managing the network with code and thinking of ways to integrate it with the application.



Wednesday July 13, 2016 09:00 - 09:20
Pembroke Room

09:00

Relieving Technical Debt through Short Projects
It's easy to plan out month-long or year-long projects, or to have an interrupts rotation for dealing with oncall and tickets, but how do you make sure you're also doing the short, week-long projects that can relieve your technical debt? I'll cover a planning approach my team found that makes room for all three kinds of work, reducing the long-term operational burden of the services we operate.

Speakers

Liz Fong-Jones

Liz is a Senior Site Reliability Engineer at Google and manages a team of SREs responsible for Google's storage systems. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.


Wednesday July 13, 2016 09:00 - 09:20
Lansdowne

09:00

Distributed Log-Processing Design Workshop
Limited Capacity seats available

Participants will have the opportunity to try their hand at designing a reliable, distributed, multi-datacenter near-real-time log processing system.

The session will start with a short presentation on lessons learned about designing reliable distributed systems, and then participants will break out in small groups, assisted by Google facilitators, and try their hand at solving a real-world design challenge, from high-level architecture down to an estimate of the computing resources required to run the service.

The session will likely appeal to experienced engineers who want to have fun tackling a real-world design problem faced by many teams in Google.


Speakers

Andrea Spadaccini

SRE Manager, Google


Wednesday July 13, 2016 09:00 - 12:40
Munster

09:00

Docker From Scratch
Limited Capacity seats available

Docker is very popular these days, but how many of us are really familiar with the basic building blocks of Linux containers and their implications? What's missing in the good ol’ chroot jails? What are the available Copy-on-Write options and what are their pros and cons? Which syscalls allow us to manipulate Linux namespaces and what are their limitations? How do resource limits actually work? What different behaviours do containers and VMs have?

In this hands-on workshop, we will build a small Docker-like tool from O/S level primitives in order to learn how Docker and containers actually work. Starting from a regular process, we will gradually isolate and constrain it until we have a (nearly) full container solution, pausing after each step to learn how our new constraints behave.

Speakers

Avishai Ish-Shalom

Avishai Ish-Shalom is a veteran Ops and a survivor of many production skirmishes. Avishai helps companies deal with web era operations and scale as an independent consultant. In his spare time Avishai is spreading weird ideas and conspiracy theories such as DevOps.


Wednesday July 13, 2016 09:00 - 14:40
Ulster

09:20

Bridging Multicast to the Cloud
As more organizations move their workloads to cloud providers, they may discover small gotchas that prevent them from easily running the same applications that they’re used to within a traditional on-premises environment. One such example is multicast: none of the big players support multicast traffic between nodes in their cloud offerings. The solution? Overlay a mesh virtual network - n2n - that supports multicast between nodes and can be extended to include on-prem systems. In this talk, we’ll go over how to implement n2n in a resilient, scalable fashion to link on-prem and cloud environments through a gateway.

Speakers

Daniel Emord

Lead Site Reliability Consultant, Pythian
Dan Emord has been designing and deploying multiplatform solutions for 8 years and is currently a Lead Site Reliability Consultant at Pythian. Dan runs the gamut of client requests with Pythian’s open scope engagements, including architecting and implementing a wide variety of technologies to solve problems such as network design, multiplatform system automation, and customizing open source tools.


Wednesday July 13, 2016 09:20 - 09:40
Pembroke Room

09:20

Running Storage at Facebook
The mission of the Data Warehouse Storage team at Facebook is to run an HDFS deployment that stores hundreds of petabytes reliably and efficiently.

Due to its stateful nature, there are some unique challenges to operating a storage system. How do we take a machine out of production for repair without compromising data availability? What trade-offs do we make between replication strategy and data availability to make sure we get more bang for the buck? How do we ensure that colocated tasks that run on our storage nodes exploit available resources as much as they can without tipping hosts over?

In this talk, we'll share some of the lessons we have learned from running HDFS at Facebook. We will discuss our biggest operational challenges and outline the evolution of the different solutions that we have put in place over time. We will also introduce Warm Storage, a novel block storage system that we built at Facebook to replace HDFS, and discuss how our learnings from HDFS have affected the design of Warm Storage.

Speakers

Federico Piccinini

Production Engineer, Facebook
Federico is a Production Engineer at Facebook and has been working for the past year and a half on large-scale block storage systems. Before Facebook, Federico helped run the storage infrastructure at Spotify. Likes: open source, large-scale distributed systems, and Broadway shows.


Wednesday July 13, 2016 09:20 - 09:40
Lansdowne

09:40

Full-Mesh IPsec Network: 10 Dos and 500 Don'ts
How do you secure your internal network when your servers are located on different continents with different providers and you don't trust your network?

IPsec is a great way to secure a network, but it's usually deployed as a way of connecting a small group of trusted networks, and both tools and existing documentation reflect this. That model is not really an option in environments where you don't control the network and want to interoperate across different providers, so you find yourself sailing through uncharted waters when trying to build a fully meshed network with IPsec, where each server can establish a secure connection to any other server in its cluster.

We wanted any of our servers around the world to be able to communicate securely with any other. We were using a peer-to-peer VPN, but it broke down badly at scale, so we chose to go with IPsec. It wasn't a smooth transition: the tools were terrible, the documentation was vague and incomplete, and we found some horrible bugs. But we survived, and we want to share some of the lessons we learned, what you definitely shouldn't do, and why you might want to do this anyway.

Speakers
avatar for Fran Garcia

Fran Garcia

SRE, Hosted Graphite
Currently the SRE team lead at Hosted Graphite, Fran has previously been mostly responsible for causing (and occasionally preventing) outages in varied fields such as advertising, online gaming and sports betting. Do not ask him about chatops.


Wednesday July 13, 2016 09:40 - 10:20
Pembroke Room

09:40

Linux Kernel Building, Testing and Deployment at Facebook
The kernel team at Facebook works on features and fixes for the upstream Linux community, as well as pulling in patches to apply to the kernels run in the Facebook production fleet. This is done to support new and upcoming hardware variations, fix standing issues in the environment, and improve performance. We aim to roll out a new kernel to a large portion of the fleet as often as possible.

In this talk, we will explore how the kernel PE team has worked to automate the build and install process, rolling out a canary of the newly built kernel every day and running thousands of tests to validate each kernel before we push it out to other tiers for upgrade. We run a series of integration tests across multiple hardware types and generations, do performance and correctness tests on the newly built kernels, and release each new kernel through multiple phases of release candidates and canary groups to gain confidence in the new builds. By doing this, we get a baseline for expectations when moving to the new kernel. We then work with individual tier owners to handle the upgrade, allowing for a regular kernel release and a sustainable support model compatible with all of the different services. All this work allows us to run kernels that are as close as possible to the upstream releases.

Speakers

Yannick Brosseau

Production Engineer, Facebook
Yannick Brosseau is a Production Engineer on the Kernel team at Facebook. As such he works on improving the stability and performance of the kernels deployed on the Facebook infrastructure and develops testing, monitoring and deployment tools to help in this endeavor. Previously, he was a Research associate at École Polytechnique de Montréal where he worked on performance analysis tools for Linux. He worked on several part of the LTTng project... Read More →

Phillip Duncan

Production Engineer, Facebook


Wednesday July 13, 2016 09:40 - 10:20
Lansdowne

10:20

Break with Refreshments
Wednesday July 13, 2016 10:20 - 11:00
Pre-Function Area

11:00

Scaling Shopify's Multi-Tenant Architecture across Multiple Datacenters
Multi-tenant architectures are a very convenient and economical way to share resources like web servers, job workers, and datastores among several customers on your platform. Even the smallest Shopify store on a $9/month plan can easily survive getting hammered with a 1M RPM flash sale by leveraging the resources of the entire platform. However, architectures like this can also have several drawbacks. They are potentially harder to scale and things like resource starvation or back-end outages are harder to isolate.

In this talk, I’m going to walk you through the history of how Shopify grew from being a small standard single-database single-datacenter Rails application to the multi-database multi-datacenter setup that we run today. We will talk about the advantages in terms of resiliency, scalability, and disaster recovery that this architecture gives us, how we got there, and where we want to go in the future.

You will learn about things like how to use the Border Gateway Protocol and Equal-Cost Multi-Path routing for implementing intra-datacenter high availability, how we implement our own load balancing algorithms, what it takes to prepare a Ruby on Rails application for a move like this, and how we do completely scripted datacenter failovers in a matter of seconds with no considerable downtime.

Speakers

Florian Weingarten

Originally from Germany, studied mathematics and computer science at RWTH-Aachen University. Did some research on cryptography and privacy in a previous life. Now working as an infrastructure engineer on the core architecture team at Shopify in Ottawa, Canada, poking holes into other people’s code.


Wednesday July 13, 2016 11:00 - 11:40
Pembroke Room

11:00

Extreme OS Kernel Testing

Fuzz testing has been used to evaluate the robustness of operating system distributions for over twenty years. Eventually, though, a fuzz test suite suffers from reduced effectiveness. The first obstacle is the pesticide paradox: as you fix the easy defects, it becomes difficult to find the remaining obscure ones. Also, the test execution time and the debug/fix cycle tend to be manual work that can take hours or even days of effort. During the presentation, a structured framework for creating new fuzz tests will be introduced, along with a competitive analysis approach used to minimize defect-reproduction complexity.
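
At its core, fuzzing is systematic corruption of otherwise valid input. A minimal mutation sketch (illustrative only, not the framework from the talk):

```python
import random

def mutate(data: bytes, n_flips: int = 4, seed=None) -> bytes:
    """Flip a few random bits in the input: the simplest fuzz mutation.
    A deterministic seed makes a failing input reproducible."""
    rnd = random.Random(seed)
    buf = bytearray(data)
    for _ in range(n_flips):
        i = rnd.randrange(len(buf))
        buf[i] ^= 1 << rnd.randrange(8)
    return bytes(buf)

# Generate a corpus of mutants from one valid input; each mutant would
# then be fed to the system under test.
corpus = b"GET /index.html HTTP/1.0\r\n\r\n"
mutants = [mutate(corpus, seed=s) for s in range(100)]
```

Recording the seed with each mutant is one way to attack the reproduction-complexity problem the abstract mentions: any crash can be replayed from its seed alone.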


Speakers

Kirk Russell

Production Engineer, Shopify
Kirk is currently a Production Engineer at Shopify, making sure that our docker image build system can keep up with 22 launches a day.


Wednesday July 13, 2016 11:00 - 11:40
Lansdowne

11:40

Leading a Team with Values
Having a small set of authentic, opinionated, collaboratively formed core values can be the magic ingredient to building a high performing, happy team. In this talk you'll hear the story of how, in the space of a few short months, the Intercom Ops team went from doing OK to AWESOME. We'll tell you about our core values, our values for creating values, our happiness metrics and finally, about how this approach can be applied to other teams.

Speakers

Rich Archbold

Director of Engineering, Intercom
Rich Archbold is an Engineering Director for Intercom, a highly successful and fast-growing Irish technology startup that provides customer communication software to Internet businesses. Intercom's mission is to make web business personal. At Intercom, Rich is responsible for Ops, Infrastructure, and product teams ranging from front-end development to platform technologies. Rich also leads a number of cross-functional people development... Read More →


Wednesday July 13, 2016 11:40 - 12:20
Pembroke Room

11:40

DNS: Old solution for modern problems

As infrastructure becomes more complex, dynamic, and diverse, service discovery becomes very important.

There are many solutions to this problem (thrift, rest.li, custom-zk, etc.), all of which require application changes, which precludes the use of off-the-shelf software.

We have applications at LinkedIn where it isn't practical to integrate with our internal service discovery systems. After some thought we decided that all of these applications do support a common service discovery system: our old friend DNS.

In this presentation, we'll talk about how we implemented a distributed, highly available, eventually consistent service discovery system, written in Go, on top of DNS. We'll talk about the design, implementation, and challenges encountered on the way to production.

We'll focus on:

  • Architecture
  • Extensibility
  • Availability
  • Operability

The Results:

  • Significantly reduced complexity
  • Dramatic decrease in convergence time
  • Ubiquitous service discovery
  • Leverage existing DNS infrastructure

Speakers

Rauf Guliyev

I am a Traffic SRE at LinkedIn, responsible for shuffling bits between devices around the world and LinkedIn's service infrastructure. I like to solve all kinds of engineering problems, so I spend my free time building an exoskeleton race kit car.
TJ

Thomas Jackson

https://www.linkedin.com/in/jacksontj


Wednesday July 13, 2016 11:40 - 12:20
Lansdowne

12:20

Conference Luncheon
Wednesday July 13, 2016 12:20 - 13:40
Sussex Restaurant

13:40

Active Fault Finding in Networks
One of the key principles of SRE is always knowing your service is broken before your customers notice. Network devices are typically black boxes that fail in mysterious ways. The traditional methods of network monitoring don't scale well, and they have a fundamental flaw: most network monitoring involves asking a network device whether it is dropping packets. If that device is fundamentally unhealthy, should you really trust what it tells you about what it is doing?

Facebook has taken a different approach to network monitoring, where we actively probe our datacenter networks from as many locations as possible to ensure the network is behaving as we expect. The interesting part comes when we detect loss between different hosts on the network. How do we discover which of the thousands of network devices and individual links is the root cause of this loss? By combining some magic requests with some basic math, we can automate that detection and triangulate the loss to a specific device that does not even know it is dropping packets.
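The triangulation idea in the abstract can be sketched with basic loss tomography. This is an illustrative toy, not Facebook's implementation (the topology and names are invented): record which probe paths saw loss, then score each link by the share of its probes that were lossy. A link crossed only by lossy paths stands out as the suspect.

```python
# Toy loss triangulation: given probe paths (as link lists) and the set of
# probes that observed loss, rank links by (lossy probes) / (all probes)
# crossing them and return the worst offender.
from collections import Counter

def suspect_link(paths, lossy):
    """paths: {probe_id: [link, ...]}; lossy: probe_ids that saw loss."""
    seen, bad = Counter(), Counter()
    for probe_id, links in paths.items():
        for link in links:
            seen[link] += 1
            if probe_id in lossy:
                bad[link] += 1
    # The link whose probes were most consistently lossy is the suspect.
    return max(seen, key=lambda link: bad[link] / seen[link])

paths = {
    "a->b": ["rack1-sw", "spine1", "rack2-sw"],
    "d->b": ["rack4-sw", "spine1", "rack2-sw"],
    "a->c": ["rack1-sw", "spine2", "rack3-sw"],
    "d->c": ["rack4-sw", "spine2", "rack3-sw"],
    "e->c": ["rack5-sw", "spine1", "rack3-sw"],
}
lossy = {"a->b", "d->b"}          # only paths through rack2-sw drop packets
culprit = suspect_link(paths, lossy)
```

Here every probe through `rack2-sw` was lossy (ratio 1.0), while `spine1` also carried a healthy probe (2/3), so `rack2-sw` is fingered even though it reports nothing wrong itself.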

Speakers
avatar for Richard Sheehan

Richard Sheehan

Production Engineer, Facebook
Production Engineer at Facebook with lots of Networking related experience. Spent 10 years at Amazon working on DNS, Load-balancing and CDN stuff. Never cried at my desk, not even once. Now build large scale network monitoring and fault isolation solutions at Facebook.


Wednesday July 13, 2016 13:40 - 14:20
Pembroke Room

13:40

The Knowledge: Towards a Culture of Engineering Documentation

For several years, Google's internal surveys identified the lack of trustworthy, discoverable documentation as the #1 problem impacting internal developer productivity. We're not alone: Stack Overflow's 2016 survey ranked "Poor documentation" as the #2 problem facing engineers.

Solving this problem is tough. It's not enough to build tooling; the culture needs to change. Google internal engineering is attacking the challenge three ways: Building a documentation platform; integrating that platform into the engineering toolchain; and building a culture where documentation - like testing - is accepted as a natural, required part of the development process.

In this talk, we'll share our learnings and best practices around both tooling and culture, the evolution of documentation, and some thoughts about how we can transition from the creation of documents towards an ecosystem where context-appropriate, trustworthy documentation is reliably and effortlessly available to the engineers that need it.


Speakers
RM

Riona MacNamara

Staff technical writer, Google


Wednesday July 13, 2016 13:40 - 14:20
Lansdowne

13:40

Lightning Talks
To sign up for a lightning talk, write on the board outside of Munster Suite.

Wednesday July 13, 2016 13:40 - 14:40
Munster

14:20

Fixing the Internet for Real-Time Applications (Games)
League of Legends (LoL) is not a game of seconds, but of milliseconds. In day-to-day life, two seconds fly by unnoticed, but in-game a two-second stun can feel like an eternity. In any single match of LoL, thousands of decisions made in milliseconds dictate which team scores bragging rights and which settles for “honorable opponent” points. The Internet, however, was not constructed for applications that run like this, essentially in real time.

This talk will discuss the steps Riot Games has taken, and will continue to take, to fix this fundamental problem with commodity Internet, with a specific focus on the work done to improve the experience of our European players.

Speakers
avatar for Adam Comerford

Adam Comerford

Senior Systems Engineer, Riot Games
Adam Comerford is currently a Senior Systems Engineer at Riot Games in Dublin and obsessed with improving the League of Legends experience for players in Europe (and beyond). He has a broad technical background spanning 15+ years and multiple disciplines including networking, distributed systems, NoSQL databases and more.


Wednesday July 13, 2016 14:20 - 14:40
Pembroke Room

14:20

Dropbox's Naoru: Bridging the Safety Gap from Scripts to Full Auto-Remediation
At Dropbox, to bridge the gap between “scripts” and “fully automatic” automation, we’ve introduced the concept of “Human Authorized Execution”. This means that a tool automatically finds problems and decides how to fix them, but a human operator must audit the tool’s decisions before the automation may run.

Why do we need this? Frankly, it’s terrifying to have automation run fully automatically. With a human involved, their subconscious can answer a really important question: Why might I NOT want to run this script? If we took a simple approach, for instance deploying a cron job to run our scripts whenever alerts fire, then we would lose that human’s sense of paranoia and danger.

At Dropbox, we’ve built an alert auto-remediation platform called Naoru, which forces us to build our automation in a way that adheres to these principles. In this talk we will discuss the thought process we bring towards building trustworthy automation, how Naoru forces our engineers to follow these philosophies, and how we’ve driven our infrastructure organization towards a culture of embracing trustworthy automation.
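The “Human Authorized Execution” idea above can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Naoru's actual API (all class and function names are invented): the remediation engine diagnoses the alert and proposes a concrete plan, but execution is refused until an operator has reviewed and approved it, leaving an audit trail.

```python
# Sketch of human-authorized remediation: the tool proposes a plan, a
# human must approve it, and only then may the plan execute.
class Plan:
    def __init__(self, description, action):
        self.description = description   # what the human reads and audits
        self._action = action
        self.approved = False
        self.approver = None

    def approve(self, operator):
        # The audit point: a human signs off before anything runs.
        self.approved = True
        self.approver = operator

    def execute(self):
        if not self.approved:
            raise PermissionError("refusing to run unapproved remediation")
        return self._action()

def remediate_full_disk(host):
    # Diagnosis and fix are chosen automatically; execution is deferred.
    return Plan(
        description=f"rotate and purge stale logs on {host}",
        action=lambda: f"purged logs on {host}",
    )

plan = remediate_full_disk("web042")
try:
    plan.execute()            # blocked: no human has signed off yet
except PermissionError:
    pass
plan.approve("oncall-sre")    # operator reviews plan.description, approves
result = plan.execute()       # now permitted, with an approver on record
```

Keeping the refusal in the execution path, rather than in the caller's discipline, is what preserves the operator's “why might I NOT want to run this?” moment.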

Speakers
DM

David Mah

I am the primary author of Naoru, Dropbox’s auto-remediation framework. This includes creating the concept, designing the architecture, writing the software, and bringing it out to several teams within Dropbox’s Infrastructure organization. I’ve also given a talk about Dropbox's internal block storage infrastructure at last year’s Bay Area SRECON. Here is the link... Read More →


Wednesday July 13, 2016 14:20 - 14:40
Lansdowne

14:40

Break with Refreshments
Wednesday July 13, 2016 14:40 - 15:00
Pre-Function Area

15:00

Data Privacy Legislation and the Impact on SRE
Speakers
JL

John Looney

SREconEU Program Chair


Wednesday July 13, 2016 15:00 - 15:40
Lansdowne+Pembroke

15:40

Techniques and Tools for a Coherent Discussion about Performance in Complex Architectures
Most applications today comprise separate networked services numbering in the tens to hundreds, especially with the growing popularity of microservices. Crossing the boundary between these services often means a change in team and even a change in programming language. In this session I will discuss the challenges this presents, why it is important to have a single engineering conversation about performance, and how we can accomplish this.

Speakers
TS

Theo Schlossnagle

Theo founded Circonus in 2010, where he now serves as Founder and CEO. He earned undergraduate and graduate degrees in computer science from Johns Hopkins University, researching resource allocation techniques in distributed systems during four years of post-graduate work. In 1997, Theo founded OmniTI, which has established itself as the go-to source for organizations facing today's most challenging scalability, performance and security... Read More →


Wednesday July 13, 2016 15:40 - 16:20
Lansdowne+Pembroke

16:20

Government Needs SRE
In 2013, Mikey Dickerson joined what became known as the “ad hoc” team, tasked with rescuing HealthCare.gov after its failed launch on October 1. In August 2014, President Obama established the United States Digital Service and appointed Mikey as its Administrator, to see whether the strategy that pulled HealthCare.gov out of the fire could be applied to other government problems. Now nearly two years old, with about 150 people spanning a network of federal agencies, the U.S. Digital Service has taken on immigration, education, veterans' benefits, and health data interoperability. It is helping agencies build effective government services and improve IT procurement by focusing on industry best practices and agile methodology, ultimately driving change in the largest institution in history. Prior to joining the U.S. Digital Service, Mikey worked as a Site Reliability Manager at Google.

Speakers
MD

Mikey Dickerson

In 2013, Mikey Dickerson joined what became known as the “ad hoc” team, tasked with rescuing HealthCare.gov after its failed launch on October 1. In August 2014, President Obama established the United States Digital Service and appointed Mikey to serve as the Administrator to see if the strategy that succeeded at pulling Healthcare.gov out of the fire could be applied to other government problems. Now nearly 2 years old and about... Read More →


Wednesday July 13, 2016 16:20 - 17:00
Lansdowne+Pembroke