Monday, July 11
 

08:00 IST

Morning Coffee & Tea
Monday July 11, 2016 08:00 - 09:00 IST
Pre-Function Area

09:00 IST

Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
The principles and constructs of DevOps are pervading the industry and have lit the path toward executing with both speed and quality in balance while managing hockey-stick growth. Most companies and organizations are nowhere near “goal state”. As this audience knows, working SRE and DevOps principles into existing companies and service code bases involves cultural engineering as well as deep tech investments. There is no larger enterprise cloud and consumer company in the world than Microsoft, and we are now investing heavily in the shift to SRE. The core of Microsoft was not born in the cloud but in the first generation of consumer computing (the PC), and then rose to dominate the enterprise. A handful of services at Microsoft were born in the cloud and have scaled massively, with Bing Search being the “granddaddy”: it has 12 years of experience behind it, from which we have learned much. Microsoft Azure is the Enterprise Cloud. We are making an enormous investment to run at galactic scale, to be the infrastructure for the world’s infrastructure, and to compete relentlessly in the market with top competitors. In this talk we will compare and contrast the journey within Bing with the current state of execution, and show how we are taking the lessons from that experience, inspiration from industry, and learnings to date as we build SRE within the Microsoft Enterprise services (Azure), which were not born in the cloud but have had enormous market success to date and are the future of our company.

Speakers

Greg Veith

Greg is the Director of the Azure Site Reliability Engineering team in Azure, Microsoft’s cloud infrastructure - a multi-billion-dollar business that is the foundation of the company’s service offerings. Azure is deployed across all geographies and used worldwide by millions of...


Monday July 11, 2016 09:00 - 09:40 IST
Lansdowne+Pembroke
  Plenary

09:40 IST

Doorman: Global Distributed Client Side Rate Limiting
Doorman is a Google-developed system for global distributed client-side rate limiting. We are in the process of open sourcing it. With Doorman, an arbitrary number of globally distributed clients can coordinate their usage of a shared resource so that the global usage does not exceed global capacity.

This presentation:
  • Describes the fundamentals of the Doorman system
  • Explains the concepts of the RPC protocol between Doorman components
  • Shows code examples of Doorman configurations and clients
  • Shows graphs of how Doorman clients ask for and get capacity, and how this sums up globally
  • Explains how Doorman deals with spikes, clients going away, servers going away
  • Explains Doorman's system reliability features
  • Points to the Doorman open source repository
  • Explains the Doorman simulation (in Python) which can be used to quickly verify Doorman's behaviour in a specific scenario
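
To make the coordination idea concrete, here is a minimal sketch in Python of one way a server could divide a fixed global capacity among clients. It is not Doorman's actual algorithm or API: the function name `proportional_share` and the client names are hypothetical, chosen only for illustration. When total demand fits within capacity, every client gets what it asked for; otherwise every request is scaled down by the same factor, so the grants never sum to more than the capacity.

```python
def proportional_share(capacity, wants):
    """Divide `capacity` among clients (illustrative only, not Doorman's algorithm).

    `wants` maps a client id to the capacity that client asked for.
    """
    total = sum(wants.values())
    if total <= capacity:
        # Enough for everyone: grant each request in full.
        return dict(wants)
    # Oversubscribed: scale every request by the same factor.
    scale = capacity / total
    return {client: want * scale for client, want in wants.items()}


# Hypothetical clients asking for 160 units of a resource that has 100.
grants = proportional_share(
    100.0, {"client-a": 80.0, "client-b": 40.0, "client-c": 40.0}
)
```

In a real deployment the clients would have to refresh their grants periodically, since demand changes and clients or servers can disappear; the sketch covers only a single allocation round.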

Speakers

Jos Visser

Jos Visser has been working in the field of reliable and highly available systems since 1988. Starting as a systems programmer (MVS) at a bank, Jos's career of more than 25 years has seen him working with a variety of mission critical systems technologies, including Stratus fault-tolerant systems...


Monday July 11, 2016 09:40 - 10:20 IST
Lansdowne+Pembroke

10:20 IST

Break with Refreshments
Monday July 11, 2016 10:20 - 11:00 IST
Pre-Function Area

11:00 IST

Panel: What is SRE?
Moderators

John Looney

SREconEU Program Chair

Monday July 11, 2016 11:00 - 11:40 IST
Lansdowne+Pembroke

11:00 IST

Data Center Networks: The Rip van Winkle Edition

If Rip Van Winkle had gone to sleep around 2006 and woken up 10 years later, he'd find the world a strange brew of the new and the old. He'd be amazed that phones had grown a brain, dismayed that a most excellent rendition of the Dark Knight had wandered back to the wasteland as most Dark Knight capers do. People had warmed up to electric cars, but not to climate change. And, if Ol' Rip were a network operations guy at some of the large webscale companies, he might think he'd died and woken up in heaven. Networks were no longer slow as molasses to deploy, manage, and upgrade. He'd find some things had stayed the same (IPv4 still ruled the roost), and some others not so much. He would be puzzled by the terminology and the discussions as he wandered the hallways: SDN, open networking, OpenFlow, microservices, Ansible, Puppet, Kubernetes, and so on.

This tutorial is an attempt to bring folks up to speed on what has happened with networking in the past 10 years or so, especially in the data center, concluding with some thoughts on why exciting times lie ahead. The talk will be roughly divided into the following sections:

  1. Who Moved My Network? What's causing all this turmoil in networking
  2. Solutions: Requirements, Terminology, Pros and Cons
  3. Changing Landscape: Network Topologies
  4. Changing Foundation: Network Protocols
  5. Changing Operations: Modern Operations
  6. Changing Residents: Modern applications and their implications on networks
  7. Reading Tea Leaves

The tutorial will include demos and hands on work with some modern tools.

The audience is expected to be aware of basic networking (bridging, routing, broadcast, multicast etc.).

The key takeaways from this talk will be:

  • An understanding of the forces behind the changes in data center networking
  • The morphology and physiology of modern DC networks
  • What these changes presage for the future

Some preliminary ideas for hands on work:

  • Build a multi-host container network
  • Build and configure an n x m CLOS topology with BGP
  • Design a CLOS for x number of servers given certain box specifications
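
The last exercise above comes down to a little arithmetic, which can be sketched in Python. This is only an illustration under stated assumptions, not material from the tutorial: a two-tier leaf-spine (folded CLOS) fabric, identical switches, half of each leaf's ports facing servers and the rest facing spines, and one link from every leaf to every spine. The function name `size_leaf_spine` is hypothetical.

```python
import math


def size_leaf_spine(servers, ports):
    """Size a non-blocking two-tier leaf-spine fabric (illustrative assumptions).

    Assumes identical switches with `ports` ports each, half of every leaf's
    ports facing servers, and one link per leaf-spine pair.
    """
    down = ports // 2          # server-facing ports per leaf
    spines = ports - down      # uplinks per leaf, one to each spine
    leaves = math.ceil(servers / down)
    if leaves > ports:
        # Each spine needs one port per leaf, so at most `ports` leaves fit.
        raise ValueError("too many servers for a two-tier fabric with these switches")
    return {"leaves": leaves, "spines": spines, "max_servers": ports * down}


# Hypothetical input: 1000 servers, 48-port switches.
plan = size_leaf_spine(servers=1000, ports=48)
```

With 48-port switches the fabric tops out at 48 × 24 = 1152 servers; beyond that, the design needs a third tier or an oversubscribed leaf layer.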

Speakers

Monday July 11, 2016 11:00 - 17:00 IST
Ulster

11:00 IST

Staring into the eBPF Abyss

eBPF (extended Berkeley Packet Filters) is a modern kernel technology that can be used to introduce dynamic tracing into a system that wasn't prepared or instrumented in any way. The tracing programs run in the kernel, are guaranteed to never crash or hang your system, and can probe every module and function -- from the kernel to user-space frameworks such as Node and Ruby.

In this workshop, you will experiment with Linux dynamic tracing first-hand. First, you will explore BCC, the BPF Compiler Collection, which is a set of tools and libraries for dynamic tracing. Many of your tracing needs will be answered by BCC, and you will experiment with memory leak analysis, generic function tracing, kernel tracepoints, static tracepoints in user-space programs, and the "baked" tools for file I/O, network, and CPU analysis. You'll be able to choose between working on a set of hands-on labs prepared by the instructors, or trying the tools out on your own test system.

Next, you will hack on some of the bleeding edge tools in the BCC toolkit, and build a couple of simple tools of your own. You'll be able to pick from a curated list of GitHub issues for the BCC project, a set of hands-on labs with known "school solutions", and an open-ended list of problems that need tools for effective analysis. At the end of this workshop, you will be equipped with a toolbox for diagnosing issues in the field, as well as a framework for building your own tools when the generic ones do not suffice.


Speakers

Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft Regional Director and MVP, Pluralsight and O’Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor...


Monday July 11, 2016 11:00 - 17:00 IST
Munster

11:40 IST

Building and Running SRE Teams
General Stanley McChrystal led the Joint Special Operations Task Force in Iraq in the mid to late 2000s. While in command of the Task Force, he was responsible for transforming an organization which was dominated by Taylorist reductionism into an agile, responsive network which could dynamically adapt and win in the threat landscape around them. In his book Team of Teams: New Rules of Engagement for a Complex World, he outlines the key learnings that emerged from that process. The same issues and challenges face site reliability engineers and managers of SRE teams as we cope with the complexity of our own and partner ecosystems. In this talk, I will highlight the key points from Team of Teams: New Rules of Engagement for a Complex World and show how the solutions that helped make the Task Force successful can be applied to make SRE teams succeed too.

Outline:
Taylorism: Efficiency and Command Structure
Teams: Purpose Over Procedure
Shared Awareness: Democratizing Information - Everyone "Needs to Know"
Empowered Execution: "Eyes on - Hands Off"
Lead Like a Gardener: Fostering and Cultivating Organizational Culture

Speakers

Kurt Andersen

Program Committee, LinkedIn
Kurt Andersen was one of the co-chairs for SREcon-Americas in 2017 and 2018. He has been active in the anti-abuse community for over 20 years and is currently the senior IC for the Product SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging...


Monday July 11, 2016 11:40 - 12:20 IST
Lansdowne+Pembroke

12:20 IST

Conference Luncheon
Monday July 11, 2016 12:20 - 13:40 IST
Sussex Restaurant

13:40 IST

The Production Engineering Lifecycle: How We Build, Run, and Disband Great Reliability-focused Teams
Engineers focused on reliability and scalability under real-world conditions are a scarce resource in any organization. How do we know where to deploy them, and how do we use them in the best possible way? In Facebook's Production Engineering team, we have this problem all the time, and we've dealt with it in a variety of ways throughout the years. Some of these ways have worked better than others, and we'd like to share what has worked and what hasn't.

In this talk, we will share our approaches to when to start a production engineering team, how to integrate that team into the existing development team, how to prioritize and divide work between engineers, and even when to disband or merge the team. We will also discuss practical matters such as how we divide on-call responsibilities and roadmap items, and how we integrate engineers in multiple locations and time zones.

Speakers

Andrew Ryan

Meta
Andrew has been a member of Facebook's Production Engineering team since 2009. He currently works as a member of the Traffic Infrastructure team, helping to make Facebook faster for everyone.


Monday July 11, 2016 13:40 - 14:20 IST
Pembroke Room

14:20 IST

How to Improve Your Service by Roasting It

In many companies, including Microsoft, SRE is not yet an integrated part of the operational landscape. Instead it is being actively adapted into mature companies. Our team has been working to develop new and interesting ways to introduce SRE and its tenets to an organization with many different operational approaches ranging from IT Ops to DevOps.

The process of introducing SRE has proven to be quite complex and socially delicate: you can't go into a team and just tell them they are doing things wrong. You need to find the right way to show a developer all the warts on their baby and motivate them to work with you on addressing them. Furthermore, you have to deal with their earnest desire to treat you as "just another ops team" who is only there to take the pager from them.

One of the tools we've used to enable the right conversations is to hold what we call a Service Roast. Named after the famous Friars Club roasts, the goal is to establish a safe environment to dig into and expose those warts, wrinkles, design flaws, shortcomings, and problems everyone knows a service has but doesn't want to talk about. We can't help you if you won't tell us where it hurts.

To run Service Roasts, we've worked out a process, ground rules, a new role of impartial moderator, and some useful structure for hosting this kind of meeting. Thus far we've gained great insight into some of our services and, more importantly, sparked some very interesting (and lively) conversations.

To be sure, this is a high-risk activity, and shouldn't be done without careful consideration of the teams participating, but we'll present what we've learned about holding these roasts, guidance teams need for successful participation, and (importantly) why we don't use this approach everywhere.


Speakers

Jake Welch

Principal Software Engineer, Microsoft
Jake Welch is a Site Reliability Engineer on the Microsoft Azure team in NYC. He has worked on large scale services for a decade, in both dev and operational roles. At Microsoft, he primarily works on infrastructure services with focus on Storage and Security.


Monday July 11, 2016 14:20 - 14:40 IST
Pembroke Room

14:20 IST

Flash Sale Engineering
From stores with ads in the Super Bowl to selling Kanye’s latest album, Shopify has built a name for itself handling some of the world’s largest flash sales. These high profile events generate write-heavy traffic that can be four times our platform’s baseline throughput and don’t lend themselves to off-the-shelf solutions.

This talk is the story of how we engineered our platform to survive large bursts of traffic. Since it’s not financially sound for Shopify to have the required capacity always running, we built queueing and page caching layers into our Nginx load balancers with Lua. To guarantee these solutions worked, we tested them with a purpose-built load testing service.

Although flash sales are unique to commerce platforms, the lessons we learn from them are applicable to any services that experience bursts of traffic.

Speakers

Emil Stolarsky

Production Engineer, Shopify
Emil is a production engineer at Shopify where he works on performance, scriptable load balancers, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's shivering over a spiked cup of coffee in the great Canadian north.


Monday July 11, 2016 14:20 - 14:40 IST
Lansdowne

14:40 IST

What SRE Means in a Start-up
Speakers

Brian Scanlan

Engineering Manager, Intercom


Monday July 11, 2016 14:40 - 15:00 IST
Pembroke Room

14:40 IST

Managing Up and Sideways as an SRE
Ever have a bad manager? Or have a project go off the rails but feel powerless to stop the trainwreck? I'll talk about why knowing a little bit about management can help you as an individual contributor or tech lead, and talk about a few ways that you can help yourself and your SRE team without ever formally becoming a manager yourself.

Speakers

Liz Fong-Jones

Developer Advocate, Activist, and Site Reliability Engineer, Google
Liz is a Staff Site Reliability Engineer at Google and works on the Google Cloud Customer Reliability Engineering team in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates...


Monday July 11, 2016 14:40 - 15:00 IST
Lansdowne

15:00 IST

Break with Refreshments
Monday July 11, 2016 15:00 - 15:40 IST
Pre-Function Area

15:40 IST

Tier1 Metamorphoses
One of LinkedIn's key cultural values is Career Transformation: helping the people you manage build new abilities and skills, working with them to define their career goals, and supporting their efforts to accomplish them. Applying this to a Tier1 support team is challenging.

A Tier1 support team manages the day-to-day operations of your business and engages higher tiers when needed. They end up with a very wide field of view but very little depth of knowledge. They are always the bearers of bad news and only noticed when something is broken. The morale of such teams is notoriously low. Furthermore, capitalizing on this experience for the business is a challenge because of retention issues stemming from low morale. This was LinkedIn in 2013.

Today, we have transformed our Tier1 into the foundation of our SRE organization as an incubator for our SREs. Our objective was to add depth to their breadth: they are part of the resolution instead of just passing on bad news, their work is more valued, and they have gained the trust of higher tiers. As a result, team morale is at an all-time high. Investing in automation, training, and mentorship was the key to their transformation. This is LinkedIn today.

This session will discuss our roadblocks, learnings and achievements.

Speakers

Nina Mushiana

SRE - Infrastructure, LinkedIn Corporation
Nina has been with LinkedIn for 8+ years. She joined LinkedIn as the NOC manager and then expanded her scope as Production SRE manager, responsible for BC/DR for LinkedIn along with Incident Mgmt and Availability. 2 years ago, Nina transitioned to Infosec Sec org and currently leading...


Monday July 11, 2016 15:40 - 16:20 IST
Pembroke Room

15:40 IST

Capacity Planning at Scale
Have you ever bought machines? What if you even need to build datacenters? How can you predict how many you are going to need two years from now? How can you make efficient use of all the resources you suddenly got? What if you are missing some resources? Can we automate all of this and integrate it with our continuous delivery?

These are just a few of the questions that anyone planning a large computer fleet has to ask. This talk will cover some of the approaches and tooling that can be used to effectively plan for the demand of services and how to cover it in the most efficient manner.

Speakers

Ramón Medrano Llamas

Senior Site Reliability Engineer, Google


Monday July 11, 2016 15:40 - 16:20 IST
Lansdowne

16:20 IST

Panel: Brownfield SRE
Moderators
Monday July 11, 2016 16:20 - 17:00 IST
Pembroke Room

16:20 IST

Load Shedding—Approaches, Principles, Experiences, and Impact in Service Management
This talk covers the experience gained in developing load-shedding solutions, and their impact on service management, at large scale.

Speakers

Acacio Cruz

Director - Frameworks & Production Platforms, Google
Acacio has been an SRE manager since 2007, and manager of Google's Load-shedding & Traffic Management team since 2009. He is now a SWE Director in Frameworks and Software Infrastructure.


Monday July 11, 2016 16:20 - 17:00 IST
Lansdowne

17:30 IST

Conference Reception, Sponsored by Google
Sponsors

Google

Gold Sponsor
Google is a global technology leader focused on improving the ways people connect with information. Google's innovations in web search and advertising have made its website a top internet property and its brand one of the most recognized in the world. For more information, visit...


Monday July 11, 2016 17:30 - 19:00 IST
Herbert

20:00 IST

Open Source Distributed Load Balancing
Presented by Stefan Safar, Seznam.cz

Monday July 11, 2016 20:00 - 21:00 IST
Ulster
 
Tuesday, July 12
 

08:00 IST

Morning Coffee & Tea
Tuesday July 12, 2016 08:00 - 09:00 IST
Pre-Function Area

09:00 IST

Incident Response @ FB, Facebook's SEV process
Facebook is famous for our MOVE FAST AND BREAK THINGS motto. An important part of MOVING FAST while sustaining reliable systems is to FAIL FAST. This talk presents Facebook's strategy for Incident Response & Root Cause Analysis, called the Site Event (SEV) Process. We'll describe everything from Incident Triage to Remediation, paying special attention to our desire to fix things quickly and to avoid having the same outage twice.

Speakers

Gareth Eason

Engineering Manager, Schibsted Product & Technology
Gareth works as an SRE Manager with Schibsted Product & Technology group. In the past, Gareth has worked with Nokia, Cable & Wireless, HEAnet, Google and Facebook. Come ask me about Linux systems, care and feeding of large systems, CDN infrastructure or the successful use of Raspberry...


Tuesday July 12, 2016 09:00 - 09:40 IST
Lansdowne

09:00 IST

The Many Ways Your Monitoring is Lying To You

Monitoring and dashboarding systems are crucial to understanding the behavior of large distributed systems. But monitoring systems can lead you on wild goose chases, or hide issues. In this talk, I will look at some examples of how a monitoring system can lie to you – in order to sensitize the audience to these failure modes and encourage them to look for similar examples in their own systems.


Speakers

Sebastian Kirsch

Site Reliability Engineer, Google Switzerland GmbH
Sebastian Kirsch is a Site Reliability Engineer for Google in Zürich, Switzerland. Sebastian joined Google in 2006 in Dublin, Ireland, and has worked both on internal systems like Google's web crawler or Google's payment processing systems, as well as on external products like Google...


Tuesday July 12, 2016 09:00 - 09:40 IST
Pembroke Room

09:00 IST

Accident Models in Post Mortems

Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. In this two-part session, we will cover the theory and fundamentals of complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.

Speakers

Will Gallego

Software Engineer, Fastly
Will Gallego is a systems engineer with 15+ years of experience in the web development field, currently as a Senior Software Engineer at Fastly. Comfortable with several parts of the stack, he focuses now on building scalable, distributed backend systems and tools to help engineers...

Miriam Lauter

Software Engineer, Etsy
I'm a software engineer on Etsy's payments team and a summer 2014 Recurse Center alum. Outside work, I'm an avid rock climber and 99pi podcast listener.


Tuesday July 12, 2016 09:00 - 10:20 IST
Munster

09:00 IST

Statistics for Engineers

Gathering telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge is to make sense of it and extract valuable information, such as:
  • Is the system down?
  • Is user experience degraded for some percentage of our customers?
  • How did our query response times change with the last update?

Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you in your daily work as an SRE. We will cover probabilistic models and summarizing distributions with mean values, quantiles, and histograms, along with how these relate to one another.

The tutorial focuses on practical aspects, and will give you hands-on knowledge of how to handle, import, analyze, and visualize telemetry data with UNIX tools and the IPython toolkit.
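
As a small illustration of the summaries named above, the following sketch computes a mean, a median, a 90th percentile, and a coarse histogram for a handful of latency samples. It is not taken from the tutorial material: the data is made up, and only Python's standard library is used rather than the UNIX tools and IPython toolkit the tutorial works with.

```python
import statistics
from collections import Counter

# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]

mean = statistics.mean(latencies_ms)
median = statistics.median(latencies_ms)
# quantiles(n=10) returns the 9 cut points between deciles; the last is p90.
p90 = statistics.quantiles(latencies_ms, n=10)[-1]

# A coarse histogram: count samples per 10 ms bucket, keyed by bucket start.
histogram = Counter((value // 10) * 10 for value in latencies_ms)

print(f"mean={mean} median={median} p90={p90}")
print(sorted(histogram.items()))
```

The single outlier pulls the mean far above the median, which is one reason quantiles and histograms are often more informative than averages for latency data.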

This tutorial has been given on several occasions over the last year and has been refined and extended since; cf. Twitter #StatsForEngineers.

Speakers

Heinrich Hartmann

Analytics Lead, Circonus
Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician...


Tuesday July 12, 2016 09:00 - 12:20 IST
Ulster

09:40 IST

Production Improvement Review (PIR)
Azure SRE works with services that have widely variable maturity, ranging from fully federated DevOps teams, to fully tiered IT/Ops teams, and everything in between. The one thing all of these services have in common is that they have outages. While they all respond in different ways to recover, SRE has to collect and leverage data in a common manner across all services to prevent outages and drive reliability up consistently. In this talk we’ll discuss how SRE leverages diverse data sets to drive improvements across this heterogeneous set of services. SRE ensures that teams are rigorously completing post-incident reviews and addressing their live site debt. We not only look at the actual repair debt, but we’ve also introduced a new concept called “virtual debt”, which shows where a service's incident response faltered but no appropriate repair was logged. Virtual debt is affectionately referred to as “PacMan debt” due to the appearance of the chart. The greater the virtual debt, the bigger the bite.

We’ll also discuss how we expose the data in near-real-time dashboards that allow team members from the director all the way down to the IC to see relevant views and take the appropriate action. ICs can find incomplete postmortems they need to work on, a service director can view their accumulated debt to prioritize resources, and a dev manager can review virtual debt to ensure the team is conducting rigorous postmortems. By analyzing historical outages, we’ve found that missed detection leads to an exponential increase in mitigation times. We’ve collected a myriad of other insights by mining historical outage data and using charts and creative visualizations to surface them, including surprising proxy metrics that influence uptime, and some specific examples of actions we’ve taken to improve service quality based on the data.

Speakers

Martin Check

Principal Engineering Manager, Microsoft
I work on problem management, incident management, and delivering services for SRE. I'm particularly interested in uplifting services from traditional operating models into Devops/SRE operating models, collating and analyzing data on service health to identify themes, and developing...


Tuesday July 12, 2016 09:40 - 10:20 IST
Lansdowne

09:40 IST

Practical Anomaly Detection and Alerting
This talk will debunk the common belief that solving more advanced monitoring use cases and getting more complete alerting coverage requires complex, often math-oriented solutions such as machine learning and stream processing.

Instead we will set out a clear context and the pros and cons of such approaches, and zoom in on how we can get dramatically better alerting, as well as make our lives a lot easier, by using familiar concepts understandable to everyone, such as basic logic, basic math, and metric metadata, even for solving complicated alerting problems.

We will also see how we can optimize the overall experience of adjusting and maintaining alerting rules over time by focusing on the concept of an alerting IDE, exemplified by Bosun. The talk will present techniques and concrete examples of how to handle advanced alerting scenarios using these principles.

Speakers

Dieter Plaetinck

My ops experience is mostly from working at Netlog and Vimeo. For the last few years I've been working 100% on open source monitoring. For past talks (including LISA in Seattle), see http://dieter.plaetinck.be/talks/


Tuesday July 12, 2016 09:40 - 10:20 IST
Pembroke Room

10:20 IST

Break with Refreshments
Tuesday July 12, 2016 10:20 - 11:00 IST
Pre-Function Area

11:00 IST

The Next Linux Superpower: eBPF Primer

Imagine you're tackling one of those elusive performance issues in the field, and your go-to monitoring checklist doesn't seem to cut it. There are plenty of suspects, but they are moving around rapidly and you need more logs, more data, more in-depth information to make a diagnosis. Maybe you've heard about DTrace, or even used it, and are yearning for a similar toolkit, which can plug dynamic tracing into a system that wasn't prepared or instrumented in any way.

Hopefully, you won't have to yearn for a lot longer. eBPF (extended Berkeley Packet Filters) is a kernel technology that enables a plethora of diagnostic scenarios by introducing dynamic, safe, low-overhead, efficient programs that run in the context of your live kernel. Sure, BPF programs can attach to sockets; but more interestingly, they can attach to kprobes and uprobes, static kernel tracepoints, and even user-mode static probes. And modern BPF programs have access to a wide set of instructions and data structures, which means you can collect valuable information and analyze it on-the-fly, without spilling it to huge files and reading them from user space.

In this talk, we will introduce BCC, the BPF Compiler Collection, which is an open set of tools and libraries for dynamic tracing on Linux. Some tools are easy and ready to use, such as execsnoop, fileslower, and memleak. Other tools such as trace and argdist require more sophistication and can be used as a Swiss Army knife for a variety of scenarios. We will spend most of the time demonstrating the power of modern dynamic tracing -- from memory leaks to static probes in Ruby, Node, and Java programs, from slow file I/O to monitoring network traffic. Finally, we will discuss building our own tools using the Python and Lua bindings to BCC, and its LLVM backend.


Speakers

Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft Regional Director and MVP, Pluralsight and O’Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor...


Tuesday July 12, 2016 11:00 - 11:40 IST
Lansdowne

11:00 IST

The Structure and Interpretation of Graphs

Speakers

Niall Murphy

Instigator/editor/author/etc of Google SRE book


Tuesday July 12, 2016 11:00 - 11:40 IST
Pembroke Room

11:00 IST

Post Mortem Facilitation

Speakers

Will Gallego

Software Engineer, Fastly
Will Gallego is a systems engineer with 15+ years of experience in the web development field, currently as a Senior Software Engineer at Fastly. Comfortable with several parts of the stack, he focuses now on building scalable, distributed backend systems and tools to help engineers...

Miriam Lauter

Software Engineer, Etsy
I'm a software engineer on Etsy's payments team and a summer 2014 Recurse Center alum. Outside work, I'm an avid rock climber and 99pi podcast listener.


Tuesday July 12, 2016 11:00 - 12:20 IST
Munster

11:40 IST

The Virtuous Cycle: Getting Good Things out of Bad Failures
System failures happen. Hardware dies, software crashes, capacity gets exceeded, and any of these things can cause unexpected effects in the most carefully-architected systems.

At Heroku, we deal with complex systems failures. We’re running a platform as a service: our whole business model requires us to provide operations for our customers so they don’t have to do it themselves. We run over a million Postgres databases, tens of thousands of Redis instances, and hundreds of thousands of dynos on thousands of AWS instances.

What do we get out of these incidents? Pain and suffering? Yes, sometimes. We also get data about how our systems are actually working. We get ideas for making it work better. And sometimes we get ideas for whole new products.

In this talk, I’ll discuss how to take the bad of a system failure and turn it into good: better products, more reliable platforms, and less stressed engineers.

Speakers

Joy Schamen

SRE Director, Heroku


Tuesday July 12, 2016 11:40 - 12:20 IST
Lansdowne

11:40 IST

Alerting for Distributed Systems—A Tale of Symptoms and Causes, Signals and Noise
Noisy alerts are the deadly sin of monitoring. They obfuscate real issues and cause pager fatigue. Instead of reacting with the due sense of urgency, the person on-call will start to skim or even ignore alerts, not to mention the destruction of their sanity and work-life balance. Unfortunately, there are many monitoring pitfalls on the road to complex production systems, and most of them result in noisier alerts. In distributed systems, and in particular in a microservice architecture, there is usually a good understanding of local failure modes while the behavior of the system as a whole is difficult to reason about. Thus, it is tempting to alert on the many possible causes – after all, finding the root cause of a problem is important. However, a distributed system is designed to tolerate local failures, and a human should only be paged on real or imminent problems of a service, ideally aggregated to one meaningful alert per problem. The definition of a problem should be clear and explicit rather than relying on some kind of automatic “anomaly detection”. Taking historical trends into account is needed, though, to detect imminent problems. Those predictions should be simple rather than “magic”. Alerting because “something seems weird” is almost never the right thing to do.

SoundCloud's long way from noisy pagers to much saner on-call rotations will serve as a case study, demonstrating how different monitoring technologies, among them most notably Prometheus, have affected alerting.
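
To illustrate what a simple, explicit prediction from historical trends might look like, here is a minimal Python sketch. It is not taken from the talk: the function names, the disk-usage samples and the four-hour paging horizon are all assumptions chosen for illustration. It fits a straight line through recent usage samples and pages only if the resource is projected to fill up within the horizon.

```python
def hours_until_full(samples, capacity):
    """Estimate hours until `capacity` is reached.

    samples: list of (hour, used) pairs, at least two with distinct hours.
    Uses a least-squares slope over the samples and extrapolates from the
    most recent sample. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return (capacity - last_used) / slope


def should_page(samples, capacity, horizon_hours=4):
    """Page only on an imminent problem: projected full within the horizon."""
    remaining = hours_until_full(samples, capacity)
    return remaining is not None and remaining <= horizon_hours


# Hypothetical disk usage in GB, sampled hourly, on a 100 GB volume.
usage = [(0, 50), (1, 60), (2, 70)]
print(hours_until_full(usage, 100), should_page(usage, 100))
```

The rule is explicit (a named resource, a stated horizon), which is the opposite of alerting because "something seems weird".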

Speakers

Björn Rabenstein

Engineer, Grafana Labs
Björn is a Production Engineer at SoundCloud and a Prometheus developer. Previously, he was a Site Reliability Engineer at Google and a number cruncher for science.



Tuesday July 12, 2016 11:40 - 12:20 IST
Pembroke Room

12:20 IST

13:40 IST

Challenges of Machine Learning at Scale
Motivated by the problem of predicting whether any given ad would be clicked in response to a query, in this introductory talk we outline the requirements and large-system design challenges that arise when designing a machine learning system that makes millions of predictions per second with low latency, learns quickly from the responses to those predictions, and maintains a consistent level of model quality over time. We present alternatives for meeting those challenges using diagrams of machine learning pipelines.

Concepts used in this talk: machine learning (classification), software pipelines, sharding and replication, map-reduce

Speakers

Graham Poulter

SRE, Google
I work at Google Dublin as an SRE on machine learning pipelines used in Ads & Commerce, helping make them reliable and efficient, including not wasting human time on things like updating config and software. Originally from South Africa, I also facilitate technical training and enjoy...


Tuesday July 12, 2016 13:40 - 14:20 IST
Lansdowne

13:40 IST

Lightning Talks

API Management—Why Speed Matters
Arianna Aondio, Varnish Software

Reverse Engineering the “Human API” for Automation and Profit
Nati Cohen, SimilarWeb

What a 17th Century Samurai Taught Me about Being an SRE
Caskey L. Dickson, Microsoft

Chatops/Automation: How to get there while everything's on fire
Fran Garcia, Hosted Graphite

Sysdig Love
Alejandro Brito Monedero, Alea Solutions

Automations with Saltstack
Effie Mouzeli, Logicea, LLC

Myths of Network Automation
David Rothera, Facebook

DNS @ Shopify
Emil Stolarsky, Shopify

Hashing Infrastructures
Jimmy Tang, Rapid7


Speakers

Arianna Aondio

Field Engineer, Varnish Software
I'm Italian, living in Norway. I'm a field engineer for Varnish Software, working on website performance. I love cooking, travelling and skiing.

Nati Cohen

HERE Mobility
Nati Cohen is a Production Engineer at Here Technologies and a Teaching Assistant at the Interdisciplinary Center Herzliya. Previous experience includes: operations consulting, software development, *nix administration and security research in the Intelligence Corps as well as in...

Fran Garcia

SRE, Hosted Graphite
Currently the SRE team lead at Hosted Graphite, Fran has previously been mostly responsible for causing (and occasionally preventing) outages in varied fields such as advertising, online gaming and sports betting. Do not ask him about chatops.

Effie Mouzeli

Systems Engineer, Logicea LLC
Systems Engineer at Logicea, a young software house. Main responsibilities are operations and automation (Deployment Pipelines, Configuration Management, etc.), assisting in product architecture, working closely with developers and, occasionally, pulling rabbits out of hats and chasing them.

David Rothera

Production Engineer, Facebook

Emil Stolarsky

Production Engineer, Shopify
Emil is a production engineer at Shopify where he works on performance, scriptable load balancers, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's shivering over a spiked cup of coffee in the great Canadian north.
Jimmy Tang



Tuesday July 12, 2016 13:40 - 15:00 IST
Pembroke Room

13:40 IST

Effective Design Review Participation

This workshop is a part of the "full lifecycle" workshop track which includes Post-Mortems, Incident Response, and Effective Design Review Participation. Using several example cases, participants in this session will learn to apply a variety of different points of view to analyze a design for issues which could affect its reliability and operability.

The sample designs and play list can be found at https://goo.gl/VIiN6i - now updated with the comments and suggestions that came in during the workshop.


Speakers

Kurt Andersen

Program Committee, LinkedIn
Kurt Andersen was one of the co-chairs for SREcon-Americas in 2017 and 2018. He has been active in the anti-abuse community for over 20 years and is currently the senior IC for the Product SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging...



Tuesday July 12, 2016 13:40 - 15:00 IST
Munster

13:40 IST

DivOps, Continuous Diversity at Scale

This tutorial/workshop is aimed at management and individual contributors alike. We will work together on how to encourage and nurture a diversity culture in day-to-day ops teams. First we will discuss the concepts of 2- and 3-dimensional diversity, and the statistics around diverse teams' performance. Then we will map out how to design, build, deploy and operate a diversity plan in our teams. This will include diversity goal setting and explicit cultural evolution, hiring processes, day-to-day communications, review processes and team collaboration. Where possible we will encourage groups to break out and evaluate their own cultures and processes.

Speakers

Tuesday July 12, 2016 13:40 - 17:00 IST
Ulster

14:20 IST

Panel: Oncall
Moderators

Laura Nolan

Google LLC
I am an SRE and tech lead at Google, working in our ads data infrastructure. I presented a workshop and a talk at SRECon Europe 2015, and have presented in workshops at other USENIX conferences (LISA, federated conferences) and FLOSS UK.

Tuesday July 12, 2016 14:20 - 15:00 IST
Lansdowne

15:00 IST

Break with Refreshments
Tuesday July 12, 2016 15:00 - 15:40 IST
Pre-Function Area

15:40 IST

Lessons from Automatic Incident Resolution for a Million Databases
In the beginning of Heroku Postgres, low pager volume was a sign of broken monitoring, not a healthy fleet. Running customer databases on AWS, we needed an automated way to resolve routine failures, so we slowly grew a state-machine-based framework to handle issues with server and service availability, filled disks, failed backups, failed server boots, stuck EBS volumes and other incidents that would otherwise wake an engineer. This framework emerged as the basis for writing flexible and robust incident resolution automation, and it now powers High Availability and Disaster Recovery for the Heroku Postgres and Heroku Redis services.

This talk will cover:
- An overview of how to run Postgres and Redis in the cloud
- An incomplete survey of what can go wrong in the cloud and how we fix it
- An introduction to state machines
- How to convert playbooks into state machines
- Proper panics and circuit breakers for your automation
- Limits of automation
- Limits of people
- What we would change if we could do it again
- How to continue scaling up and out
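
As a rough illustration of the "playbooks into state machines" and "circuit breakers" items above, here is a minimal Python sketch. It is not Heroku's framework: the state names, the events and the `max_steps` limit are hypothetical, chosen only to show the shape of the idea. Each playbook step becomes a transition, anything unexpected hands off to a human, and a step limit stops the automation from looping forever.

```python
# Hypothetical "disk full" playbook expressed as a transition table:
# (current state, observed event) -> next state.
TRANSITIONS = {
    ("detected", "disk_full"): "expanding_disk",
    ("expanding_disk", "ok"): "verifying",
    ("expanding_disk", "failed"): "paging_human",
    ("verifying", "ok"): "resolved",
    ("verifying", "failed"): "paging_human",
}
TERMINAL = {"resolved", "paging_human"}


def run(events, start="detected", max_steps=10):
    """Drive the state machine over a sequence of events.

    Unknown (state, event) pairs fall through to "paging_human", and the
    max_steps circuit breaker does the same if the automation takes too
    many steps.
    """
    state = start
    steps = 0
    for event in events:
        if state in TERMINAL:
            break
        steps += 1
        if steps > max_steps:
            return "paging_human"
        state = TRANSITIONS.get((state, event), "paging_human")
    return state


print(run(["disk_full", "ok", "ok"]))
print(run(["disk_full", "failed"]))
```

Defaulting to a human on any unknown transition keeps the automation conservative: it only acts on situations the playbook explicitly describes.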

Speakers

Greg Burek

Engineer, Heroku
Greg Burek is a senior engineer with the Heroku Data team, which runs Heroku Postgres and Heroku Redis. He is a core contributor to, and on the pager rotation for, the automated incident resolution system referenced in this talk.


Tuesday July 12, 2016 15:40 - 16:00 IST
Lansdowne

15:40 IST

My Service Runs at 99.999%...All Those Tweets about Outages Are Not Real: It's Our Competition Trying to Malign Us!

Do you have services where the owners claim they run at five 9's but you often run into errors? It's very easy and convenient to build metrics at the service level. These often hide a wide array of issues that users might face. Having the right metrics is a key component of building a sustainable SRE culture. This talk goes into the design of these metrics, with real-world examples to illustrate good and bad designs.
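
As a small illustration of how a service-level number can hide what individual users see, the Python sketch below compares an aggregate success rate with per-customer success rates. The customer names and request counts are invented for the example and do not come from the talk.

```python
# Hypothetical request counts per customer: (successful, total).
requests = {
    "customer_a": (99_999_950, 100_000_000),
    "customer_b": (4_999_990, 5_000_000),
    "customer_c": (40, 100),  # small customer, mostly failing
}


def aggregate_rate(reqs):
    """Success rate over all requests, regardless of who sent them."""
    ok = sum(o for o, _ in reqs.values())
    total = sum(t for _, t in reqs.values())
    return ok / total


def customers_below(reqs, objective):
    """Customers whose own success rate is below the objective."""
    return sorted(
        name for name, (ok, total) in reqs.items() if ok / total < objective
    )


five_nines = 0.99999
print(aggregate_rate(requests) > five_nines)
print(customers_below(requests, five_nines))
```

With these numbers the aggregate clears five 9's because the large customers dominate the total, while `customer_c` sees most of its requests fail.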


Speakers

Kumar Srinivasamurthy

Microsoft Corp
Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing and Cortana Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening...


Tuesday July 12, 2016 15:40 - 16:00 IST
Pembroke Room

15:40 IST

Practical Incident Response

This workshop is structured as a fast-moving but fun game (think Fluxx crossed with a hectic oncall shift) but the subject matter is entirely serious: we will use it to explore best practices and pitfalls for managing incidents as a team. You will work as part of a team managing a production outage: we'll go through the entire process from detection of the incident through problem diagnosis, mitigation, and resolution, finishing with the first draft of the postmortem.

Speakers

Laura Nolan

Google LLC
I am an SRE and tech lead at Google, working in our ads data infrastructure. I presented a workshop and a talk at SRECon Europe 2015, and have presented in workshops at other USENIX conferences (LISA, federated conferences) and FLOSS UK.


Tuesday July 12, 2016 15:40 - 17:00 IST
Munster

16:00 IST

Moving a Large Workload from a Public Cloud to an OpenStack Private Cloud: Is It Really Worth It?
Speakers

Nicolas Brousse

TubeMogul
Nicolas Brousse is Senior Director of Operations Engineering at TubeMogul (NASDAQ: TUBE). The company's sixth employee and first operations hire, Nicolas has grown TubeMogul's infrastructure over the past seven years from several machines to over two thousand servers that handle billions...


Tuesday July 12, 2016 16:00 - 16:20 IST
Lansdowne

16:00 IST

Availability Objectives of SoundCloud’s Microservices
In a microservices architecture, different services usually have different availabilities. It is often hard to see how the availability of a single service affects the availability of the overall system. Without a clear idea about the availability requirements of individual services, even a seemingly subtle degradation of a service can cause a critical outage. Unfortunately these are discovered only after thorough post-mortems. At SoundCloud we kicked off a project called “Availability Objectives”. An availability objective is the minimum availability a service is allowed to have. These objectives are calculated based on the requirements of the clients of those services. We started by visiting all of our services and setting an availability objective for each of them. We built tools to expose the availability of these services and to flag the ones that drop below their objectives. As a result, we can now make informed decisions about the integration points we need to improve first. This talk will share the insights we gained via this project and how it affected our overall availability and engineering productivity.
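
To make the idea concrete, here is a minimal Python sketch of the two calculations the abstract implies: how the availability of individual services combines along a request path, and how to flag services that drop below their objective. It is not SoundCloud's tooling. The service names and numbers are hypothetical, and the combination rule assumes every dependency is required and that failures are independent.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a path that needs every dependency to succeed.

    Assumes independent failures, so the availabilities multiply.
    """
    return prod(availabilities)


def below_objective(measured, objectives):
    """Names of services whose measured availability is under their objective."""
    return sorted(
        name for name, value in measured.items() if value < objectives[name]
    )


# Hypothetical measurements and objectives.
measured = {"search": 0.9991, "stream": 0.9985, "profile": 0.9999}
objectives = {"search": 0.999, "stream": 0.9995, "profile": 0.999}

print(serial_availability(measured.values()))
print(below_objective(measured, objectives))
```

Three dependencies at 99.9% each give a path availability of roughly 99.7%, which is why a seemingly small degradation in one service can matter to the overall system.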

Speakers

Bora Tunca

Bora is a software developer at SoundCloud. He started his journey there three years ago. As a generalist, he has worked on various parts of their architecture. Nowadays he is part of Core Engineering, where he helps to build and integrate the core business services of SoundCloud...


Tuesday July 12, 2016 16:00 - 16:20 IST
Pembroke Room

16:20 IST

My Scariest Day: When Things Go All Wrong

Lightning Talks session


Moderators

Gareth Eason

Engineering Manager, Schibsted Product & Technology
Gareth works as an SRE Manager with Schibsted Product & Technology group. In the past, Gareth has worked with Nokia, Cable & Wireless, HEAnet, Google and Facebook. Come ask me about Linux systems, care and feeding of large systems, CDN infrastructure or the successful use of Raspberry...

John Looney

SREconEU Program Chair

Tuesday July 12, 2016 16:20 - 17:00 IST
Lansdowne

16:20 IST

Downtime Budgets
The concept of the error budget is a great way to hack SLAs and make them into a positive tool for system engineers. But how can you take the same idea from a world that handles millions of transactions in a day to one that handles hundreds, but on the same hardware scale? High Performance Computing jobs run for hours, days, or weeks at a time, resulting in unique challenges related to system availability, maintenance, and experimentation. In this talk I will explore how we plan to modify the error budget concept to fit in an HPC environment by applying the same idea to cluster outages. With that specific example as the foundation, I will conclude with some thoughts on how the ideas generated in large scale web environments can be used in similarly sized environments running very different workloads.
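
The arithmetic behind such a budget is simple, and a short Python sketch may help. The availability target, the 90-day period and the outage durations below are hypothetical and not taken from the talk. The budget is the fraction of time the target allows the cluster to be down, and each outage (planned or not) is subtracted from it.

```python
def downtime_budget_hours(availability_target, period_hours):
    """Hours of downtime allowed in the period by the availability target."""
    return (1.0 - availability_target) * period_hours


def remaining_budget_hours(availability_target, period_hours, outage_hours):
    """Budget left after subtracting the outages already taken."""
    budget = downtime_budget_hours(availability_target, period_hours)
    return budget - sum(outage_hours)


# Hypothetical cluster: 99% availability target over a 90-day quarter.
quarter_hours = 90 * 24
outages = [8.0, 6.5]  # e.g. one maintenance window and one failure

print(f"budget: {downtime_budget_hours(0.99, quarter_hours):.1f} h")
print(f"remaining: {remaining_budget_hours(0.99, quarter_hours, outages):.1f} h")
```

A positive remainder leaves room for maintenance or experimentation; a negative one suggests holding off on risky changes until the next period.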

Speakers

Cory Lueninghoener

Wireless Couch Labs
Cory Lueninghoener leads the HPC Design Group at Los Alamos National Laboratory. He has helped design, build, and manage some of the largest scientific computing resources in the world, including systems ranging in size from 100,000 to 900,000 processors. He is especially interested...


Tuesday July 12, 2016 16:20 - 17:00 IST
Pembroke Room

17:30 IST

Happy Hour, Sponsored by Facebook
Tuesday July 12, 2016 17:30 - 18:30 IST
Herbert

19:00 IST

Tuesday BoFs
Ad hoc topics of interest - for current details, please see: https://www.usenix.org/conference/srecon16europe/bofs

BoFs may be scheduled in advance by contacting bofs@usenix.org with "SREcon16 Europe BoF" in the subject line and the following information in the body of the email:

  1. BoF title
  2. Organizer name and affiliation
  3. Date and time preference
  4. Brief description of BoF (optional)


Tuesday July 12, 2016 19:00 - 22:00 IST
TBA
 
Wednesday, July 13
 

08:00 IST

Morning Coffee & Tea
Wednesday July 13, 2016 08:00 - 09:00 IST
Pre-Function Area

09:00 IST

Past, Present, and Future of Network Operations

Historically the network has lacked the skills, the tools and even the means to fully embrace automation or build abstractions for the rest of the organization to consume. However, the tide is changing and most modern equipment nowadays provides standard Linux tools or open APIs to interact with it.

In this talk, we will explore how to build network abstractions and leverage the experience gathered by the devops community over the years to expose the network to the organization, increase agility and provide situational awareness.


Speakers

David Barroso

Network Systems Engineer, Fastly
David is a Network Systems Engineer at Fastly, where he spends his time working on the network with code and thinking of ways to integrate it with the application.



Wednesday July 13, 2016 09:00 - 09:20 IST
Pembroke Room

09:00 IST

Relieving Technical Debt through Short Projects
It's easy to plan out month-long or year-long projects, or to have an interrupts rotation for dealing with oncall/tickets, but how do you make sure you're also doing the short week-long projects that can relieve your technical debt? I'll cover a planning approach that my team found that makes room for all three sets of work, reducing in the long term the operational burden of the services we operate.

Speakers

Liz Fong-Jones

Developer Advocate, Activist, and Site Reliability Engineer, Google
Liz is a Staff Site Reliability Engineer at Google and works on the Google Cloud Customer Reliability Engineering team in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates...


Wednesday July 13, 2016 09:00 - 09:20 IST
Lansdowne

09:00 IST

Distributed Log-Processing Design Workshop

Participants will have the opportunity to try their hand at designing a reliable, distributed, multi-datacenter near-real-time log processing system.

The session will start with a short presentation on lessons learned about designing reliable distributed systems, and then participants will break out in small groups, assisted by Google facilitators, and try their hand at solving a real-world design challenge, from high-level architecture down to an estimate of the computing resources required to run the service.

The session will likely appeal to experienced engineers who want to have fun tackling a real-world design problem faced by many teams in Google.


Speakers

Andrea Spadaccini

SRE Manager, Google


Wednesday July 13, 2016 09:00 - 12:40 IST
Munster

09:00 IST

Docker From Scratch

Docker is very popular these days, but how many of us are really familiar with the basic building blocks of Linux containers and their implications? What's missing in the good ol’ chroot jails? What are the available Copy-on-Write options and what are their pros and cons? Which syscalls allow us to manipulate Linux namespaces and what are their limitations? How do resource limits actually work? What different behaviours do containers and VMs have?

In this hands-on workshop, we will build a small Docker-like tool from O/S level primitives in order to learn how Docker and containers actually work. Starting from a regular process, we will gradually isolate and constrain it until we have a (nearly) full container solution, pausing after each step to learn how our new constraints behave.

Speakers

Nati Cohen

HERE Mobility
Nati Cohen is a Production Engineer at Here Technologies and a Teaching Assistant at the Interdisciplinary Center Herzliya. Previous experience includes: operations consulting, software development, *nix administration and security research in the Intelligence Corps as well as in...

Avishai Ish-Shalom

Avishai Ish-Shalom is a veteran Ops and a survivor of many production skirmishes. Avishai helps companies deal with web era operations and scale as an independent consultant. In his spare time Avishai is spreading weird ideas and conspiracy theories such as DevOps.


Wednesday July 13, 2016 09:00 - 14:40 IST
Ulster

09:20 IST

Bridging Multicast to the Cloud
As more organizations move their workloads to cloud providers, they may discover small gotchas that prevent them from easily running the same applications that they’re used to within a traditional on-premises environment. One such example is multicast: none of the big players support multicast traffic between nodes in their cloud offerings. The solution? Overlay a mesh virtual network - n2n - that supports multicast between nodes and can be extended to include on-prem systems. In this talk, we’ll go over how to implement n2n in a resilient, scalable fashion to link on-prem and cloud environments through a gateway.

Speakers

Daniel Emord

Lead Site Reliability Consultant, Pythian
Dan Emord has been designing and deploying multiplatform solutions for 8 years and is currently a Lead Site Reliability Consultant at Pythian. Dan runs the gamut of client requests with Pythian’s open scope engagements, including architecting and implementing a wide variety of...


Wednesday July 13, 2016 09:20 - 09:40 IST
Pembroke Room

09:20 IST

Running Storage at Facebook
The mission of the Data Warehouse Storage team at Facebook is to run an HDFS deployment that stores hundreds of petabytes reliably and efficiently.

Due to its stateful nature, there are some unique challenges to operating a storage system. How do we take a machine out of production for repair without compromising data availability? What trade-offs do we make between replication strategy and data availability to make sure we get more bang for the buck? How do we ensure that colocated tasks that run on our storage nodes exploit available resources as much as they can without tipping hosts over?

In this talk, we'll share some of the lessons that we have learnt from running HDFS at Facebook. We will discuss our biggest operational challenges and we'll outline the evolution of the different solutions that we have put in place over time. We will also introduce Warm Storage, a novel block storage system that we built at Facebook to replace HDFS, and we'll discuss how our learnings from HDFS have affected the design of Warm Storage.

Speakers

Federico Piccinini

Production Engineer, Facebook
Federico is a Production Engineer at Facebook and has been working for the past year and a half on large scale block storage systems. Before Facebook, Federico helped run the Storage infrastructure at Spotify. Likes: open source, large scale distributed systems and Broadway...


Wednesday July 13, 2016 09:20 - 09:40 IST
Lansdowne

09:40 IST

Full-Mesh IPsec Network: 10 Dos and 500 Don'ts
How do you secure your internal network when your servers are located on different continents with different providers and you don't trust your network?

IPsec is a great way to secure a network, but it's usually deployed as a way of connecting a small group of trusted networks, and both tools and existing documentation reflect this. That is not really an option in environments where you don't control the network and want to interoperate across different providers, so you find yourself sailing through uncharted waters at times when trying to build a fully meshed network with IPsec, where each server can establish a secure connection to any other server in its cluster.

We wanted any of our servers around the world to be able to communicate securely with any other. We were using a peer-to-peer VPN, but it broke down badly at scale and we chose to go with IPsec. It wasn't a smooth transition; the tools were terrible, the documentation was vague and incomplete and we found some horrible bugs, but we survived and want to share with you some of the lessons we learned, what you definitely shouldn't do, and why you might want to do this.

Speakers

Fran Garcia

SRE, Hosted Graphite
Currently the SRE team lead at Hosted Graphite, Fran has previously been mostly responsible for causing (and occasionally preventing) outages in varied fields such as advertising, online gaming and sports betting. Do not ask him about chatops.


Wednesday July 13, 2016 09:40 - 10:20 IST
Pembroke Room

09:40 IST

Linux Kernel Building, Testing and Deployment at Facebook
The kernel team at Facebook works on both features and fixes for the upstream Linux community, as well as pulling in patches to apply to the kernels run in the Facebook production fleet. This is done in order to support new and upcoming hardware variations, as well as fix standing issues in the environment and improve performance. We aim to roll out a new kernel to a large portion of the fleet, as often as possible.

In this talk, we will explore how the kernel PE team has worked to automate the build and install process, rolling a canary of the newly built kernels every day, and gathering thousands of tests to validate each kernel before we push out to other tiers to upgrade. We run a series of integration tests across multiple hardware types and generations, do performance and correctness tests on the newly built kernels, and release the new kernel through multiple phases of release candidates and canary groups to gain confidence in the new builds. By doing this, we get a baseline for expectations when moving to the new kernel. We then work with individual tier owners to handle the upgrade, allowing for a regular kernel release and a sustainable support model in a way that is compatible with all of the different services. All this work allows us to run kernels that are as close as possible to the upstream releases.

Speakers

Yannick Brosseau

Production Engineer, Facebook
Yannick Brosseau is a Production Engineer on the Kernel team at Facebook. As such he works on improving the stability and performance of the kernels deployed on the Facebook infrastructure and develops testing, monitoring and deployment tools to help in this endeavor. Previously...

Phillip Duncan

Production Engineer, Facebook


Wednesday July 13, 2016 09:40 - 10:20 IST
Lansdowne

10:20 IST

Break with Refreshments
Wednesday July 13, 2016 10:20 - 11:00 IST
Pre-Function Area

11:00 IST

Scaling Shopify's Multi-Tenant Architecture across Multiple Datacenters
Multi-tenant architectures are a very convenient and economical way to share resources like web servers, job workers, and datastores among several customers on your platform. Even the smallest Shopify store on a $9/month plan can easily survive getting hammered with a 1M RPM flash sale by leveraging the resources of the entire platform. However, architectures like this can also have several drawbacks. They are potentially harder to scale and things like resource starvation or back-end outages are harder to isolate.

In this talk, I’m going to walk you through the history of how Shopify grew from being a small standard single-database single-datacenter Rails application to the multi-database multi-datacenter setup that we run today. We will talk about the advantages in terms of resiliency, scalability, and disaster recovery that this architecture gives us, how we got there, and where we want to go in the future.

You will learn about things like how to use the Border Gateway Protocol and Equal-Cost Multi-Path routing for implementing intra-datacenter high availability, how we implement our own load balancing algorithms, what it takes to prepare a Ruby on Rails application for a move like this, and how we do completely scripted datacenter failovers in a matter of seconds with no considerable downtime.

Speakers

Florian Weingarten

Shopify
Florian is a production engineer at Shopify. For the past 5 years, he has been working on all aspects of Shopify's sharding and multi-tenancy stack, including resiliency, region failovers, load distribution and isolation, shard rebalancing, as well as Shopify's migration to Google...


Wednesday July 13, 2016 11:00 - 11:40 IST
Pembroke Room

11:00 IST

Extreme OS Kernel Testing

Fuzz testing has been used to evaluate the robustness of operating system distributions for over twenty years. Eventually, a fuzz test suite will suffer from reduced effectiveness. The first obstacle is the pesticide paradox: as you fix the easy defects, it gets difficult to find the remaining obscure defects. Also, test execution and the debug/fix cycle tend to be manual work that can take hours or even days of effort. During the presentation, a structured framework for creating new fuzz tests will be introduced, along with a competitive analysis approach used to minimize defect reproduction complexity.


Speakers

Kirk Russell

Production Engineer, Shopify
Kirk is currently a Production Engineer at Shopify, making sure that our docker image build system can keep up with 22 launches a day.


Wednesday July 13, 2016 11:00 - 11:40 IST
Lansdowne

11:40 IST

Leading a Team with Values
Having a small set of authentic, opinionated, collaboratively formed core values can be the magic ingredient to building a high performing, happy team. In this talk you'll hear the story of how, in the space of a few short months, the Intercom Ops team went from doing OK to AWESOME. We'll tell you about our core values, our values for creating values, our happiness metrics and finally, about how this approach can be applied to other teams.

Speakers

Rich Archbold

Director of Engineering, Intercom
Richard Archbold is an Engineering Director at Intercom, a highly successful and fast growing Irish technology startup company that provides customer communication software to Internet businesses. Intercom's mission is to make web business personal. Previous to Intercom, Richard has...


Wednesday July 13, 2016 11:40 - 12:20 IST
Pembroke Room

11:40 IST

DNS: Old solution for modern problems

As infrastructure becomes more complex, dynamic, and diverse, service discovery becomes very important.

There are many solutions to this problem (thrift, rest.li, custom-zk, etc.), all of which require application changes, which precludes the use of off-the-shelf software.

We have applications at LinkedIn where it isn't practical to integrate with our internal service discovery systems. After some thought we decided that all of these applications do support a common service discovery system: our old friend DNS.

In this presentation, we'll talk about how we implemented a distributed, highly available, eventually consistent service discovery system using DNS written in Go. We'll talk about the design, implementation, and challenges encountered on the way to production.

We'll focus on:

  • Architecture
  • Extensibility
  • Availability
  • Operability

The Results:

  • Significantly reduced complexity
  • Dramatic decrease in convergence time
  • Ubiquitous service discovery
  • Leverage existing DNS infrastructure

Speakers

Rauf Guliyev

I am a Traffic SRE at LinkedIn responsible for shuffling bits between devices around the world and LinkedIn's service infrastructure. I like to solve all kinds of engineering problems, so I spend my free time building an exoskeleton race kit car.

Thomas Jackson

https://www.linkedin.com/in/jacksontj


Wednesday July 13, 2016 11:40 - 12:20 IST
Lansdowne

12:20 IST

Conference Luncheon
Wednesday July 13, 2016 12:20 - 13:40 IST
Sussex Restaurant

13:40 IST

Active Fault Finding in Networks
One of the key principles of SRE is always knowing your service is broken before your customers notice. Network devices are typically black boxes that fail in mysterious ways. The tradiational methods of network monitoring don't scale well. They also have a key fundamental flaw. Most network monitoring involves asking a network device if it's dropping packets. If that device is fundamentally unhealthy, should you really trust what it tells you about what it is doing?

Facebook has taken a different approach to network monitoring where we actively probe our datacenter networks from as many locations as possible to ensure the network is behaving as we expect. The interesting part comes when we detect loss between different hosts on the network. How do we discover which network device or individual link in the thousands that are available is the root cause of this loss? By combining some magic requests with some basic math, can we automate that detection and triangulate it to a specific device that does not even know it's dropping packets?
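
To make the "basic math" idea concrete, here is a minimal sketch of one way overlapping probe paths can narrow loss down to a link. It assumes the path each probe took is already known and scores every link by the fraction of probes crossing it that saw loss. The function name, the link labels, and the scoring rule are illustrative assumptions, not the method presented in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One probe result: (links traversed, whether the probe saw loss).
Probe = Tuple[List[str], bool]


def rank_suspect_links(probes: Iterable[Probe]) -> List[Tuple[str, float]]:
    """Rank links by the fraction of probes crossing them that saw loss."""
    seen: Dict[str, int] = defaultdict(int)
    lossy: Dict[str, int] = defaultdict(int)
    for links, lost in probes:
        for link in set(links):  # count each link once per probe
            seen[link] += 1
            if lost:
                lossy[link] += 1
    scores = [(link, lossy[link] / seen[link]) for link in seen]
    # Highest loss ratio first; ties are broken by name for stable output.
    return sorted(scores, key=lambda item: (-item[1], item[0]))
```

A link shared by every lossy path but by no healthy one rises to the top, while links that also carry healthy probes are diluted. The more vantage points probe the network, the sharper that separation becomes.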

Speakers

Richard Sheehan

Production Engineer, Facebook
Production Engineer at Facebook with lots of networking-related experience. Spent 10 years at Amazon working on DNS, load balancing, and CDN stuff. Never cried at my desk, not even once. Now build large-scale network monitoring and fault isolation solutions at Facebook.


Wednesday July 13, 2016 13:40 - 14:20 IST
Pembroke Room

13:40 IST

The Knowledge: Towards a Culture of Engineering Documentation

For several years, Google's internal surveys identified the lack of trustworthy, discoverable documentation as the #1 problem impacting internal developer productivity. We're not alone: Stack Overflow's 2016 survey ranked "Poor documentation" as the #2 problem facing engineers. (insert that quote from NYC SRE here re SREs "living and dying" by the docs)

Solving this problem is tough. It's not enough to build tooling; the culture needs to change. Google internal engineering is attacking the challenge three ways: Building a documentation platform; integrating that platform into the engineering toolchain; and building a culture where documentation - like testing - is accepted as a natural, required part of the development process.

In this talk, we'll share our learnings and best practices around both tooling and culture, the evolution of documentation, and some thoughts about how we can transition from the creation of documents towards an ecosystem where context-appropriate, trustworthy documentation is reliably and effortlessly available to the engineers that need it.


Speakers

Riona MacNamara

Staff technical writer, Google
Riona is senior staff technical writer at Google, where she has worked for 11 years, and leads the team that builds g3doc, Google's internal platform for engineering documentation, used by thousands of projects within the company. Before Google, she worked at Amazon and spent almost... Read More →


Wednesday July 13, 2016 13:40 - 14:20 IST
Lansdowne

13:40 IST

Lightning Talks
To sign up for a lightning talk, write on the board outside of Munster Suite.

Wednesday July 13, 2016 13:40 - 14:40 IST
Munster

14:20 IST

Fixing the Internet for Real-Time Applications (Games)
League of Legends (LoL) is not a game of seconds, but of milliseconds. In day-to-day life, two seconds fly by unnoticed, but in-game a two-second stun can feel like an eternity. In any single match of LoL, thousands of decisions made in milliseconds dictate which team scores bragging rights and which settles for “honorable opponent” points. The Internet, however, was not constructed for applications that run like this, essentially in real time.

This talk will discuss the steps Riot Games has taken, and will continue to take, to fix this fundamental problem with commodity Internet, with a specific focus on the work done to improve the experience of our European players.

Speakers

Adam Comerford

Senior Systems Engineer, Riot Games
Adam Comerford is currently a Senior Systems Engineer at Riot Games in Dublin and obsessed with improving the League of Legends experience for players in Europe (and beyond). He has a broad technical background spanning 15+ years and multiple disciplines including networking, distributed... Read More →


Wednesday July 13, 2016 14:20 - 14:40 IST
Pembroke Room

14:20 IST

Dropbox's Naoru: Bridging the Safety Gap from Scripts to Full Auto-Remediation
In Dropbox automation, to bridge the gap between “scripts” and “fully automatic automation”, we’ve introduced a concept of “Human Authorized Execution”. This means that a tool automatically finds problems and decides how to fix them, but a human operator is required to audit the tool’s decisions before the automation may run.

Why do we need this? Frankly, it’s terrifying to have automation run fully automatically. With a human involved, their subconscious can answer a really important question: Why might I NOT want to run this script? If we took a simple approach, for instance deploying a cron job to run our scripts whenever alerts fire, then we would lose that human’s sense of paranoia and danger.

At Dropbox, we’ve built an alert auto-remediation platform called Naoru, which forces us to build our automation in a way that adheres to these principles. In this talk we will discuss the thought process we bring towards building trustworthy automation, how Naoru forces our engineers to follow these philosophies, and how we’ve driven our infrastructure organization towards a culture of embracing trustworthy automation.
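
The "Human Authorized Execution" idea described above can be sketched roughly as follows: automation detects a problem and queues a proposed fix, but nothing runs until a named operator signs off. `ProposedFix` and `ApprovalQueue` are hypothetical names invented for this illustration. This is a sketch of the concept under those assumptions, not Naoru's actual design or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProposedFix:
    alert: str                 # alert that triggered the proposal
    plan: str                  # human-readable description of the fix
    action: Callable[[], str]  # the remediation itself, not yet executed
    approved_by: str = ""      # filled in when an operator signs off


@dataclass
class ApprovalQueue:
    pending: List[ProposedFix] = field(default_factory=list)

    def propose(self, fix: ProposedFix) -> None:
        """Automation decides what to do, but only queues the decision."""
        self.pending.append(fix)

    def approve_and_run(self, index: int, operator: str) -> str:
        """Run a queued fix only after a named human has reviewed it."""
        if not operator:
            raise PermissionError("a human operator must authorize execution")
        fix = self.pending.pop(index)
        fix.approved_by = operator
        return fix.action()
```

Keeping the plan as readable text next to the action gives the operator something to audit, which is where the question "why might I NOT want to run this?" gets asked.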

Speakers

David Mah

Dropbox
David Mah is a Site Reliability Engineer at Dropbox who has built several monitoring mechanisms across Dropbox’s block storage and server file system infrastructure. He is also the author of Dropbox’s auto-remediation infrastructure.


Wednesday July 13, 2016 14:20 - 14:40 IST
Lansdowne

14:40 IST

Break with Refreshments
Wednesday July 13, 2016 14:40 - 15:00 IST
Pre-Function Area

15:00 IST

Data Privacy Legislation and the Impact on SRE
Speakers

John Looney

SREconEU Program Chair


Wednesday July 13, 2016 15:00 - 15:40 IST
Lansdowne+Pembroke

15:40 IST

Techniques and Tools for a Coherent Discussion about Performance in Complex Architectures
Most applications today have separate networked services numbering in the tens to hundreds, especially with the growing popularity of microservices. Crossing the boundary between these services often means a change in team and even a change in programming language. In this session I will discuss the challenges this presents, why it is important to have a single engineering conversation about performance, and how we can accomplish this.

Speakers

Theo Schlossnagle

Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the Founder of Message Systems, Inc. now Sparkpost; and researched distributed... Read More →


Wednesday July 13, 2016 15:40 - 16:20 IST
Lansdowne+Pembroke

16:20 IST

Government Needs SRE
In 2013, Mikey Dickerson joined what became known as the “ad hoc” team, tasked with rescuing HealthCare.gov after its failed launch on October 1. In August 2014, President Obama established the United States Digital Service and appointed Mikey to serve as the Administrator to see if the strategy that succeeded at pulling HealthCare.gov out of the fire could be applied to other government problems. Now nearly two years old, with about 150 people spanning a network of federal agencies, the U.S. Digital Service has taken on immigration, education, veterans' benefits, and health data interoperability. The U.S. Digital Service is helping agencies build effective government services and improve IT procurements by focusing on industry best practices and agile methodology, ultimately driving change in the largest institution in history. Prior to joining the U.S. Digital Service, Mikey worked as a Site Reliability Manager at Google.

Speakers

Mikey Dickerson

In 2013, Mikey Dickerson joined what became known as the “ad hoc” team, tasked with rescuing HealthCare.gov after its failed launch on October 1. In August 2014, President Obama established the United States Digital Service and appointed Mikey to serve as the Administrator to... Read More →


Wednesday July 13, 2016 16:20 - 17:00 IST
Lansdowne+Pembroke
 