If Rip Van Winkle had gone to sleep around 2006 and woken up 10 years later, he'd find the world a strange brew of the new and the old. He'd be amazed that phones had grown a brain, dismayed that a most excellent rendition of the Dark Knight had wandered back to the wasteland, as most Dark Knight capers do. People had warmed up to electric cars, but not to climate change. And if Ol' Rip were a network operations guy at one of the large webscale companies, he might think he'd died and woken up in heaven. Networks were no longer slow as molasses to deploy, manage, and upgrade. He'd find some things had stayed the same (IPv4 still ruled the roost), and others not so much. He would be puzzled by the terminology and the discussions as he wandered the hallways: SDN, open networking, OpenFlow, microservices, Ansible, Puppet, Kubernetes, and so on.
This tutorial is an attempt to bring folks up to speed on what's happened with networking in the past 10 years or so, especially in the data center, concluding with some thoughts on why exciting times lie ahead. The talk will be roughly divided into the following sections:
The tutorial will include demos and hands-on work with some modern tools.
The audience is expected to be familiar with basic networking (bridging, routing, broadcast, multicast, etc.).
The key takeaways from this talk will be:
Some preliminary ideas for hands-on work:
eBPF (extended Berkeley Packet Filter) is a modern kernel technology that can be used to introduce dynamic tracing into a system that wasn't prepared or instrumented in any way. The tracing programs run in the kernel, are guaranteed never to crash or hang your system, and can probe every module and function -- from the kernel to user-space frameworks such as Node and Ruby.
In this workshop, you will experiment with Linux dynamic tracing first-hand. First, you will explore BCC, the BPF Compiler Collection, which is a set of tools and libraries for dynamic tracing. Many of your tracing needs will be answered by BCC, and you will experiment with memory leak analysis, generic function tracing, kernel tracepoints, static tracepoints in user-space programs, and the "baked" tools for file I/O, network, and CPU analysis. You'll be able to choose between working on a set of hands-on labs prepared by the instructors, or trying the tools out on your own test system.
Next, you will hack on some of the bleeding-edge tools in the BCC toolkit and build a couple of simple tools of your own. You'll be able to pick from a curated list of GitHub issues for the BCC project, a set of hands-on labs with known "school solutions", and an open-ended list of problems that need tools for effective analysis. At the end of this workshop, you will be equipped with a toolbox for diagnosing issues in the field, as well as a framework for building your own tools when the generic ones do not suffice.
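To give a flavor of what building such a tool involves, here is a minimal sketch using BCC's Python bindings. It is illustrative only: the traced syscall is arbitrary, and it assumes a host with the bcc package installed and root privileges.

```python
#!/usr/bin/env python
# A minimal BCC tool sketch, in the spirit of the workshop labs: print a
# line whenever any process calls clone().
from bcc import BPF

prog = """
int on_clone(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""

b = BPF(text=prog)
# Let BCC resolve the kernel symbol for the clone syscall, since the
# exact function name varies across kernel versions.
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="on_clone")
print("Tracing clone()... hit Ctrl-C to end")
b.trace_print()
```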
In many companies, including Microsoft, SRE is not yet an integrated part of the operational landscape; instead, it is being actively adapted into mature organizations. Our team has been working to develop new and interesting ways to introduce SRE and its tenets to an organization with many different operational approaches, ranging from IT Ops to DevOps.
The process of introducing SRE has proven to be quite complex and socially delicate: you can't just walk into a team and tell them they're doing things wrong. You need to find the right way to show developers all the warts on their baby and motivate them to work with you on addressing those warts. Furthermore, you have to deal with their earnest desire to treat you as "just another ops team" that is only there to take the pager from them.
One of the tools we've used to enable the right conversations is to hold what we call a Service Roast. Named after the famous Friars Club roasts, the goal is to establish a safe environment in which to dig into and expose those warts, wrinkles, design flaws, shortcomings, and problems everyone knows a service has but doesn't want to talk about. We can't help you if you won't tell us where it hurts.
To run these Service Roasts, we've developed a process, ground rules, a new impartial-moderator role, and a useful structure for hosting this kind of meeting. So far we've gained great insight into some of our services and, more importantly, sparked some very interesting (and lively) conversations.
To be sure, this is a high-risk activity and shouldn't be undertaken without careful consideration of the teams participating, but we'll present what we've learned about holding these roasts, the guidance teams need to participate successfully, and (importantly) why we don't use this approach everywhere.
Monitoring and dashboarding systems are crucial to understanding the behavior of large distributed systems. But monitoring systems can also lead you on wild goose chases or hide issues. In this talk, I will look at examples of how a monitoring system can lie to you, in order to sensitize the audience to these failure modes and encourage them to look for similar examples in their own systems.
Imagine you're tackling one of these evasive performance issues in the field, and your go-to monitoring checklist doesn't seem to cut it. There are plenty of suspects, but they are moving around rapidly and you need more logs, more data, more in-depth information to make a diagnosis. Maybe you've heard about DTrace, or even used it, and are yearning for a similar toolkit, which can plug dynamic tracing into a system that wasn't prepared or instrumented in any way.
Hopefully, you won't have to yearn much longer. eBPF (extended Berkeley Packet Filter) is a kernel technology that enables a plethora of diagnostic scenarios by introducing dynamic, safe, low-overhead, efficient programs that run in the context of your live kernel. Sure, BPF programs can attach to sockets; but more interestingly, they can attach to kprobes and uprobes, static kernel tracepoints, and even user-mode static probes. And modern BPF programs have access to a wide set of instructions and data structures, which means you can collect valuable information and analyze it on the fly, without spilling it to huge files and reading them from user space.
In this talk, we will introduce BCC, the BPF Compiler Collection, which is an open set of tools and libraries for dynamic tracing on Linux. Some tools are easy and ready to use, such as execsnoop, fileslower, and memleak. Other tools such as trace and argdist require more sophistication and can be used as a Swiss Army knife for a variety of scenarios. We will spend most of the time demonstrating the power of modern dynamic tracing -- from memory leaks to static probes in Ruby, Node, and Java programs, from slow file I/O to monitoring network traffic. Finally, we will discuss building our own tools using the Python and Lua bindings to BCC, and its LLVM backend.
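As a taste of the on-the-fly aggregation described above, here is a hedged sketch modeled on BCC's published histogram examples (symbol names vary by kernel; assumes bcc and root): it summarizes read() return sizes in-kernel as a log2 histogram instead of logging every event.

```python
#!/usr/bin/env python
# Sketch: aggregate read() sizes in-kernel and print a histogram, so no
# per-event data is spilled to user space.
from time import sleep
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>
BPF_HISTOGRAM(dist);

int on_read_return(struct pt_regs *ctx) {
    long bytes = PT_REGS_RC(ctx);      // read() return value
    if (bytes >= 0)
        dist.increment(bpf_log2l(bytes));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kretprobe(event=b.get_syscall_fnname("read"), fn_name="on_read_return")
print("Aggregating read() sizes for 10 seconds...")
sleep(10)
b["dist"].print_log2_hist("bytes per read")
```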
API Management—Why Speed Matters
Arianna Aondio, Varnish Software
Reverse Engineering the “Human API” for Automation and Profit
Nati Cohen, SimilarWeb
What a 17th Century Samurai Taught Me about Being an SRE
Caskey L. Dickson, Microsoft
Chatops/Automation: How to get there while everything's on fire
Fran Garcia, Hosted Graphite
Sysdig Love
Alejandro Brito Monedero, Alea Solutions
Automations with Saltstack
Effie Mouzeli, Logicea, LLC
Myths of Network Automation
David Rothera, Facebook
DNS @ Shopify
Emil Stolarsky, Shopify
Hashing Infrastructures
Jimmy Tang, Rapid7
This workshop is a part of the "full lifecycle" workshop track, which includes Post-Mortems, Incident Response, and Effective Design Review Participation. Using several example cases, participants in this session will learn to apply a variety of points of view to analyze a design for issues that could affect its reliability and operability.
The sample designs and playlist can be found at https://goo.gl/VIiN6i - now updated with the comments and suggestions that came in during the workshop.
Do you have services whose owners claim they run at five 9's, yet you often run into errors? It's very easy and convenient to build metrics at the service level, but these often hide a wide array of issues that users actually face. Having the right metrics is a key component of building a sustainable SRE culture. This talk goes into the design of these metrics, with real-world examples to illustrate good and bad designs.
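To make the gap concrete, here is a back-of-envelope sketch (all numbers hypothetical) of why a service-side metric can report five 9's while the user-perceived availability is far worse: requests that never reach the service are invisible to it.

```python
# Hypothetical numbers: the service only counts requests it actually saw.
served_ok = 9_999_900
served_err = 100
dropped_before_service = 5_000   # LB timeouts, DNS failures, TLS errors, ...

service_sli = served_ok / (served_ok + served_err)
user_sli = served_ok / (served_ok + served_err + dropped_before_service)
print(f"service-side availability: {service_sli:.5%}")  # 99.99900%
print(f"user-side availability:    {user_sli:.3%}")     # ~99.949%: users see 51x more errors
```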
Lightning Talks session
BoFs may be scheduled in advance by contacting bofs@usenix.org with "SREcon16 Europe BoF" in the subject line and the following information in the body of the email:
Historically, the network has lacked the skills, the tools, and even the means to fully embrace automation or build abstractions for the rest of the organization to consume. The tide is changing, however: most modern equipment now provides standard Linux tools or open APIs to interact with it.
In this talk, we will explore how to build network abstractions and leverage the experience the DevOps community has gathered over the years to expose the network to the organization, increase agility, and provide situational awareness.
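As one illustration of the kind of abstraction we mean, here is a minimal sketch using the open-source NAPALM library, which hides vendor-specific details behind a common API. The device names and credentials are hypothetical, and the exact import path depends on the NAPALM version installed.

```python
# Sketch: one code path, many vendors. NAPALM picks a driver per
# platform and exposes a uniform API on top of it.
from napalm import get_network_driver

def get_facts(hostname, platform, user, password):
    driver = get_network_driver(platform)   # e.g. "eos", "junos", "ios"
    device = driver(hostname, user, password)
    device.open()
    try:
        return device.get_facts()           # same dict shape on every vendor
    finally:
        device.close()

# Hypothetical inventory; in practice this would come from a source of truth.
for host, platform in [("spine1.example.net", "eos"), ("edge1.example.net", "junos")]:
    facts = get_facts(host, platform, "automation", "secret")
    print(host, facts["os_version"], facts["uptime"])
```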
Participants will have the opportunity to try their hand at designing a reliable, distributed, multi-datacenter, near-real-time log processing system.
The session will start with a short presentation on lessons learned about designing reliable distributed systems. Participants will then break out into small groups, assisted by Google facilitators, and work through a real-world design challenge, from high-level architecture down to an estimate of the computing resources required to run the service.
The session will likely appeal to experienced engineers who want to have fun tackling a real-world design problem faced by many teams in Google.
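For a flavor of the back-of-envelope arithmetic the groups end up doing, consider a sketch like the following, in which every number is a made-up assumption rather than a Google figure:

```python
# Hypothetical sizing for a multi-datacenter log processing pipeline.
logs_per_sec = 2_000_000        # assumed global log line rate
bytes_per_line = 500            # assumed average line size
replicas = 3                    # copies kept across datacenters
lines_per_core_sec = 50_000     # assumed per-core processing throughput

cores = logs_per_sec * replicas / lines_per_core_sec
ingress_gbps = logs_per_sec * bytes_per_line * 8 / 1e9
storage_tb_day = logs_per_sec * bytes_per_line * replicas * 86_400 / 1e12

print(f"~{cores:.0f} cores, ~{ingress_gbps:.0f} Gb/s ingress, "
      f"~{storage_tb_day:.0f} TB/day stored")
```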
Fuzz testing has been used to evaluate the robustness of operating system distributions for over twenty years. Eventually, though, a fuzz test suite suffers from reduced effectiveness. The first obstacle is the pesticide paradox: as you fix the easy defects, it becomes harder to find the remaining obscure ones. In addition, test execution and the debug/fix cycle tend to involve manual work that can take hours or even days of effort. This presentation introduces a structured framework for creating new fuzz tests, along with a competitive analysis approach used to minimize defect reproduction complexity.
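As a baseline for what such a framework automates away, here is a minimal hedged sketch of a random-input fuzzer; the target binary and limits are hypothetical. Note how recording the seed keeps every crash trivially reproducible, which is exactly the reproduction-complexity problem the talk addresses.

```python
# Minimal seed-driven fuzzer sketch: feed random bytes to a target on
# stdin and report inputs that hang it or kill it with a signal.
import random
import subprocess
import sys

def fuzz_once(target, seed, max_len=4096):
    rng = random.Random(seed)        # the seed makes this run reproducible
    data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
    try:
        proc = subprocess.run([target], input=data,
                              capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        print(f"hang: seed={seed} len={len(data)}")
        return
    if proc.returncode < 0:          # negative return code = killed by a signal
        print(f"crash: seed={seed} signal={-proc.returncode} len={len(data)}")

if __name__ == "__main__":
    for seed in range(10_000):
        fuzz_once(sys.argv[1], seed)
```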
As infrastructure becomes more complex, dynamic, and diverse, service discovery becomes increasingly important.
There are many solutions to this problem (Thrift, rest.li, custom-zk, etc.), all of which require application changes, which precludes the use of off-the-shelf software.
We have applications at LinkedIn where it isn't practical to integrate with our internal service discovery systems. After some thought, we realized that all of these applications already support a common service discovery system: our old friend DNS.
In this presentation, we'll talk about how we implemented a distributed, highly available, eventually consistent service discovery system using DNS, written in Go. We'll cover the design, the implementation, and the challenges encountered on the way to production.
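LinkedIn's Go implementation is not public, but the core idea fits in a few lines. Here is an illustrative Python sketch using the dnslib library, with a toy in-memory registry standing in for the real replicated store; all names are hypothetical.

```python
# Sketch: answer A queries for service names from a registry, with a
# short TTL because instances come and go.
from dnslib import RR, QTYPE, A
from dnslib.server import BaseResolver, DNSServer

# Toy registry; the real system replicates this state across datacenters.
REGISTRY = {
    "myservice.discovery.example.com.": ["10.0.0.5", "10.0.0.6"],
}

class DiscoveryResolver(BaseResolver):
    def resolve(self, request, handler):
        reply = request.reply()
        qname = str(request.q.qname)
        if request.q.qtype == QTYPE.A:
            for ip in REGISTRY.get(qname, []):
                reply.add_answer(RR(request.q.qname, QTYPE.A,
                                    rdata=A(ip), ttl=30))
        return reply

if __name__ == "__main__":
    server = DNSServer(DiscoveryResolver(), port=5353, address="0.0.0.0")
    server.start()   # serve until interrupted
```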
We'll focus on:
The Results:
For several years, Google's internal surveys identified the lack of trustworthy, discoverable documentation as the #1 problem impacting internal developer productivity. We're not alone: Stack Overflow's 2016 survey ranked "Poor documentation" as the #2 problem facing engineers.
Solving this problem is tough. It's not enough to build tooling; the culture needs to change. Google internal engineering is attacking the challenge in three ways: building a documentation platform, integrating that platform into the engineering toolchain, and building a culture where documentation, like testing, is accepted as a natural, required part of the development process.
In this talk, we'll share our learnings and best practices around both tooling and culture, the evolution of documentation, and some thoughts about how we can move from the creation of documents toward an ecosystem where context-appropriate, trustworthy documentation is reliably and effortlessly available to the engineers who need it.