Wednesday, July 13 • 13:40 - 14:20
Active Fault Finding in Networks

One of the key principles of SRE is knowing your service is broken before your customers notice. Network devices are typically black boxes that fail in mysterious ways. Traditional methods of network monitoring don't scale well, and they share a fundamental flaw: most network monitoring involves asking a network device whether it's dropping packets. If that device is fundamentally unhealthy, should you really trust what it tells you about what it is doing?

Facebook has taken a different approach to network monitoring: we actively probe our datacenter networks from as many locations as possible to ensure the network is behaving as we expect. The interesting part comes when we detect loss between different hosts on the network. How do we discover which network device or individual link, among the thousands available, is the root cause of this loss? By combining some magic requests with some basic math, can we automate that detection and triangulate it to a specific device that does not even know it's dropping packets?

Richard Sheehan

Production Engineer, Facebook
Production Engineer at Facebook with lots of networking-related experience. Spent 10 years at Amazon working on DNS, load balancing, and CDN stuff. Never cried at my desk, not even once. Now build large-scale network monitoring and fault isolation solutions at Facebook.

Wednesday July 13, 2016 13:40 - 14:20 IST
Pembroke Room