Tuesday, July 12 • 16:20 - 17:00
Downtime Budgets

The concept of the error budget is a great way to hack SLAs and make them into a positive tool for system engineers. But how can you take the same idea from a world that handles millions of transactions in a day to one that handles hundreds, but on the same hardware scale? High Performance Computing (HPC) jobs run for hours, days, or weeks at a time, resulting in unique challenges related to system availability, maintenance, and experimentation. In this talk, I will explore how we plan to modify the error budget concept to fit an HPC environment by applying the same idea to cluster outages. With that specific example as the foundation, I will conclude with some thoughts on how the ideas generated in large-scale web environments can be used in similarly sized environments running very different workloads.

Speakers
Cory Lueninghoener

Wireless Couch Labs
Cory Lueninghoener leads the HPC Design Group at Los Alamos National Laboratory. He has helped design, build, and manage some of the largest scientific computing resources in the world, including systems ranging in size from 100,000 to 900,000 processors. He is especially interested...


Tuesday July 12, 2016 16:20 - 17:00
Pembroke Room
