Guest Lecture: SRE, Service Levels, and Error Budgets

Guest Speaker: Mr. Adam Brady

Time: Wednesday, Apr 6, 2022. 2:30pm - 3:45pm Central.

Location: Zoom link is posted on Piazza.

For CS4278/5278 students (this applies only to students in Dr. Yu Huang's session), you are required to change your Zoom username to "$VUID-$NAME-CS4278" (e.g., huany47-Yu Huang-CS4278).

Abstract

Software Engineering as a discipline often focuses on designing and building software rather than operating and maintaining it. Despite this, between 40% and 90% of the total cost of software is incurred after launch, and major software outages still make national news. This talk introduces practical concepts for how to reason about software reliability, and how Google formed a specialized job role that treats software operations as a software engineering problem. Site Reliability Engineering (SRE) is a set of practices, guidelines, and culture that help keep software running, but can also help regular software engineers build cool things with reliability in mind.

About Adam Brady

Adam Brady is a Software Engineer (SWE) at Google Pittsburgh and a graduate of the Site Reliability Engineering "Mission Control" program. He has been on call for one of the largest pieces of machine learning training infrastructure in the world, and broke it for nearly three days. Somehow, he still works on the same system that he once broke, and has found his niche doing frontend work and building tools to manage the model development process. He received a B.S. in Computer Science from West Virginia University and an M.S. in Computer Science from the University of Virginia, thereby successfully earning degrees from both Virginias.