In this episode, we sit down with Olga Hall, Global Head of Amazon Video Availability and Scale, to discuss Chaos Engineering, experimentation, and how to do it at scale. You won’t want to miss this episode as we learn from Olga about availability fairytales and how failure really is inevitable. We also give some inside scoop on what is happening at Chaos Community Day London.
In this episode, we sit down with Dr. Peter Alvaro of Disorderly Labs and discuss chaos, failure, academics, and computer science. The topics range from data provenance to fault injection but hinge on how Peter’s LDFI research is impacting the field of chaos engineering. Also, Casey tries to talk Peter into using his research to pull off a bank heist, and of course, James tries to be the voice of reason.
In this episode, we experience a royal reception with Crystal Hirschorn, the Director of Engineering at Snyk, to talk about chaos engineering and how to put those practices to work in organizations of any size. Crystal shares advice with us on how to get VPs and others on board with chaos engineering, which looks quite different from bringing engineers along on the journey.
Throughout the conversation, you will see how chaos engineering has matured since the early days of Netflix’s Chaos Monkey and how it is practiced at some of the largest organizations across the globe. Crystal discusses how resilience engineering moves out of academia and can be put into practice in real organizations. It helps us understand socio-technical systems and how they come into play when we write software.
In this episode, we are joined by John Allspaw, the Co-Founder of Adaptive Capacity Labs.
John instructs us on how to deal with the bad apples in our organizations, and debates whether
or not we can find root cause and human error—whether the person is embodied or not. If you
are interested in accident investigations, uncovering truth in complex systems, or using
heuristics and correlation to find meaning, then this is the episode for you. Special
appearances include Dr. Richard Cook and John’s sister, Sue Allspaw, asking hard-hitting
questions that leave John speechless.
In this episode, Julie explains how communication fits into your DevOps (and Chaos Engineering)
efforts and she dissects the misuse of ITIL and root cause thinking as challenges to building
a learning organization. After hearing her talk “The Psychology of Chaos Engineering” at DevOpsDays
Raleigh, we knew she had to come on the show. Julie explains how to approach digital and cloud
transformations and delivers some hilarious one-liners, making this a show to remember.
Russ Miles is a prolific writer and speaker on Chaos Engineering, especially well-known as
an evangelist of the discipline in his home town of London. James and Casey discuss his chapter
in Chaos Engineering: System Resiliency in Practice, which is titled “Open Minds, Open Science,
and Open Chaos.” And as if that isn’t a mouthful, we discuss his previously published works and
his Chaos Engineering-themed tour of the United States.
Our special guest Nora Jones and our host Casey Rosenthal are the two authors of O’Reilly’s
Chaos Engineering: System Resiliency in Practice,
the definitive book on the subject. We ask Nora about learning from incidents, how to
facilitate a healthy Chaos Engineering program, the path that led to her writing this book
with Casey, how her thinking has changed over the years, and her startup Jeli.
Introducing the first contributing author from Chaos Engineering: System Resiliency in Practice
in our series on the book: Nathan, who wrote Chapter 17, “Let’s Get Cyber-Physical.”