Resilience Bites #2 - What the Internet said last month!
Feb 1, 2025
In this January edition of Resilience Bites, I’ve curated a selection of noteworthy discussions and findings related to resilience from this past month.
I hope you find these highlights insightful and that they inspire you to explore the world of resilience engineering further.
My 'coups de cœur' have one or two ❤️ next to them.
Happy reading!
Must-check highlights
I stumbled upon a great series of blog posts by Uwe Friedrichsen about Resilience. It's one of the best pieces I've read on this topic in a long time. He covers important basics that everyone needs to understand. Throughout the series, he compares building resilience to climbing a mountain. If you care about Resilience or if your organization is working on it, you really need to read this. ❤️❤️
Catchpoint's SRE 2025 report - Some interesting but not surprising points.
"Slow is the new down": 53% of organizations agree that poor performance is as bad as downtime.
For the first time in five years, toil has increased, to 30% from 25% in 2024, despite advances in automation and AI. SRE teams are still figuring out how to use AI.
Over two-thirds of respondents feel pressured to prioritize release schedules over reliability. This isn't surprising; I see it all the time. It is the prevention paradox, which I've talked about before, in action. Balancing rapid development against system stability is a difficult tradeoff for organizations: it means comparing creation with prevention, and the human brain isn't good at that. See my LinkedIn post on the prevention paradox.
40% of respondents reported handling between 1 and 5 incidents in the past 30 days.
Perceptions of reliability practices differ significantly between individual contributors and higher management. Again, I am not surprised: I often see reliability goals that are not aligned with business goals, which makes them difficult to justify.
The Danger of Overreaction by Lorin Hochstein (to read after the above) ❤️
“The decisions we make always carry risk because of the uncertainties: we just can’t predict the future well enough to understand how our actions will reshape the risks. Remember that the next time people rush to address the risks exposed by the last major incident. Because the fact that an incident just happened does not improve your ability to predict the future, no matter how severe that incident was.”
How Training and Awareness are Reshaping Cyber Defense Strategies by Victoria Gayton
A great example of the effectiveness of education in addressing issues like social engineering. This aligns closely with principles in resilience engineering education, which aims to prepare systems and individuals to anticipate, withstand, and recover from adverse events. And it works!
Evolution SRE Google - Using STAMP to improve resilience in Google production systems
Google's Site Reliability Engineering (SRE) has evolved to address increasing system complexity by adopting the System-Theoretic Accident Model and Processes (STAMP) framework, which emphasizes understanding and managing complex system interactions over preventing individual component failures. STAMP applies control theory principles to safety engineering, viewing accidents not as a chain of events but as complex interactions between system components, including human operators and software. A key takeaway related to resilience is that by applying systems theory and control theory, SREs can better anticipate potential failures and design safer, more reliable systems from the ground up.
Organizations are distributed systems by Malte Ubl
Reflections from his experience leading teams at Google and Vercel.
If you are in crisis mode, avoid these three Corporate Anti-Patterns by Sophie Seiwald-Højer ❤️
Great post on the importance of empowering teams, streamlining communication, and fostering a blameless culture to enhance organizational resilience - enabling better adaptation and recovery during crises.
“Crises reveal the true strengths and weaknesses of an organization.”
How a Regular Developer Found a Passion for Incident Management - This Reddit thread made me smile, and it has some great suggestions for anyone starting with incident management.
Reliability Engineer Shares How Businesses Can Manage High Availability and Improve Resilience by Aremu Adebisi
Alexandr Hacicheant, Head of Reliability Engineering at Mayflower, shares his experiences and insights into how companies can enhance resilience.
“Service reliability math that every engineer should know […] while service reliability is often reduced to a simple percentage, the reality is far more nuanced than those decimal points suggest.” by Addy Osmani
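The quote above does not spell out the math itself. As a rough illustration of the kind of arithmetic usually meant, here is a minimal Python sketch. The function names are hypothetical, and the composition formulas assume component failures are independent, which real systems rarely honor exactly.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime permitted over a period at a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * period_minutes


def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to be up.

    Assumes failures are independent, so the availabilities multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(component: float, replicas: int) -> float:
    """Availability when any one of N identical, independent replicas is enough."""
    return 1.0 - (1.0 - component) ** replicas


if __name__ == "__main__":
    # "Three nines" leaves roughly 8.76 hours of downtime per year.
    print(round(allowed_downtime_minutes(0.999) / 60, 2), "hours/year at 99.9%")
    # Three 99.9% dependencies in series end up below 99.9% overall.
    print(round(serial_availability(0.999, 0.999, 0.999), 6))
    # Two independent 99% replicas behave like roughly 99.99%.
    print(round(redundant_availability(0.99, 2), 6))
```

The serial case is the one that tends to surprise people: every dependency added to the critical path lowers the ceiling on the availability the whole service can reach.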
“The easier you make it to deploy less code more frequently and more safely will massively reduce your company's risk when things do go wrong because you will be able to isolate any bad changes and rollback/fix forward as required.” by David Denton
How Steadybit Enhances Chaos Engineering with AWS FIS by Summer Lambert
This diagram started as a joke but like... we now literally have a queue in front of every one of our back-end services and lambdas now at Plain by Matt Vagni
Why Most Organizations Are Missing The True Meaning Of Cyber Resilience ❤️
Many organizations misunderstand cyber resilience, focusing only on known risks and neglecting the importance of preparing for unforeseen threats. Practices like chaos engineering can help identify weaknesses and enhance response mechanisms, thereby strengthening an organization's overall resilience.
Enhance the resilience of critical workloads by architecting with multiple AWS Regions by John Formonto
Announcing upcoming changes to the AWS Security Token Service global endpoint
AWS is updating the Security Token Service (STS) global endpoint to automatically route requests to the same region as your workloads, enhancing resiliency by reducing dependency on a single region and improving performance through localized request handling.
Podcasts
Observability: the present and future with Charity Majors ❤️
Great, great discussion on the state and evolution of observability. Charity talks about how traditional static dashboards are insufficient for modern software systems, advocating instead for dynamic, interactive tools that let engineers engage with and query their data to understand system behavior. She also talks about the importance of observability in the context of AI-generated code.
“SLOs provide a budget for teams to run chaos engineering experiments.”
I wrote about Amazon Search using SLOs to run chaos engineering experiments here.
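To make the idea of an SLO "budget" concrete, here is a minimal Python sketch of a request-based error budget used as a gate for experiments. The `ErrorBudget` class, its fields, and the `reserve_fraction` policy are all hypothetical and chosen for illustration; this is not a description of how Amazon Search or any particular team does it.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served so far in the SLO window
    failed_requests: int  # requests that violated the SLO so far

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates for the traffic seen so far."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; a negative value means the SLO is already breached."""
        return self.allowed_failures - self.failed_requests

    def can_run_experiment(self, expected_failures: int,
                           reserve_fraction: float = 0.5) -> bool:
        """Allow an experiment only if it leaves a reserve of the budget untouched.

        reserve_fraction is an assumed policy knob: the share of the total
        budget kept back for real, unplanned incidents.
        """
        reserve = reserve_fraction * self.allowed_failures
        return self.remaining - expected_failures >= reserve


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000,
                         failed_requests=200)
    print(round(budget.allowed_failures))   # about 1000 failures tolerated
    print(budget.can_run_experiment(200))   # small blast radius fits the budget
    print(budget.can_run_experiment(400))   # would eat into the reserve
```

The design choice worth noting is the reserve: spending the whole budget on experiments leaves nothing for the incidents nobody planned.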
Rethinking business continuity before the next big IT outage by Beth Pariseau
“In other engineering disciplines -- including aerospace engineering -- resilience engineering and business resilience are better understood than they are in IT, particularly as the focus for technologists has trended toward velocity in the cloud era, Betz said. ‘We in the [IT] industry are still babies at doing this,’ he said. ‘IT needs to start adopting some of these practices from domains that understand this a lot better.’”
The “this is fine!” podcast
Episode 7 - AI and Resilience with Courtney Nash
I really enjoyed listening to this episode on AI and Resilience. I am currently writing a piece on AI and resilience, but from the angle of convention over configuration, complexity, and adaptive capacity, so it was really great to hear Courtney discuss some of that. Highly recommended.
Video of the month
Try again: The tools and techniques behind resilient systems by Marc Brooker - from re:Invent 2024
Grand architectural theories are nice, but what makes systems resilient is in the details. Marc Brooker, VP and Distinguished Engineer, looks at some of the resiliency tools and techniques AWS uses in its systems. Marc rethinks retries, breaks open circuit breakers, decodes erasure coding, and tackles the tail. Learn about formal methods and simulation and how these tools help build faster code faster.
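The abstract mentions retries without going into detail. As one commonly used approach, and not something taken from the talk itself, here is a minimal Python sketch of capped exponential backoff with full jitter. The function name, parameters, and defaults are assumptions for illustration; `sleep` and `rng` are injectable only so the behavior can be tested deterministically.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> T:
    """Call operation until it succeeds or max_attempts is reached.

    Between attempts, wait a random time between 0 and a cap that doubles
    on each failure (bounded by max_delay), so that many clients retrying
    at once do not hit the service in synchronized waves.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng(0.0, cap))
    raise AssertionError("unreachable")
```

A real client would normally retry only on errors known to be transient, and pair this with a limit on total retry traffic (a retry budget or a circuit breaker) so retries cannot amplify an overload.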
Upcoming Conferences
Chaos Carnival 2025 https://www.papercall.io/chaoscarnival2025
Conf42.com Chaos Engineering 2025 https://www.papercall.io/conf42-chaos-engineering-2025
SREday New York City 2025 Q1 https://www.papercall.io/sreday-2025-nyc-q1
SREday London 2025 Q1 https://www.papercall.io/sreday-2025-london-q1
SREday San Francisco 2025 Q2 https://www.papercall.io/sreday-2025-san-francisco-q2
Conf42.com Site Reliability Engineering (SRE) 2025 https://www.papercall.io/conf42-site-reliability-engineering-sre-2025
Regional Scrum Gathering Dhaka 2025 https://www.papercall.io/dhakaregional
DevOpsDays Medellin 2025 https://www.papercall.io/dodmde2025
SREday Cologne 2025 Q2 https://www.papercall.io/sreday-2025-cologne-q2
Conf42.com Incident Management 2025 https://www.papercall.io/conf42-incident-management-2025