Resilience Bites #3 - What the Internet said last month!
Feb 28, 2025
In this February edition of Resilience Bites, I’ve curated a selection of noteworthy discussions and findings related to resilience from this past month.
I hope you find these highlights insightful and that they inspire you to explore the world of resilience engineering further.
My favorites (my 'coups de cœur') have one or two ❤️ next to them.
Happy reading!
Blogs highlights
How Monzo Bank Built a Cost-Effective, Unorthodox Backup System to Ensure Resilient Banking by Eran Stiller ❤️
Such a great example of heterogeneous redundancy. It is not a new idea, but seeing it implemented and used in the real world is awesome, so thanks for sharing.
Here is a great paper on the topic applied to circuit boards. The principle is the same, though.
Chaos Engineering in the Age of AI: Surfacing Hidden Complexity by Adrian Hornsby
“The rise of AI in software development presents a fascinating paradox. While AI tools make it easier than ever to generate complex systems rapidly, they also make it harder to understand how these systems actually work.”
MTTR Is (Still) Lying to You ❤️ by Courtney Nash
“Software organizations tend to value measurement, iteration, and improvement based on data. These are great things for an organization to focus on; however, this has led to an industry practice of calculating and tracking Mean Time to Resolve, or MTTR. While it’s understandable to want to have a clear metric for tracking incident resolution, MTTR is problematic for a number of reasons.”
When AI Makes the Call - Questions About Meta-Operators and System Responsibility by Adrian Hornsby
“As AI increasingly helps us build complex software systems, a new type of tool is emerging: AI meta-operators. Meta-operators are AI agents designed to supervise and manage other AI systems and software.”
Best Simple System for Now ❤️ by Daniel Terhorst-North
“The Best Simple System for Now is the simplest system that meets the needs of the product right now, written to an appropriate standard. It has no extraneous or over-engineered code, and any code it does have is exactly as robust and reliable as it needs to be, neither more nor less.”
Investing in chaos engineering as a strategic necessity ❤️❤️ by Adrian Hornsby
I am really proud of that one. It is long, but it is the culmination of a decade's worth of chaos engineering.
DORA scenario testing with AWS Fault Injection Service ❤️ by Haresh Nandwani and Adrian Hornsby
“This blog post outlines how you can use AWS Fault Injection Service (FIS) to support the DORA requirements around scenario-based testing through a structured, iterative process of identifying failure scenarios, planning and executing chaos engineering experiments, reporting on the results, and using the information learned to improve operational resilience.”
Most Companies Experience Weekly Outages: The State of Resilience 2025 Report by Rafal Gancarz
“According to The State of Resilience 2025 Report, published by Cockroach Labs, outages are commonplace in most organizations, with 55% of companies reporting weekly and 14% reporting daily outages. Staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of USD 1 million or higher over the last 12 months.”
How Locking, Saturation and CDN Network Issues Brought down Canva ❤️ by Renato Losio
“The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.”
Restrict Mutability of State: When it is not necessary to change, it is necessary not to change by Kevlin Henney
“What appears at first to be a trivial observation turns out to be a subtly important one: a great many software defects arise from the (incorrect) modification of state. It follows from this that if there is less opportunity for code to change state, there will be fewer defects that arise from state change!”
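As a minimal illustration of the principle (not taken from the article; the `Account` type, its fields, and the `deposit` helper are hypothetical), a frozen dataclass in Python removes the opportunity to change state in place, so an update has to produce a new value:

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Account:
    """Hypothetical immutable record; fields cannot be reassigned."""
    owner: str
    balance: int


def deposit(account: Account, amount: int) -> Account:
    # Return a new value instead of modifying the existing one.
    return replace(account, balance=account.balance + amount)


original = Account(owner="alice", balance=100)
updated = deposit(original, 50)

try:
    # An accidental in-place change is rejected at runtime.
    original.balance = 0
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Any code still holding `original` keeps seeing the value it was given, which is the kind of defect-by-modification the quote is pointing at.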
You’re missing your near misses ❤️ by Lorin Hochstein
“Because most of our incidents are novel, and because near misses are a source of insight about novel future incidents, if we are serious about wanting to improve reliability, we should be treating our near misses as first-class entities, the way we do with incidents.”
Resilience: some key ingredients ❤️ by Lorin Hochstein
“Brian Marick posted on Mastodon the other day about resilience in the context of governmental efficiency. Reading that inspired me to write about some more general observations about resilience.”
Briefs from the social web
Tudor Girba - “If you denote your system as legacy, chances are you perceive it as an impediment from a business point of view. Well, your legacy can become an opportunity.”
Ilya Bezdelev - “After the massive S3 outage in 2017, the COE showed that an engineer made a typo in a command that removed thousands of instances, causing half of the internet to go down. In the aftermath of the incident, all of us at AWS received a memo that effective immediately, teams must review their processes and systems and automate everything that they can automate. If a command has to be run manually, it has to be run by two people. It was dubbed the "2P rule" jokingly referred to as a "to pee" rule, as in "I need you 2P with me…”
Laurent Domb - “I am pleased to share that we've added new experiment capabilities to run network faults in ECS Fargate environments as part of our AWS Chaos Engineering Workshop using the AWS native Fault Injection Service!”
Sam Newman - “This is a fascinating writeup by Eran Stiller over at InfoQ of Monzo Bank's approach to ensuring that critical banking functionality is still available even in the face of major outage: https://lnkd.in/d545hsMJ. Thanks also to Monzo's Daniel Chatfield for sharing his thoughts with Eran, to make this such an insightful read.”
Matthias Patzak - “"You build it, you run it" isn't just a catchy phrase. It's the core principle that transforms team accountability and software quality…”
Jonathan Courtney - “Most people want to sit around and criticise without ever putting themselves out there. Those people rarely get what they want. They don't find customers, people don't use their products.”
Lee Hannigan - “Imagine launching a globally replicated database without worrying about infrastructure, failover, or operational overhead. With Amazon DynamoDB, it’s as simple as this.”
Mike Rayo, PhD - “the most important ingredient to beginning and sustaining a New Look/New View/Safety II/Safety Differently/Capability-based/Adaptation-based aspect for your Systems Performance […] is understanding and supporting what their people are doing RIGHT NOW to keep the system running, and keep it running safely.”
Stephen Whitworth - “An investor asked me recently: "what will the impact of tools like Cursor, Lovable, etc, be on software reliability?" Good question. Here's what I think will happen.”
Jay Gengelbach - “Terms every engineer should know: Chesterton's Fence …”
Joao Neto - “When something breaks, the first reaction is often: "Why didn’t we test for that? We need more tests!" This has always boggled my mind. Testing is seen as the primary way to achieve quality, but adding more tests rarely addresses the root cause - it just patches the symptoms.”
itronitron - “Improving efficiency in a system generally reduces its resilience, so improving efficiency is definitely not always a good thing.”
ColinWright - “There's an important point in here: "Waste" isn't always waste ... sometimes it's built-in resilience. I remember working on a communications system where management were hell-bent on squeezing every last iota from the capacity. We compressed, we coordinated, we worked on making sure the system had no "waste". Then when someone inadvertently sent something they didn't expect, the entire system ground to a halt, unable to adapt, unable to limp along until it got sorted. Spare capacity is essential, and only looks like waste to those who are insufficiently grounded in relevant engineering principles.”
Podcasts
The VOID podcast
Episode 7: When Uptime Met Downtime
“We took a bit of a hiatus from recording last year, but we're back with an episode that I think everyone is really going to enjoy. Late last year, John Allspaw told me about this new company called Uptime Labs. They simulate software incidents, giving people a safe and constructive environment in which to experience incidents, practice what response is like, and bring what they learn back to their own organizations.”
Episode 8: A Tale of a Near Miss (upcoming)
The “this is fine!” podcast
Episode 9 - Learning from Incidents with special guest Alex Elman ❤️
I really enjoyed this one. Alex’s humble and inclusive tone was refreshing.
Episode 10 - When They go Full ITIL on You w/special guest John Allspaw ❤️
Video of the month
GitOps Best Practices Every DevOps Team Should Follow in 2025 by Christian Hernandez ❤️❤️
This GitOps presentation by Christian Hernandez is exceptional and worth every second of your time. Even seasoned operators will get some value out of it!
Upcoming Conferences
SREday London 2025 Q1 https://www.papercall.io/sreday-2025-london-q1
SREday San Francisco 2025 Q2 https://www.papercall.io/sreday-2025-san-francisco-q2
SREday Redmond 2025 Q2 https://www.papercall.io/sreday-2025-redmond-q2
Conf42.com Site Reliability Engineering (SRE) 2025 https://www.papercall.io/conf42-site-reliability-engineering-sre-2025
Regional Scrum Gathering Dhaka 2025 https://www.papercall.io/dhakaregional
DevOpsDays Medellin 2025 https://www.papercall.io/dodmde2025
SREday Cologne 2025 Q2 https://www.papercall.io/sreday-2025-cologne-q2
Conf42.com Incident Management 2025 https://www.papercall.io/conf42-incident-management-2025
Conf42.com DevSecOps 2025 https://www.papercall.io/conf42-devsecops-2025
SREday London monthly MEETUP - ongoing CFP https://www.papercall.io/sreday-london-meetup