Resilience Bites #14 - What the Internet said last month!


In this edition of Resilience Bites, I’ve curated a selection of great reads related to resilience from this past month, May 2025.

I have also added a few of the posts that I wrote for the Resilium Labs blog.

I hope you find these interesting and that they inspire you to explore the world of resilience further.

Happy reading!

Blog highlights

The Silicon Valley Way: Move fast and break…aviation safety? by David Woods, Mike Rayo, Shawn Pruchnicki

Brilliant analysis of how Silicon Valley's "move fast and break things" philosophy threatens aviation safety. Woods and colleagues show how aviation achieved ultra-safety through proactive safety engineering, creating foresight about changing risks before anyone gets harmed, which directly conflicts with the break-things-and-learn mindset. The SpaceX examples are devastating: their rockets explode, but only air traffic controllers' adaptive expertise prevents a wider catastrophe. The piece argues for resilience engineering approaches that can balance speed and thoroughness, citing NASA's Engineering and Safety Center as a model for learning across organizational boundaries. Must read.

Analyzing Metastable Failures by Rebecca Isaacs, Peter Alvaro, Rupak Majumdar, Kiran-Kumar Muniswamy-Reddy, Mahmoud Salamati, Sadegh Soudjani

This paper presents a toolkit for predicting when systems are vulnerable to metastable failures - situations where a system gets stuck in a degraded state even after the initial problem is resolved. The authors use an integrated approach from mathematical models to real stress testing, showing how engineers can identify the "tipping points" where their systems become vulnerable. Essential reading for understanding how cascading failures happen and how to prevent them before they occur.
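The "tipping point" dynamic is easy to reproduce in a toy model. Here is a minimal sketch (my own illustration, not from the paper) of a server whose clients retry on timeout: a brief load spike pushes the backlog past the point where retry amplification alone exceeds capacity, and the system stays degraded long after the spike ends.

```python
CAPACITY = 100      # requests the server completes per tick
BASE = 80           # steady-state new requests per tick
SPIKE = 150         # new requests per tick during the spike
RETRY_AMP = 2       # assumption: each timed-out request is retried twice
QUEUE_LIMIT = 1000  # requests beyond this are shed

def simulate(spike_start=10, spike_end=15, ticks=50):
    """Return the backlog over time for a brief overload spike."""
    backlog, history = 0, []
    for t in range(ticks):
        new = SPIKE if spike_start <= t < spike_end else BASE
        offered = min(new + backlog, QUEUE_LIMIT)
        done = min(offered, CAPACITY)
        failed = offered - done
        backlog = RETRY_AMP * failed  # retries re-enter the queue
        history.append(backlog)
    return history

# With no spike the backlog stays at zero; after a 5-tick spike the
# retry feedback loop keeps the system saturated indefinitely, even
# though the base load (80/tick) is below capacity (100/tick).
```

In this toy model the tipping point is a backlog of 40: below it retries drain, above it they compound. Real systems have the same structure with far messier parameters, which is why the predictive tooling the paper describes is useful.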

Too Soon or Too Late: The Incident Escalation Dilemma by Hamed Silatani

This tackles one of the most frustrating parts of incident response - figuring out when to actually call for help. Hamed argues that escalation timing isn't a solvable problem because it's always subjective, but reframes the question nicely: instead of asking "when is the right time?" ask "how do people decide to recruit help?" He breaks down the psychological, cultural, and organizational factors that influence these decisions, then offers practical strategies like simple rules ("if you don't know how to fix it in 20 minutes, get help") and building trust through relationships. While my take is simpler (escalate early and often, always, and make sure that message comes directly from leadership), it is still a good read.

Root Cause Analysis: PostgreSQL MultiXact member exhaustion incidents (May 2025) by the Metronome Engineering Team

This is a great example of technical transparency after a major outage. Metronome experienced four multi-hour write outages over a week due to PostgreSQL's obscure MultiXact member space exhaustion, a limit that's completely invisible to standard monitoring but can bring down massive databases. The team thought they had plenty of headroom based on their MultiXact ID metrics (50% utilization), but were actually hitting a different, undocumented limit entirely. Deep technical detail, honest admission of their misunderstanding, and they kept hitting the same issue because the root cause was so poorly understood. A good example of incident communication and learning from complex system failures.

The Quiet Erosion: How Organizations Drift Into Failure by Adrian Hornsby

This post uses a fictional e-commerce company called TrendCart to illustrate how organizations gradually drift from safe practices into failure through thousands of small, rational compromises. I talk about the invisible accumulation of risk: reducing test coverage for "non-critical" features, bypassing code reviews for "urgent" fixes, and accepting minor bugs as "business reality." Major incidents aren't caused by single catastrophic decisions but by the slow normalization of deviance. I also provide a practical framework for detecting and preventing drift before it's too late.

The Prevention Paradox: Why Successful Resilience Work Becomes Its Own Enemy by Adrian Hornsby

This post takes on why successful resilience teams get cut just when they're working best. A team prevents major outages for two years, then gets "rightsized" because leadership can't see the value of disasters that never happened. Recommended read if you've ever struggled to justify prevention work or watched good reliability practices get dismantled during budget cuts. I also provide a practical framework for making invisible work visible and calculating prevention ROI.

When Every Backup Fails: How a Small Contractor Re-Wrote Its BCP on the Fly after the 2011 Tōhoku Quake by Kosuke Nakazawa

Fascinating case study of adaptive resilience in action. When the 2011 tsunami made all their planned backup sites unusable, this small construction company set up an outdoor emergency operations center in their parking lot and resumed critical operations within an hour. Brilliant example of how real resilience isn't just about having backup plans, but developing the capability to create new plans when the old ones fail. The improvised "Energy Procurement Team" to solve fuel shortages shows genuine adaptive capacity under pressure.

Beyond Root Cause: A Better Approach to Understanding Complex System Failures by Adrian Hornsby

Post arguing against the "5 Whys" and traditional root cause analysis. I explain how these linear frameworks miss systemic issues by forcing complex failures into simple cause-and-effect stories. Recommended read if you've ever felt frustrated by post-mortems that blame individuals instead of examining system conditions. The "Trojan Horse" approach of gradually shifting teams from "root cause" to "contributing factors" is, in my opinion, a particularly effective way to institute change with minimal resistance.

Not causal chains, but interactions and adaptations by Lorin Hochstein

This is an excellent deep dive contrasting traditional root cause analysis with resilience engineering thinking. Lorin shows how RCA's linear "domino effect" model misses the reality that complex systems fail through unexpected interactions, not causal chains. It is highly recommended if you want to understand why the "find the root cause and fix it" approach often fails. Instead of eliminating fault generators, focus on understanding and strengthening your system's adaptive capacity to work around the latent failures that are always present.

When a bad analysis is worse than none at all by Lorin Hochstein

Brilliant use of the double-slit experiment to explain why root cause analysis can be actively harmful. Lorin explains how a technician who records only "which slit the electron went through" instead of the full intensity pattern would completely miss the wave nature of matter. Similarly, RCA forces complex incidents into simple cause-and-effect stories, discarding the messy data that reveals how systems actually fail. Highly recommended if you've ever felt that traditional post-mortems miss the real complexity of what happened.

Good Performance for Bad Days by Marc Brooker

Sharp critique of how performance evaluation misses the most important scenarios. Marc argues that academics focus too much on happy-path performance and ignore what happens when systems hit overload, which is exactly when real failures occur. Excellent piece on why understanding saturation behavior matters more than peak throughput numbers. Highly recommended if you've ever wondered why systems that benchmark beautifully still fail catastrophically under load. The connection to metastable failures and coordinated omission is particularly valuable.
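Coordinated omission, one of the failure modes Marc points to, is easy to demonstrate: a closed-loop benchmark that waits for each response before sending the next request silently drops exactly the samples that matter. A hedged sketch (my own toy model, not Marc's):

```python
FREEZE_START, FREEZE_END = 1000, 2000  # server frozen in this window (ms)
NORMAL = 1                              # normal service time (ms)

def respond(send_time):
    """Completion time for a request sent at send_time."""
    if FREEZE_START <= send_time < FREEZE_END:
        return FREEZE_END + NORMAL      # stuck until the freeze ends
    return send_time + NORMAL

def closed_loop(interval=10, end=3000):
    """Naive benchmark: wait for each response before the next send."""
    t, latencies = 0, []
    while t < end:
        done = respond(t)
        latencies.append(done - t)
        t = max(t + interval, done)     # can't send while waiting
    return latencies

def corrected(interval=10, end=3000):
    """Measure against the intended schedule: one send per interval."""
    return [respond(t) - t for t in range(0, end, interval)]

slow_naive = sum(lat > 100 for lat in closed_loop())
slow_fixed = sum(lat > 100 for lat in corrected())
# The closed-loop run records the 1-second stall as a single slow
# sample; measuring against the intended schedule reveals the ~90
# requests that stall actually delayed.
```

The benchmark that looks great in the happy path is the one hiding the saturation behavior, which is exactly Marc's point.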

What Directors Are Thinking: Curtis Stephens - Resilience in the Boardroom by Curtis Stephens

Good take on board-level resilience thinking from a practicing director. Curtis emphasizes that resilient boards don't just survive crises but find opportunities within disruptions. Solid practical advice on premortem frameworks, crisis management plans, and avoiding impulsive decisions under pressure. Recommended if you're interested in how resilience concepts translate to governance and strategic oversight. The focus on "building resilience muscle" through scenario planning is particularly valuable for board members.

Systems Correctness Practices at Amazon Web Services by Marc Brooker and Ankush Desai

Great overview of how AWS applies formal methods at scale, from TLA+ to the P programming language, property-based testing, and deterministic simulation. Excellent piece showing how theoretical formal methods translate into practical engineering benefits like performance optimizations and early bug detection. Highly recommended if you want to understand how a major cloud provider actually uses formal verification in production. The section on metastable failures and the connection between formal specs and testing oracles is particularly valuable.
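Property-based testing, one of the techniques the paper covers, can be sketched in a few lines of plain Python (AWS uses dedicated tooling; this only shows the shape of the idea): generate many random inputs and assert an invariant that must hold for all of them, here a round-trip property for a toy run-length encoder.

```python
import random

def encode(s):
    """Run-length encode a string into (char, count) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def decode(pairs):
    """Inverse of encode."""
    return "".join(ch * n for ch, n in pairs)

def check_roundtrip(trials=1000, seed=0):
    """Property: decode(encode(s)) == s for every generated input."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("ab") for _ in range(rng.randrange(20)))
        assert decode(encode(s)) == s, f"round-trip failed on {s!r}"
    return True
```

The property doubles as a testing oracle: any generated input that violates it is a bug report, which is the connection to formal specs the paper draws out.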

When Learning Goes Underground by Alex Nauda

I learned a new term today in the reliability space... "Incididnt (n.) - An internal event, including investigation and a retrospective, invisible to management and other teams, in which an ops team or service owner does not declare an incident officially, so as to avoid burdensome paperwork and/or blameful scrutiny, but still wants to capture ✨valuable learnings✨ and fix underlying issues."

We found risks after they exploded! by Tom Stena

Smart, practical post on proactive risk management. Tom shares how they shifted from reactive crisis management to early warning systems using premortems, risk ownership, red-flag rules, and regular check-ins. Recommended if you're tired of being blindsided by "surprises" that had warning signs. The emphasis on psychological safety ("I'll never punish someone for bringing up a risk") is important for getting early signals from teams.

"Genuine question for my fellow risk management geeks: What is the actual point of the inherent risk assessment? [...] Personally, I don't believe organisations are ever truly exposed to inherent risk because something is always being done to manage the risk, even if it hasn't been properly thought about or officially documented yet." by Thomas White

Chaos Engineering for AI Systems by Jenn Bergstrom

I really love what Jenn Bergstrom is doing. I am no AI expert, but it is fascinating to see where this is going. It will be particularly interesting to see how, dare I say, traditional chaos engineering tools evolve to support the AI use case. 'When something goes wrong, do our monitoring systems catch it, do the alerts fire properly, and can we recover gracefully?' That's the heart of chaos engineering right there. The AI space definitely needs this kind of resilience thinking, especially as these systems become more critical to operations. Looking forward to seeing how this develops.

Resilience Maturity Assessment (ReMA) tool by the UN Office for Disaster Risk Reduction

The UN Office for Disaster Risk Reduction just launched its free Resilience Maturity Assessment (ReMA) tool, developed with major corporations like Nestlé and KPMG. While it's encouraging to see resilience moving into mainstream organizational practice, the tool focuses primarily on traditional business continuity elements like policies, governance structures, and resource allocation, rather than the adaptive capabilities central to resilience engineering. I'm currently developing a complementary assessment framework based on Erik Hollnagel's four abilities (Monitor, Anticipate, Respond, Learn) that I'll be sharing soon.

Understanding Adaptive Business Continuity: Measuring Capabilities by Mark Armour

The organizations that recover best from crises don't follow plans—they adapt. Drawing lessons from WWII paratroopers to modern corporate disasters, this post makes a compelling case for why capability matters more than compliance.

Designing Resilient Event-Driven Systems at Scale by Rajesh Kumar Pandey

This article shows how to build event-driven systems on AWS that handle real-world chaos like traffic spikes and cascading failures. Rajesh focuses on practical patterns like shuffle sharding and fail-fast strategies, emphasizing that resilience isn't about avoiding failure but designing systems that degrade gracefully and recover automatically. Great example of resilience principles applied to technical architecture.
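Shuffle sharding is worth a quick illustration (my own sketch of the general technique, not Rajesh's implementation): give each customer a small, deterministic random subset of workers, so a misbehaving customer can only poison its own shard, and the chance that any other customer depends on that exact shard is small.

```python
import itertools
import random

WORKERS = list(range(8))  # 8 workers in the fleet
SHARD_SIZE = 2            # each customer is served by 2 of them

def shard_for(customer_id, workers=WORKERS, size=SHARD_SIZE):
    """Deterministically pick this customer's worker subset."""
    rng = random.Random(customer_id)  # seeded per customer
    return tuple(sorted(rng.sample(workers, size)))

# With C(8, 2) = 28 possible shards, two customers rarely land on the
# exact same pair of workers, so one poisoned shard strands few others.
shards = [shard_for(c) for c in range(1000)]
pairs = list(itertools.combinations(shards, 2))
full_overlap_rate = sum(a == b for a, b in pairs) / len(pairs)
```

Here `full_overlap_rate` lands near 1/28 (about 3.6%), and it shrinks combinatorially as the fleet and shard size grow, which is why the pattern limits blast radius so effectively.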


Podcasts

Canva and the Thundering Herd - The VOID Podcast with Courtney Nash (The VOID) & Simon Newton (Canva)

Excellent deep dive into Canva's first public incident report featuring Simon Newton. The perfect storm involved CloudFlare network issues, a 20-minute origin fetch, and a known performance bug in their API gateway that had a fix waiting to deploy. What makes this brilliant is how they handled the response: freezing auto-scalers to prevent automation from making things worse (as it often does), using country-level blocks to control load, and bringing regions back systematically. The discussion of their IC process, vendor escalation procedures, and emphasis on practicing emergency controls in normal operations is great!

The One With Data Centers and Peter Pellerzi - SRE Prodcast with Steve McGhee, Matt Siegler (Google SRE) & Peter Pellerzi (Google)

Brilliant conversation about resilience at a very large physical scale. Peter shares how Google handled the Chile country-wide power outage through community response and adaptive planning, not rigid procedures but "cooperative adaptation" with global teams jumping in to help. The fuel truck testing story is great; they thought they understood refueling logistics until they actually tried it and discovered all the real-world constraints they'd missed. The emphasis on real-world testing, community support, and "things will break, how do you deal with it?" perfectly captures operational resilience thinking. Highly recommended.

