Resilience Bites #18 - What the Internet said this summer!

Resilience Bites

Aug 25

In this summer edition of Resilience Bites, I've gathered some great reads and podcasts from July and August 2025 related to resilience.

Summer is really important to me here in Finland. After our long winter, we only get a few months of sun, so when it comes out, we spend all our time outdoors. This summer I was busy working in my food forest and garden, planting, harvesting, all that good stuff. I was also working with my amazing customers on their resilience journeys. And my mum visited so I wanted to spend quality time with her.

All that to say that I ended up with a pretty long list of articles and podcasts that piled up while I was focused on other things! But that's not necessarily bad; sometimes the best content finds you when you're not actively looking for it.

I hope you find these resources useful and that they get you thinking more about resilience. You can also follow the #ResilienceBites hashtag on LinkedIn to see posts I share throughout the month.

Happy reading!

Blog highlights

The problems that accountability can't fix by Lorin Hochstein
Argues that accountability fails to solve coordination problems between teams and situations where leaders have fundamentally flawed risk models, using the OceanGate submersible disaster as an example.

Dynamo, DynamoDB, and Aurora DSQL by Marc Brooker
Compares the architectural evolution from Amazon's original Dynamo paper to modern DynamoDB and Aurora DSQL, focusing on how each system achieves durability, consistency, and availability.

Controls vs Guardrails: Why Organizations Struggle with Resilience Despite Having All the Right Pieces by Adrian Hornsby
Argues that organizations struggle with resilience because they confuse rigid controls (which create friction during normal operations) with adaptive guardrails (which activate only when approaching danger).

Quicksilver v2: evolution of a globally distributed key-value store (Part 1) by Anton Dort-Golts and Marten van de Sanden
Describes how Cloudflare evolved their global key-value store from storing full datasets everywhere to using a proxy-replica architecture to manage disk space constraints.

Quicksilver v2: evolution of a globally distributed key-value store (Part 2) by Anton Dort-Golts and Marten van de Sanden
Explains Cloudflare's three-tier caching architecture (local cache, data center-wide sharded cache, and storage replicas) to handle massive scale while managing disk space efficiently.

Initiatives Focusing on Things That Go Right in JR East
Case study examining how focusing on successful operations rather than just failures builds organizational resilience and learning capacity.

Learn from AWS Fault Injection Service team's approach to Game Days by Iris Sheu and Edin Kozo
Details the AWS FIS team's structured approach to running game days for testing incident response, including preparation, execution, and post-game analysis to strengthen operational resilience.

Commentary on the Historical Contributions of Dr. Richard Cook's Realized Predictions of the Impact of New Technology on Complex Cognitive Work as Viewed Through the Lens of the Theory of Graceful Extensibility
Reflects on foundational resilience engineering insights about how technology changes affect complex system adaptability and human cognitive performance.

A Primer on Recognition Primed Decision-Making (RPD) by Jared Peterson
Explains how experts make decisions through pattern recognition and mental simulation rather than comparing options, challenging traditional decision-making models.

Introduction to Resilience Engineering by Michelle Casey
Comprehensive introduction to resilience engineering concepts, contrasting Safety-I (reducing failures) with Safety-II (understanding how things normally work) and reframing human error.

How AI Can Degrade Human Performance in High-Stakes Settings by Dane A. Morey, Mike Rayo, and David Woods
Research showing that while good AI can improve human performance, poor AI predictions can degrade human performance in safety-critical settings.

Negotiating the Paradox We Face in Resilience Engineering—Lessons From an Engineering Leader by Michelle Casey
Explores the tensions between efficiency and resilience in engineering practice, offering leadership perspectives on balancing competing priorities.

The Amagasaki Disaster by Tom Geraghty
Analyzes the 2005 Japanese train derailment that killed 107 people, exploring how production pressure and punishment culture created conditions where unsafe practices became normalized.

Cloudflare and the infinite sadness of migrations by Lorin Hochstein
Uses Cloudflare's DNS resolver outage to illustrate the inherent reliability risks of system migrations and the challenges of running old and new systems concurrently.

How AI support can go wrong in safety-critical settings by Emily Caldwell
Reports research showing that inaccurate AI predictions can cause dramatic degradation in human decision-making performance, even when accompanied by explanatory data that contradicts the AI.

Why MTTR is a Misleading Metric (And What to Track Instead) by Adrian Hornsby
Explains why Mean Time to Recovery is mathematically flawed for incident data and proposes alternative metrics like percentiles and impact-focused measurements.
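
To make the mean-versus-percentile point concrete, here is a minimal sketch in Python. The incident durations are invented for illustration (they are not taken from the article), and the `percentile` helper is a hypothetical nearest-rank implementation, one of several common percentile definitions.

```python
import statistics

# Hypothetical recovery times in minutes: mostly short incidents plus one long outage.
durations_min = [5, 7, 8, 10, 12, 15, 18, 20, 25, 480]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


mean_ttr = statistics.mean(durations_min)
p50_ttr = percentile(durations_min, 50)
p90_ttr = percentile(durations_min, 90)

print(f"mean={mean_ttr} min, p50={p50_ttr} min, p90={p90_ttr} min")
```

In this made-up dataset, the single 480-minute outage pulls the mean far above what a typical incident looks like, while the percentiles describe the bulk of incidents and leave the outlier to be examined on its own.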

Podcasts

Uptime Labs and the Multi-Party Dilemma (Part I)
Seasoned incident responders discuss a simulated drill exploring the multi-party dilemma—the challenge of coordinating incident response across teams with different missions and incentives.

Uptime Labs and the Multi-Party Dilemma (Part II)
Continues analyzing the incident drill, focusing on how team behavior changes under stress and how unspoken assumptions get tested during high-pressure situations.

Scaling Correctness: Marc Brooker on a Decade of Formal Methods at AWS
Marc gives us the inside story on AWS's decade-long journey with formal methods—powerful techniques for verifying software correctness.

Reframing Safety Debt – E. Asher Balkin
This presentation introduces the concept of "safety debt"—accumulated risks from postponing safety measures for short-term gains like revenue or production targets. It explores four consequences: trade-off effects, deferred safety controls, compounding costs over time, and hidden vulnerabilities that accumulate like financial debt.

Intro to Resilience Engineering with Michelle Casey (Episode 101)
Foundational introduction to resilience engineering principles.

Beth Adele Long on reliability and leadership
Examines the intersection of leadership practices and system reliability, emphasizing human factors in resilient operations.

Upcoming Conferences

TechSummit 2025 | Building Resiliency at Scale https://techsummit.io/call-for-papers/

SREday London 2025 Q3 https://www.papercall.io/sreday-2025-london-q3
DevOps Days Brasília 2025 https://www.papercall.io/devopsdaysbsb2025
Conf42.com Incident Management 2025 https://www.papercall.io/conf42-incident-management-2025
SREday San Francisco 2025 Q4 https://www.papercall.io/sreday-2025-san-francisco-q4
SREday Chennai 2025 Q4 https://www.papercall.io/sreday-2025-chennai-q4
SREday Cologne 2025 Q4 https://www.papercall.io/sreday-2025-cologne-q4
Africa DevOps Summit 2.0 https://www.papercall.io/adsscummit25
SREday Amsterdam 2025 Q4 https://www.papercall.io/sreday-amsterdam-2025-q4
SREday Paris 2025 Q4 https://www.papercall.io/sreday-paris-2025-q4
Conf42.com DevSecOps 2025 https://www.papercall.io/conf42-devsecops-2025
SREday Campinas 2025 Q4 https://www.papercall.io/sreday-campinas-2025-q4

resilience, chaos engineering, engineering, cloud computing, aws, resilience bites, software systems, software engineering

Adrian Hornsby

adhorn.me