Resilience Bites #11 - What the Internet said last month!
May 5th, 2025
In this edition of Resilience Bites, I’ve curated a selection of noteworthy discussions and findings related to resilience from this past month - April 2025.
I hope you still find these highlights insightful and that they inspire you to explore the world of resilience further.
Happy reading!
Blog highlights
(Un)coupling in distributed systems - Part 2 by Uwe Friedrichsen
This is part 2 of Uwe’s series on coupling in distributed systems, focusing on the "redundancy fallacy" and temporal coupling. Uwe explains that redundancy alone doesn't solve tight coupling issues and can introduce new failure modes. Temporal decoupling (separating request processing from external data access, or decoupling requests from responses) offers more design flexibility and indirectly promotes functional decoupling, which is essential for distributed systems. Link to part 1.
The Competitive Edge Of Negative Capability: How Great Leaders Thrive In Uncertain Times by Michael Hudson
Another beautiful blog post from Michael. Here is a quote from the article: “When facing uncertainty or growing complexity, many leaders grasp instinctively for greater control through standardization and structure. I’ve seen this pattern emerge repeatedly. But a bias toward trying to reduce uncertainty frequently leads to premature conclusions, overlooked strategic opportunities, and rigid frameworks ill-equipped for adaptability.” This is spot on! The instinctive reaction to uncertainty is often control and rigidity, when it should be slack and space for adaptation and resilience. A must-read!
Decomposing Transactional Systems by Alex Miller
Alex breaks down database transaction systems into four fundamental operations: execution (evaluating transaction content), ordering (assigning timestamps), validation (checking for conflicts), and persistence (making changes durable). He demonstrates how various database architectures, including FoundationDB, Spanner, TAPIR, Calvin, CURP, and TicToc, arrange these operations differently to achieve specific performance characteristics.
Decomposing Aurora DSQL by Marc Brooker
Marc reflects on Alex Miller's model of transaction systems, which breaks them down into four functions: execution, ordering, validation, and persistence. Marc maps Aurora DSQL's architecture to this framework, explaining how it processes transactions across horizontally scalable PostgreSQL-powered query processors.
Good models protect us from bad models by Lorin Hochstein
Lorin argues that while resilience engineering insights might not provide specific actionable solutions to prevent incidents, they still offer value by providing better conceptual models. He suggests that humans inevitably use models to understand complex systems and incidents, and without good models (even if they're not directly actionable), we default to simplistic but incorrect models that feel more actionable. Lorin draws a parallel to Andrew Gelman's observation that bad social science proliferates without good social science to counter it.
Model error by Lorin Hochstein
Lorin explores how software contains inherent models of the world, from database schemas to control systems. He emphasizes that all models are necessarily incomplete and wrong, noting examples like the oversimplified assumptions about human names in programming.
Safety-Organized Criticality by Tom Geraghty
Tom applies Self-Organised Criticality (SOC) to workplace safety, explaining how small, seemingly harmless adaptations and deviations in normal work gradually push systems toward critical thresholds where minor triggers can cause major incidents. This "Safety-Organised Criticality" parallels established safety theories and explains why incidents follow power-law distributions, with near-misses being frequent and severe injuries rare but emerging from the same underlying processes. This framework directly relates to resilience by highlighting why systems inevitably drift toward vulnerable states through normal operations, making resilience capabilities essential for detecting accumulating risks and adapting before they cascade into serious failures.
Impact, agency, and taste by Ben Kuhn
In his analysis of high-performing colleagues at Anthropic, Ben observes that the distinguishing factor isn't simply technical skill but the capacity for high-leverage adaptation: finding and executing work that maximizes impact with minimal resources. Ben identifies two critical components that enable this adaptive capacity: agency (the initiative and resourcefulness to drive change) and taste (the intuition to select effective approaches).
Why Employees Stay Silent When They See Warning Signs of a Problem by Hyunsun Park and Subra Tangirala
The article examines how organizations handle ambiguous versus clear threats. Research shows that employees are more likely to remain silent about ambiguous threats, deferring to leadership instead of speaking up. This can leave organizations vulnerable when employee engagement is most needed.
Studying organisational cultures and their effects on safety by Ben Hutchinson
Ben summarizes Hopkins's look at how organizational cultures (plural, not singular) impact safety, noting there's no consensus on what "safety culture" means. The discussion covers various research methods: perception surveys (which capture attitudes but may miss complex practices), ethnographic studies (richer but time-intensive), and major accident inquiries (which reveal cultural factors through inference).
What Makes Difficult Incidents So Difficult? by Hamed Silatani
In this post, Hamed unpacks a few patterns that make some incidents harder than others, and how mismatched mental models between teams can slow everything down.
On describing, not explaining by Paige Cruz
Paige and her partner were watching The Bourne Identity when they heard a mysterious sound from above their living room. Initially, they jumped to various explanations (acorns falling, birds hitting windows) without success. Then Paige remembered a principle from incident response: "Remember to Look for Descriptions, Not Explanations."
Antithesis driven-testing by Carl Sverre
"A system that can simulate the full chaos of the internet—flaky networks, machine crashes, race conditions—while simultaneously learning how to be the worst possible horde of users. A testing pattern that can give me confidence that I won’t be woken up at 3 AM for a production outage or discover I’ve silently been losing data.
That’s the bar Graft has to meet, and this is the story of my experience using Antithesis to test Graft. […]”
Slack's Migration to a Cellular Architecture by Cooper Bethea
Cooper Bethea explains the need for a cellular architecture at Slack, triggered by availability zone outages. He details the "before" and "after" of their production environment, emphasizing the strategic choices made for services with varying consistency requirements. He also covers the success drivers, including incremental implementation and embracing "good enough," that enabled this complex migration.
Cultivating a Culture of Resilience in Software Organizations by Ben Linders
InfoQ interviewed Kathleen Vignos about cultivating a culture of resilience.
Operational resilience and stress-testing for "wartime" by Noah Bovenizer
The article discusses how the 2024 CrowdStrike-related IT outages served as a wake-up call for businesses to reassess their disaster recovery strategies.
Why Are All the Smart People So Bad at History? by Joan Westenberg
“But history isn’t a clean dataset. It doesn’t behave like a lab experiment. You can’t rerun the 20th century with a tweak and expect a stable result. Contingency is everything. People make choices. Systems react. Culture evolves. Timing matters. Change one node and the network shifts in ways you can't predict. There’s no control group for history. You don’t get to A/B test the Russian Revolution.”
Video of the week
The Search for Solutions, Trial & Error - Human-Powered Record & More (1979) | 16mm Film Scan
One episode of the mostly lost/forgotten nine-part 1979 educational series, The Search for Solutions. Features a number of interesting (and not so interesting) subjects.
A fascinating story that demonstrates the value of experimentation, trial-and-error, and iterative improvement when dealing with complex challenges, all key principles in building resilient systems.
Podcasts
Balancing Coupling in Software Design with Vlad Khononov
Thomas Betts speaks with Vlad Khononov about balancing coupling in software design, the subject of his recent book. They discuss how coupling is necessary for a system to function, but has to be balanced to allow the system to evolve. Vlad identifies three factors that can be used to measure coupling: knowledge sharing, distance, and volatility.
Taming Flaky Tests: Trisha Gee on Developer Productivity and Testing Best Practices with Trisha Gee
In this podcast, Shane Hastie, Lead Editor for Culture & Methods, spoke with Trisha Gee about the challenges and importance of addressing flaky tests, their impact on developer productivity and morale, best practices for testing, and broader concepts of measuring and improving developer productivity.
Resilience, Complexity, and Your Boss, a collab with Punk Rock Safety
The podcast features Clint and Colette discussing resilience engineering with the hosts of the Punk Rock Safety podcast (Ben, Dave, and Ron), exploring how to build resilience in organizations where leaders have deterministic mindsets.
DORA Community Discussion - Resilience Engineering with Colette Alexander
Colette Alexander presented on the principles of resilience engineering, its historical context, and practical applications, emphasizing the importance of deep incident analysis and critical examination of standard practices.
Briefs from the social web
Katerina Trajchevska - “If quality is someone else’s job, it’s everyone’s problem.
At GitLab, there’s no traditional QA team.
Developers are fully responsible for testing and quality - from writing automated tests to defining test cases and verifying features.
A small QA team supports them, but they don’t block releases. They coach. […]”
Pedro Gil Carvalho - “It’s insane to me that I have to trawl through jira, slack, confluence, github, datadog and incident io just to understand what’s being done about a bug that caused an outage - or pull a bunch of people out of focus for a sync status update. […]”
Michael Hudson - “Traditional resilience training often falls short because it:
- Prioritizes reflection over action;
- Ignores the need for collective resilience;
- Demands time/resources managers don’t have.
What’s needed is micro-development—small, intentional growth moments built into daily work […]”.
Vera Cherepanova - “Why do employees stay silent in the face of ambiguous threats? Because unclear risks are mentally exhausting—and traditional structures push decision-making up the chain. But waiting for leadership to notice the warning signs? That's risky.[…]”
James Pomeroy - “It's 20 years next week since the Amagasaki train crash, one of Japan’s worst rail disasters. What can Amagasaki teach us about the cultural forces in disasters? […]”
Vas Grygorovych - “The CEO of NVIDIA doesn’t like firing people.
“I used to clean bathrooms, and now I’m the CEO of one of the biggest companies in the world” - says Jensen Huang.
It’s a reminder that people can grow in ways you never expect if you actually give them the chance. […]”
Hamed Silatani - “I used to be an advocate for assigning incident severity.
Now… I’m not so sure it serves a purpose for us.
I dig into the reasons why and the context behind my shift in perspective in this article—but I’d love to hear yours: What value do you see in assigning severity levels to incidents?[…] ”
Courtney Nash - “I'm going to have to write more about this in detail, but for now: Dear CNBC, what in the actual &%#* were you thinking when framing your article (and the headline) this way?”
Isaiah Olson - “Your team doesn’t need a 10x dev — it needs a 1x culture.
Chasing “10x engineers” is a trap.
One developer pulling heroic all-nighters, building bespoke tools, and solving problems no one else understands might feel like velocity — until they leave.
Then what?”
Amy Edmondson - “I call "failing well" a science for a reason… because it's much more than a catchy term.
Failing well describes the systematic approach to understanding our failures: how we think about, talk about, and learn from them. The result? A healthier relationship to failure and an improved ability to pursue intelligent risks.”
Jay Gengelbach - “It took me 10 years of building infrastructure at Google to learn this uncomfortable truth: nobody cares about your infrastructure. […]”
iluminr - “You need to come out of a scenario sweating…because if you haven’t been through the wringer in rehearsal, you won’t know how to respond when it’s real. […]”
Hamed Silatani - “Ever been in an incident where no one’s quite speaking the same language?
You ask for help, but the response is more questions. Time drags on. There is no easy answer… […]”
Carlos Arguelles - “"This line [of code you added a decade ago] saved me hours of debugging today" - a slack I received yesterday from Adithya Venkatesh made my day. […]”
Paula Fontana - “It started with salad and ended in the multiverse.
Just a casual dinner with my daughter - until we realized we weren’t having the same conversation.
We were speaking the same language, but from different realities. […]”
Alex Hidalgo - “I'm not big on famous quotes, but I do think about this one from Robert Watson-Watt a lot.
"Give them the third-best to go on with; the second-best comes too late; the best never comes."
This is essentially what SLOs are all about. […]”
Chris Konrad - “Most organizations focus on “What’s the likelihood this happens?”
But for critical infrastructure — the real question is: 👉 What happens WHEN it does?”
Upcoming Conferences
IaCConf 2025 - The First Community-Driven IaC Conference | May 15, 2025
SREday Amsterdam 2025 Q2 https://www.papercall.io/sreday-amsterdam-2025-q2
DevOpsDays Medellin 2025 https://www.papercall.io/dodmde2025
SREday Cologne 2025 Q2 https://www.papercall.io/sreday-2025-cologne-q2
SREday London 2025 Q3 https://www.papercall.io/sreday-2025-london-q3
Conf42.com Incident Management 2025 https://www.papercall.io/conf42-incident-management-2025
SREday Amsterdam 2025 Q4 https://www.papercall.io/sreday-amsterdam-2025-q4
Conf42.com DevSecOps 2025 https://www.papercall.io/conf42-devsecops-2025
SREday London monthly MEETUP - ongoing CFP https://www.papercall.io/sreday-london-meetup
Site Reliability Engineering NL Meetup https://www.papercall.io/site-reliability-engineering-nl
5th International Conference on Intelligent Computing & Optimization 2022 https://www.papercall.io/our-event
DevOps and AI Latvia meetups https://www.papercall.io/devops-talk-latvia