Resilience Bites #8 - What the Internet said last month!

Mar 31, 2025


In this March edition of Resilience Bites, I’ve curated a selection of noteworthy discussions and findings related to resilience from this past month. And there is a lot to cover this month!

I hope you still find these highlights insightful and that they inspire you to explore the world of resilience further.

Happy reading!


Blog highlights

  • Failing Forward: Cultivating a culture that celebrates mistakes by Rohit Hasteer

    • The article discusses how business leaders can create environments where mistakes are viewed as learning opportunities. It emphasizes that organizations with psychological safety around failure can innovate more effectively, as team members feel empowered to take calculated risks. The author shares personal leadership experiences about transforming mistake-avoidance cultures into ones that "fail forward" by celebrating lessons learned and using setbacks as stepping stones toward better solutions.

  • The Long and Winding Road Towards Resilience - Part 10 by Uwe Friedrichsen

    • The final post in his fantastic series on resilience! Uwe examines whether organizations must always reach "advanced resilience" or can stop at interim plateaus. He argues that the appropriate destination depends on your needs—stability suffices for non-critical systems, robustness works for enterprise systems, basic resilience is necessary for safety-critical contexts, while advanced resilience prepares companies for VUCA environments.

  • (Un)coupling in Distributed Systems - Part 1 by Uwe Friedrichsen

    • This article examines coupling in distributed systems, focusing on what's needed for truly loose coupling between processes like microservices. While technical coupling (using asynchronous messaging instead of synchronous request-response) is commonly emphasized, the author argues functional coupling is more critical.

  • Gamechangers in Resilience: Leadership's Fragile Balance with Mark Heywood

    • This interview with Mark Heywood, a crisis management expert with a creative background, explores effective board leadership during crises. Heywood describes his unique journey blending structured corporate risk management with creative storytelling, which gives him valuable perspective on crisis response. He identifies five behaviors of impactful boards: clear role definition, emotional steadying, big-picture focus, decisiveness, and consistent communication. Common board frustrations include excessive debate, micromanagement, internal conflicts, and misaligned messaging. Even well-prepared boards face limitations including emotional bias, over-reliance on plans, and decision-making delays. Heywood's biggest concern beyond operational risks is leadership complacency—the subtle false security that develops when things are going well, leaving organizations vulnerable to disruption.

  • The Hidden Cost Of Perfectionism: Why Organizations Need Productive Failure by Michael Hudson

    • Organizations pursuing perfectionism inadvertently stifle innovation by creating low tolerance for risk and failure. Amy Edmondson advocates distinguishing between preventable failures (to be minimized) and intelligent failures (valuable for learning). Companies like Microsoft and Pixar have thrived by adopting "learn-it-all" mindsets that embrace productive failures. Leaders can foster this culture by building psychological safety, implementing learning systems, modeling vulnerability, and celebrating intelligent pivots. This approach enables organizations to minimize preventable errors while extracting maximum learning from necessary experiments.

  • In S3 simplicity is table stakes by Andy Warfield

    • AWS S3 turned 19 on Pi Day (March 14), evolving from its 2006 launch to now storing hundreds of trillions of objects across 36 regions. The S3 team has focused on making storage "simple" by eliminating distractions like capacity planning while responding directly to customer feedback. Recent improvements include removing the 100-bucket limit, implementing strong consistency, and introducing S3 Tables to provide native support for tabular data previously managed through formats like Apache Parquet and Iceberg. Throughout its evolution, S3's success has come from balancing the tension between simplicity and velocity while continuously adapting to how customers use their data.

  • What Progress In Learning From Incidents Actually Looks Like by John Allspaw

    • In his keynote at the Learning From Incidents conference, John Allspaw shared Indeed's success in building a culture that learns from incidents. Unlike many organizations that treat incident analysis as paperwork, Indeed created conditions where learning spreads throughout the organization. Their approach centers on building "the richest understanding of an event for the broadest possible audience". Indeed's group review meetings regularly attract participants from Marketing, Finance, Sales, and other non-technical departments, with attendees consistently rating these sessions as valuable. The company achieved this by building specialized interviewing skills (challenging for engineers), demonstrating what "different looks like" to shift expectations, and focusing on compelling storytelling rather than dry reports. This exemplar case shows that with dedicated analysts, leadership support, and patience, organizations can transform how they learn from incidents.

  • Antithesis - the last word in autonomous software testing discussion with Will Wilson

    • Antithesis offers a very different approach to software testing by running applications in a deterministic hypervisor and systematically searching for ways to break code. Unlike traditional testing where developers write specific test cases, Antithesis actively explores the entire "state space" of possible behaviors to catch rare bugs that standard testing misses. The company's technology originated from FoundationDB's sophisticated testing system, which enabled them to build a seemingly "impossible" distributed database. The platform makes non-deterministic software behave deterministically—controlling thread scheduling, network responses, and other typically random elements—allowing bugs to be precisely reproduced. While currently focused on enterprise clients, Antithesis plans to expand beyond distributed systems testing to websites, mobile apps, and games, with potential synergies with AI-driven development tools for automated bug detection and fixing.

  • Build multi-Region resilient Apache Kafka applications with identical topic names using Amazon MSK and Amazon MSK Replicator by Subham Rakshit

    • This post explains how to use MSK Replicator for cross-cluster data replication and details the failover and failback processes while keeping the same topic name across Regions.

  • The Trouble with Leader Elections (in distributed systems) by Joe Magerramov

    • Joe discusses the problems with leader election in distributed systems. While leader election lets a single host handle system-wide tasks using timed leases, it creates concerning tradeoffs. Leaders have outsized impact, creating large blast radii during deployments. Setting appropriate lease durations involves balancing between split-leader risks (multiple concurrent leaders) and liveness issues. Additionally, a leader might maintain its lease while failing to complete tasks effectively. The post suggests alternatives including localized leaders with smaller domains, idempotent co-leaders operating in parallel, or completely different approaches using queues and event-driven architectures.
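The lease-duration tradeoff described in the last item can be illustrated with a small sketch. This is not code from Joe's post: `LeaseStore`, its methods, and the timings are hypothetical, and time is passed in explicitly so the scenario stays deterministic. It shows both failure modes: a long lease delays takeover after a crash (liveness), and a leader that stalls after checking its lease can act alongside its successor (split leader).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    expires_at: float = 0.0


class LeaseStore:
    """Hypothetical in-memory stand-in for a shared lease store."""

    def __init__(self, duration: float) -> None:
        self.duration = duration
        self.lease = Lease()

    def try_acquire(self, host: str, now: float) -> bool:
        # Grant or renew when the lease is free, already ours, or expired.
        if self.lease.holder in (None, host) or now >= self.lease.expires_at:
            self.lease = Lease(holder=host, expires_at=now + self.duration)
            return True
        return False

    def is_leader(self, host: str, now: float) -> bool:
        return self.lease.holder == host and now < self.lease.expires_at


def demo() -> dict:
    # Liveness: "a" takes a 10 s lease and crashes at t=1.
    # "b" cannot take over until the lease expires at t=10.
    store = LeaseStore(duration=10.0)
    store.try_acquire("a", now=0.0)
    blocked_at_5 = not store.try_acquire("b", now=5.0)
    acquired_at_10 = store.try_acquire("b", now=10.0)

    # Split leader: "a" checks its lease at t=9, then stalls (for example a
    # long pause) until t=12 and acts on the stale answer while "b" leads.
    store = LeaseStore(duration=10.0)
    store.try_acquire("a", now=0.0)
    a_stale_belief = store.is_leader("a", now=9.0)
    store.try_acquire("b", now=10.0)
    b_is_leader = store.is_leader("b", now=12.0)

    return {
        "blocked_at_5": blocked_at_5,
        "acquired_at_10": acquired_at_10,
        "split_leader": a_stale_belief and b_is_leader,
    }


if __name__ == "__main__":
    print(demo())
```

A shorter lease shrinks the takeover gap but leaves less margin for stalls and clock error, which is one reason the post's alternatives, such as idempotent co-leaders, aim to make overlapping leaders harmless rather than impossible.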

Beyond Resilience: Worthy Reads

  • The Software Engineering Identity Crisis by Annie Vella

    • As AI coding assistants transform software development, engineers face a profound identity shift from creators to orchestrators. Many became engineers to build things with their own code, finding satisfaction in solving problems through craftsmanship. This evolution parallels the reluctant transition many engineers make to management, trading direct building for oversight. Three potential paths emerge: resist by focusing on domains requiring human expertise, adapt by embracing AI orchestration, or find balance through a pendulum approach that alternates between hands-on coding and AI guidance. Despite concerns about AI-generated code quality and maintenance challenges, this transformation may actually allow engineers to reclaim broader aspects of software development beyond mere coding.

  • Hackers Don't Check the Risk Register: Over-Reliance on Risk Management is Hurting Cybersecurity by Dan Glass

    • Dan Glass argues that modern cybersecurity has become dangerously focused on documenting risks rather than actively defending against them. He compares this to preparing for a fight by writing down your opponent's moves instead of actually training. Organizations spend excessive time on risk registers, impact assessments, and governance workflows while leaving security vulnerabilities unaddressed. Instead of "paper security," Glass advocates for hands-on technical approaches: comprehensive asset and vulnerability management, zero trust architecture that assumes breach, rapid detection and response capabilities, and rigorous configuration hardening.

  • In Praise of “Normal” Engineers by Charity Majors

    • The concept of a "10x engineer" persists despite flimsy research because we've all encountered exceptionally talented developers. However, this notion is problematic because productivity cannot be measured by a single metric across diverse engineering contexts, and individual capabilities change over time. More importantly, teams—not individuals—own software, making the performance of the collective far more critical than any single developer's abilities. The best engineering organizations are those where average engineers can consistently deliver value, not just places stacked with elite talent. Leaders should focus on creating supportive environments where engineers are hired for their unique strengths rather than lack of weaknesses, ultimately creating systems that mint world-class engineers instead of merely hunting for them.

  • Cloudflare Introduces AI Labyrinth to Combat Unauthorized Bot Crawling

    • Cloudflare launched AI Labyrinth, a defense mechanism using generative AI to create convincing decoy content that traps bots ignoring "no crawl" directives. Instead of blocking suspicious crawlers, the system serves pre-generated AI content through hidden links, wasting bot resources while collecting data to improve bot detection. This approach functions as a next-generation honeypot that can identify bots with high confidence when they follow these invisible links. Available to all customers including free tier users, AI Labyrinth requires just a simple dashboard toggle to activate.

  • How GitLab Measures Red Team Impact by Chris Moberly

    • GitLab's Red Team tracks an "adoption rate" metric that shows whether its security recommendations are implemented. They use GitLab labels to classify recommendations and track outcomes through a dynamic dashboard. Key lessons: implement metrics early, collaborate across teams, and use existing tools. Next focus: developing "threat resilience" metrics. Looking forward to reading that one.

  • Rewilding Software Engineering: Chapter 4 Summary by Tudor Girba and Simon Wardley

    • Software engineering is a decision-making process hampered by ineffective tools. System explainability requires considering the system itself, how information is extracted, and how it's used. Through examples like the "stuck cursor", the authors show that contextual tools can solve seemingly impossible problems. Software's highly contextual nature means generic tools can't adequately address specific issues. They advocate for "dynamic exploration" where tools appear only in relevant contexts, making systems more explainable.

  • Locks, leases, fencing tokens, FizzBee! by Lorin Hochstein

    • FizzBee is a new formal specification language, originally announced back in May of last year. FizzBee’s author, Jayaprabhakar (JP) Kadarkarai, reached out to Lorin and asked him what he thought of it. In Lorin’s words: “It wasn’t until I went through the exercise of modeling it that I discovered something about its behavior that I hadn’t realized before”. Brilliant write-up!

  • Paxos made visual in FizzBee by Lorin Hochstein

    • A follow-up to the previous post. Lorin explores the Paxos distributed consensus algorithm by modeling it in FizzBee.




Briefs from the social web

Mark Armour - “This is for my colleagues in the preparedness space. I'm very interested in knowing where opinions land on this. What do you think is more important to being able to respond and recover successfully? Let me stress: you have two options only.”

Solid results!

Russ Miles - “[…] I've often called this "Architecture Archaeology". The acceptance that any architectural view is immediately out of date and most useful for looking back only. Creating these perspectives is always useful but often very expensive, and if not repeated it ages poorly as any representation of what you really have now. […]”

James Summerfield - “Today, we're stepping out of stealth and introducing Phoebe, an agentic search tool designed specifically for engineers to accelerate troubleshooting across complex tech stacks. […]”

Yao Yue 岳峣 - “Resolution matters. Take CPU utilization as an example—what appears smooth and boring at a minutely granularity could be fluctuating absolutely wildly on a much finer time scale. We built a little demo to demonstrate this effect (link in comment).
We are building Rezolus, an OSS systems telemetry agent, and a suite of data and viz tools that follows, so finally you can see it, too.”
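The effect Yao Yue describes is easy to reproduce with made-up numbers. The sketch below is not the Rezolus demo: the per-second series is invented for illustration, with a 5-second burst at 100% CPU each minute over a 10% baseline, and `downsample` is a hypothetical helper that averages fixed windows.

```python
from statistics import mean


def downsample(samples: list[float], window: int) -> list[float]:
    """Average consecutive fixed-size windows, as a coarse metric would."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical per-second CPU utilization (%) over three minutes.
per_second = ([100.0] * 5 + [10.0] * 55) * 3
per_minute = downsample(per_second, 60)

if __name__ == "__main__":
    print(per_minute)       # [17.5, 17.5, 17.5]
    print(max(per_second))  # 100.0
```

At one-minute granularity the series is a flat 17.5%, while the per-second data saturates the CPU for five seconds of every minute.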

Brian Finster - “[…] The following are just some examples of bad ideas:
Individual task ownership
individual code ownership
One process for regular change and another for hotfixes
Testing teams
GitFlow
Long-lived branches for ANY reason
Command line access to production
Feature teams
You're welcome.”

Alexis Richardson - “"The site is down - which YAML do we need to fix?"
It is time to build a better config solution. Too many broken configurations are leading to serious outages. Using more DevOps tools adds complexity. It is like swimming in spaghetti. A simpler way is needed, that can bring everyone forward.
Today we are announcing ConfigHub Inc. and our first funding round of $4M. […]”

Cory ODaniel - “The silent crisis in cloud operations isn’t tooling — it’s people. […]”

Steve Fenton - “Meanwhile in DevOps No.22
After reading A Field Guide to "Human" Error and Bob's Guide to Operational Learning, a quick detour into #HOP - it applies to our work in #DevOps too!”


Sabith Venkit - “Building Resilience: The Three-Legged Stool of Recovery
Imagine a three-legged stool representing your organization's ability to recover from disruptions. Each leg is crucial: Recoverability, Immutability, and Integrity. If any leg is weak or missing, the stool collapses. […]”

Michael Hudson - “What's the cost of pursuing perfection?
Imagine a zero-error day where everything goes flawlessly and according to plan. Is that a dream scenario or one that's too safe, predictable and ultimately limiting? Most organizations claim to value innovation, yet simultaneously cultivate environments with low tolerance for the unpredictable outcomes and uncertain experiments that true innovation requires. […]”

John Allspaw -
“1. PEOPLE keep things working.
2. When things break down, PEOPLE work to make the consequences much less than they might have been otherwise.
Both dynamics are, for the most part, invisible to management.
(via Laura Maguire, PhD's dissertation)”

Andrea Laforgia - “"My team is great, they haven't produced a defect in a long time!"
I've heard managers say this with pride, but the truth is that "defects per unit of time" is a terrible metric for evaluating a team's success. […]”

Ilya Bezdelev - “One reason Amazon has great operational excellence is the "you built it, you run it" approach to software maintenance. There are no separate SRE teams, the dev team does their own SRE. You don't want to get paged at night, so you build systems that you can maintain and fix if they break. […]”

Jay Gengelbach - “Terms every engineer should know: Chesterton's Fence
This comes from a small parable by G.K. Chesterton, wherein he argues that one should not tear down a fence that someone else has built just because you don't currently see the value of the fence. Instead, you should study: understand why someone spent time and resources erecting that fence. Only when you understand why it exists can you determine whether it's now obsolete. […]”

Mike Rayo - “As we listen and learn from more and more organizations, I grow increasingly convinced that the most important ingredient to beginning and sustaining a New Look/New View/Safety II/Safety Differently/Capability-based/Adaptation-based aspect for your Systems Performance (notice I didn't say safety?) group is the willingness of your organization, however temporary, to spend $1 (Euro, peseta, etc.) or one minute of someone's time explicitly engaged in understanding and supporting what their people are doing RIGHT NOW to keep the system running, and keep it running safely. […]”


Videos of the month

Ever wondered how AWS thinks about, and builds, resilience into everything it does? My former colleague and Principal Technologist for Financial Services, Robert Charlton, explains how they do it in this nice series of videos!

Building Resilient Cloud Services (Part 1): A modern approach | Amazon Web Services

Building Resilient Cloud Services (Part 2): What is an AWS Region? | Amazon Web Services

Building Resilient Cloud Services (Part 3): AWS data center innovations - engineering for resilience



