Resilience Bites #3 - What the Internet said last month!

Feb 28, 2025


In this February edition of Resilience Bites, I’ve curated a selection of noteworthy discussions and findings related to resilience from this past month.

I hope you find these highlights insightful and that they inspire you to explore the world of resilience engineering further.

My 'coups de cœur' are marked with one or two ❤️.

Happy reading!


Blog highlights


Briefs from the social web

Tudor Girba - “If you denote your system as legacy, chances are you perceive it as an impediment from a business point of view. Well, your legacy can become an opportunity.”

Ilya Bezdelev - “After the massive S3 outage in 2017, the COE showed that an engineer made a typo in a command that removed thousands of instances, causing half of the internet to go down. In the aftermath of the incident, all of us at AWS received a memo that, effective immediately, teams must review their processes and systems and automate everything that they can automate. If a command has to be run manually, it has to be run by two people. It was dubbed the "2P rule", jokingly referred to as a "to pee" rule, as in "I need you 2P with me…"”

Laurent Domb - “I am pleased to share that we've added new experiment capabilities to run network faults in ECS Fargate environments as part of our AWS Chaos Engineering Workshop using the AWS native Fault Injection Service!”

Sam Newman - “This is a fascinating writeup by Eran Stiller over at InfoQ of Monzo Bank's approach to ensuring that critical banking functionality is still available even in the face of a major outage: https://lnkd.in/d545hsMJ. Thanks also to Monzo's Daniel Chatfield for sharing his thoughts with Eran, to make this such an insightful read.”

Matthias Patzak - “"You build it, you run it" isn't just a catchy phrase. It's the core principle that transforms team accountability and software quality…”

Jonathan Courtney - “Most people want to sit around and criticise without ever putting themselves out there. Those people rarely get what they want. They don't find customers, people don't use their products.”

Lee Hannigan - “Imagine launching a globally replicated database without worrying about infrastructure, failover, or operational overhead. With Amazon DynamoDB, it’s as simple as this.”

Mike Rayo, PhD - “the most important ingredient to beginning and sustaining a New Look/New View/Safety II/Safety Differently/Capability-based/Adaptation-based aspect for your Systems Performance […] is understanding and supporting what their people are doing RIGHT NOW to keep the system running, and keep it running safely.”

Stephen Whitworth - “An investor asked me recently: "What will the impact of tools like Cursor, Lovable, etc., be on software reliability?" Good question. Here's what I think will happen.”

Jay Gengelbach - “Terms every engineer should know: Chesterton's Fence …”

Joao Neto - “When something breaks, the first reaction is often: "Why didn’t we test for that? We need more tests!" This has always boggled my mind. Testing is seen as the primary way to achieve quality, but adding more tests rarely addresses the root cause - it just patches the symptoms.”

itronitron - “Improving efficiency in a system generally reduces its resilience, so improving efficiency is definitely not always a good thing.”

ColinWright - “There's an important point in here: "Waste" isn't always waste ... sometimes it's built-in resilience. I remember working on a communications system where management were hell-bent on squeezing every last iota from the capacity. We compressed, we coordinated, we worked on making sure the system had no "waste". Then when someone inadvertently sent something they didn't expect, the entire system ground to a halt, unable to adapt, unable to limp along until it got sorted. Spare capacity is essential, and only looks like waste to those who are insufficiently grounded in relevant engineering principles.”


Podcasts

Video of the month

GitOps Best Practices Every DevOps Team Should Follow in 2025 by Christian Hernandez ❤️❤️

This GitOps presentation by Christian Hernandez is exceptional and worth every second of your time. Even seasoned operators will get some value out of it!



