Resilience Bites #12 - LinkedIn Rewind (week 19)
May 11th, 2025
Hi everyone,
Welcome to this combined edition of LinkedIn Rewind!
I've merged the past couple of weeks' highlights as I moved from my winter cavern far above the Arctic Circle (where I spent the winter chasing northern lights and fresh powder) to my peaceful southern forest hut. Thank you for being so patient!
These past two weeks, I've discussed how resilience emerges from questioning the status quo, bridging the gap between theoretical and actual system behavior, and embracing the irreplaceable human element that makes our systems truly adaptive. I've also shared how thoughtfully designed chaos experiments act like vaccines for our systems, causing minor discomfort now to prevent major pain later, while helping teams escape the demoralizing cycle of constant firefighting that drains creativity and innovation. Every organization faces a choice between investing in resilience today and paying a steeper price tomorrow, especially as recovery speed becomes very important in a world where complex failures are inevitable, but their impact doesn't have to be.
Let's jump right into it!
The LinkedIn Rewind
Gamechangers in Resilience: The Prevention Paradox
Last week, I did an interview for the “Gamechangers in Resilience” series published by Iluminr.
In this interview, I discuss resilience by connecting the dots between Finnish saunas, system design, and teamwork. I explain how jumping from hot saunas into frozen lakes is similar to chaos engineering—our bodies get stronger through controlled stress.
I challenge some common practices that don't work well, like overly detailed playbooks, complex automation that backfires, and the myth of a single "root cause" when things break. For me, resilience isn't about preventing every possible failure—it's about how you respond when things inevitably go wrong.
I talk about why curious teams beat blame-focused ones, why simpler solutions often beat complex ones, and why principles beat rigid procedures. I explain the "prevention paradox" (where good prevention work becomes invisible), why architecture reviews miss the spaces between components, and how I'd build resilient organizations from scratch.
Most importantly, I explain that humans make systems resilient, not tech: by being comfortable with uncertainty, sharing knowledge widely, maintaining some slack capacity, and treating incidents like playing jazz, where improvisation matters more than sticking to the sheet music.
The prevention paradox might be the most significant obstacle organizations face when building resilient systems.
And yet, hardly anyone talks about it.
When we successfully prevent a threat before it happens, people often wonder if the threat was even real in the first place.
Your audience isn’t necessarily your customer
A hard-learned lesson from my first few months as an independent consultant: having a large audience doesn't automatically translate to having clients.
I've spent years building a following around resilience and chaos engineering content. People read my posts, engage with my ideas, and share my work, which has been immensely gratifying (thank you!).
[…]
Continue reading on LinkedIn
Everyone has a plan until they get punched
This is a gentle reminder that having a plan does not mean your organization is resilient... unless you consistently exercise it, test it under diverse conditions, and continuously adapt it based on what you learn.
Continue reading on LinkedIn
The past does not repeat itself
Too often, I see organizations use past incident reports as their primary source of risk analysis, believing they're being data-driven.
This is a problem because in complex systems, the next major failure almost never repeats the last one.
In traditional engineering, past performance is often a reliable predictor of future behavior. For example, a bridge that has stood for 50 years has proven its resilience against known stressors.
But software systems are fundamentally different. Software exists in a constantly changing world.
[…]
The difference between “should” and “how”
Documentation explains how a system should work. Resilience comes from understanding how it actually works.
The gap between these two is where outages hide.
Run GameDays to build intuition for your system's real behavior, not its theoretical one.
Continue reading on LinkedIn
Question established practices
I think we should question established practices more often. Regularly asking 'Why do we do things this way?' and 'Is there a better approach?' helps us spot problems before they become disasters and find better solutions. That's what a healthy, resilient organization looks like.
Building immunity with chaos engineering
Chaos engineering doesn't create problems; it reveals them before they hurt you.
Like a vaccine uses a weakened pathogen to build immunity, controlled failure experiments build technical and organizational immunity to outages.
The mild discomfort now prevents severe pain later.
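To make the vaccine analogy a little more concrete, here is a minimal sketch of what a controlled failure experiment can look like in code. Everything in it is illustrative, not something from the original post: the fetch_inventory stand-in, the injected 200 ms delay, and the 300 ms p95 hypothesis are all assumptions for the sake of the example.

    import random
    import time

    def fetch_inventory():
        # Hypothetical downstream dependency; normally answers in ~10 ms.
        time.sleep(0.01)
        return {"sku-123": 42}

    def with_latency(func, extra_seconds, probability):
        # Wrap a dependency so a fraction of calls suffer an injected delay.
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(extra_seconds)
            return func(*args, **kwargs)
        return wrapper

    def experiment():
        # Steady-state hypothesis: p95 latency stays under 300 ms even when
        # 20% of inventory lookups are 200 ms slower than usual.
        slow_fetch = with_latency(fetch_inventory, extra_seconds=0.2, probability=0.2)
        durations = []
        for _ in range(100):
            start = time.perf_counter()
            slow_fetch()
            durations.append(time.perf_counter() - start)
        p95 = sorted(durations)[94]
        print(f"p95 latency: {p95 * 1000:.0f} ms")
        assert p95 < 0.3, "hypothesis broken: investigate before real traffic does it for you"

    if __name__ == "__main__":
        random.seed(42)
        experiment()

The point is not the numbers; it's that the discomfort is deliberate, bounded, and measured against a hypothesis you wrote down before running it.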
Continue reading on LinkedIn
Limitations of technical solutions
Let me tell you one truth today!
Technical solutions can only take you so far. Humans are what make your systems resilient.
The ability to respond creatively to unexpected situations can't be automated away.
Your most important resilience pattern isn't in your code or architecture; it's in your team's capacity to adapt.
The "firefighting trap.” I’ve seen that pattern affect many organizations.
It goes something like this:
1 - Push a seemingly harmless update
2 - Users start complaining
3 - Team scrambles to identify and fix the issue
4 - Create new alerts and metrics to catch similar issues
5 - Repeat with a different issue next week
This pattern is exhausting, demoralizing, and ultimately unsustainable, but it doesn't have to be this way.
[…]
If you’ve ever lived through a production incident, you’ve already lived through a miniature version of what happened today to Spain's power grid.
In complex systems, failures rarely stay isolated:
- One service slows down or fails.
- Other services start retrying aggressively.
- Queues back up.
- Systems that depend on those queues start timing out.
- Alert storms blind operators.
- Secondary services fail because their dependencies have gone dark.
- Before long, the entire system collapses, not because of the initial fault but because the stress cascades faster than it can be contained.
Sound familiar?
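To see how "other services start retrying aggressively" turns one slow dependency into everyone's problem, here is a tiny simulation sketch. The client counts, attempt limits, and failure rates are made-up numbers for illustration only, not figures from the Spanish grid incident or the original post.

    import random

    def calls_generated(clients, max_attempts, failure_rate):
        # Each client retries immediately on failure, up to max_attempts times.
        total = 0
        for _ in range(clients):
            for _attempt in range(max_attempts):
                total += 1
                if random.random() > failure_rate:
                    break  # call succeeded, stop retrying
        return total

    if __name__ == "__main__":
        random.seed(1)
        # Healthy dependency: ~5% of calls fail, so retries add little load.
        print("healthy: ", calls_generated(clients=1000, max_attempts=5, failure_rate=0.05))
        # Degraded dependency: ~90% of calls fail, and the same clients now hit
        # the struggling component with several times the normal traffic.
        print("degraded:", calls_generated(clients=1000, max_attempts=5, failure_rate=0.90))

Same clients, same code, roughly four times the load, aimed squarely at the component that can least afford it.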
Systems fail. The most important question is how fast you recover.
Yet fast recovery is so often an afterthought, or forgotten completely... until it is needed most.
After nearly two decades building systems, here's what I've found actually improves service recovery:
[…]
Choosing between cost and resilience
In every engineering organization, there's a moment of choice. Invest in resilience now or pay the price later.
Yes, building resilient systems is expensive.
You must invest in detection, recovery, deployments, prevention, and mitigation.
You need retries, timeouts, circuit breakers, heartbeats, health checks, failovers, jitter, queues, idempotency, and redundancy.
You should practice chaos engineering, load testing, and incident simulations.
And the list goes on.
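To give one concrete taste of that list, here is a minimal retry sketch with bounded attempts, exponential backoff, and full jitter. The call_remote stub, the failure rate, and the delay constants are my own illustrative assumptions, not a drop-in implementation.

    import random
    import time

    class TransientError(Exception):
        pass

    def call_remote():
        # Stand-in for a flaky network call; fails roughly 40% of the time here.
        if random.random() < 0.4:
            raise TransientError("upstream unavailable")
        return "ok"

    def call_with_retries(func, max_attempts=4, base_delay=0.1, max_delay=2.0):
        # Bounded retries with exponential backoff and full jitter, so thousands
        # of clients don't retry in lockstep and amplify an outage.
        for attempt in range(1, max_attempts + 1):
            try:
                return func()
            except TransientError:
                if attempt == max_attempts:
                    raise  # give up; let the caller (or a circuit breaker) decide
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
                time.sleep(delay)

    if __name__ == "__main__":
        try:
            print(call_with_retries(call_remote))
        except TransientError:
            print("all attempts failed; time to fail over or trip a circuit breaker")

Each of these mechanisms is cheap on its own. The cost is in building, testing, and practicing all of them, consistently, before you need them.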