Resilience Bites #15 - LinkedIn Rewind (week 25-26)
Hi everyone,
Welcome to the week 25-26 edition of LinkedIn Rewind!
Apologies for the brief pause from the newsletter over the past few weeks. I've been focused on something exciting at Resilium Labs and will share more details in the coming weeks. For now, I'm glad to be back sharing insights that matter most for building resilient systems.
This edition covers essential topics from verifying your runbooks to embracing the prevention paradox. I hope you find these perspectives interesting.
Until then, I hope you enjoy this rewind.
The LinkedIn Rewind
We test our code continuously and deploy multiple times a day, but somehow our runbooks are written once and forgotten.
Most organizations forget that runbooks don't stay accurate by themselves. You have to test and update them regularly, just like your code, or they'll be useless when you really need them.
[…]
Have you heard the saying 'Dogs Not Barking'? I love this term so much because it captures something so critical to resilience.
I first heard it in a Weekly Ops meeting at AWS about 8 years ago. One of the principal engineers on the call was discussing an incident, explaining that the lack of logs had been an important signal that needed to be monitored.
'Dogs Not Barking' comes from Sir Arthur Conan Doyle's Sherlock Holmes story, The Adventure of Silver Blaze.
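The idea of treating missing logs as a signal can be sketched in a few lines. This is only an illustration: the function name `is_silent`, the five-minute window, and the timestamps are assumptions made for the example, not details from the incident described above.

```python
from datetime import datetime, timedelta, timezone


def is_silent(last_seen: datetime, now: datetime, max_silence: timedelta) -> bool:
    """Return True when nothing has been logged within the allowed window."""
    return now - last_seen > max_silence


# Hypothetical values: the last log line arrived 12 minutes ago,
# and we expect to hear from the service at least every 5 minutes.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_log = now - timedelta(minutes=12)

if is_silent(last_log, now, timedelta(minutes=5)):
    print("No logs for more than 5 minutes: the dog is not barking.")
```

The point of the sketch is that the alert fires on absence. A check like this only helps if it runs somewhere that does not depend on the system that has gone quiet.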
[…]
"Have you tried turning it off and on again?"
We've all heard this classic IT support line and probably rolled our eyes. It sounds too simple, even lazy. But there's actual wisdom hidden in this advice.
Designing for recovery is one of the most overlooked aspects of building resilient systems. While many organizations focus on preventing failures, the truth is that failures will happen, no matter how much money you invest in prevention.
Your monitoring and communication channels will fail at the worst possible moment. Are you ready?
In 20 years, I've had to "fly blind" without monitoring three times. Lost all communications twice. Always during critical incidents.
I recently worked with a customer that lost their entire monitoring stack during an outage. While many would have panicked, they calmly switched to manual checks they'd practiced. They recovered in 20 minutes.
Start with what works, not what's broken.
When I start working with organizations on improving their resilience, I never start by asking "What's failing?"
I ask "What's working really well?" And that always surprises them.
Most organizations expect me to audit their problems and fix them. Instead, I want to understand their successes.
There's a simple reason for that. People resist being told what's wrong. They embrace being told what's right.
[…]
Don't let poor metrics mislead your organization
Here's a simple math problem that breaks many organizations.
You have 10 incidents. 9 of them resolve in 5 minutes each. 1 takes 6 hours.
Your MTTR says 40.5 minutes.
Do you think it tells a good story?
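The arithmetic is easy to check. The short sketch below uses only the ten durations from the example and compares the mean with the median and the worst case. The choice of median and maximum as comparison points is my own illustration, not a recommendation taken from the full post.

```python
from statistics import mean, median

# Incident durations in minutes, from the example above:
# nine incidents at 5 minutes each and one at 6 hours (360 minutes).
durations = [5] * 9 + [360]

mttr = mean(durations)       # (9 * 5 + 360) / 10 = 40.5
typical = median(durations)  # the middle of the sorted list is 5
worst = max(durations)       # the single 6-hour incident

print(f"MTTR (mean): {mttr} min")
print(f"Median:      {typical} min")
print(f"Worst case:  {worst} min")
```

The mean of 40.5 minutes describes none of the ten incidents: nine were over in 5 minutes and one lasted 360.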
[…]
I recently spoke with a Staff Engineer whose world-class resilience team was being "rightsized" after two years without major outages. Their crime? Being too successful at preventing disasters.
This is the Prevention Paradox: effective failure prevention makes itself appear unnecessary.
Continue reading on Resilium Labs Blog