Resilience Bites #15 - LinkedIn Rewind (week 25-26)

Resilience Bites

Jun 22

Hi everyone,

Welcome to the week 25-26 edition of LinkedIn Rewind!

Apologies for the silence. I took a brief pause from the newsletter over the past few weeks to focus on something exciting at Resilium Labs. I'll share more details in the coming weeks, but for now, I'm glad to be back sharing the insights that matter most for building resilient systems.

This edition covers essential topics from verifying your runbooks to embracing the prevention paradox. I hope you find these perspectives interesting.

Until then, I hope you enjoy this rewind.

The LinkedIn Rewind

Verifying runbooks

We test our code continuously and deploy multiple times a day, yet somehow our runbooks are written once and forgotten.

Most organizations forget that runbooks don't stay accurate by themselves. You have to test and update them regularly, just like your code, or they'll be useless when you really need them.

[…]

Continue reading on LinkedIn

Dogs not barking!

Have you heard the saying 'Dogs Not Barking'? I love this term so much because it captures something so critical to resilience.

I first heard it in a Weekly Ops meeting at AWS about 8 years ago. One of the principal engineers on the call was discussing an incident, explaining that the lack of logs had been an important signal that needed to be monitored.

'Dogs Not Barking' comes from Sir Arthur Conan Doyle's Sherlock Holmes story, The Adventure of Silver Blaze.
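The idea that missing logs are themselves a signal can be sketched as a simple staleness check. This is a minimal illustration, not the approach from the incident described above: the function name `is_silent`, the 5-minute threshold, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_silent(last_log_at: datetime, now: datetime, max_silence: timedelta) -> bool:
    """Return True when no log line has arrived within max_silence."""
    return now - last_log_at > max_silence


# Hypothetical values: the last log line arrived 12 minutes ago,
# and we assume anything quieter than 5 minutes deserves attention.
now = datetime(2025, 6, 22, 12, 0, tzinfo=timezone.utc)
last_seen = now - timedelta(minutes=12)
threshold = timedelta(minutes=5)

alert = is_silent(last_seen, now, threshold)
print("ALERT: the dog is not barking" if alert else "logs are flowing")
```

The point of the sketch is that the alert fires on the absence of events rather than on an error message, so it still works when the failing component can no longer report anything.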

[…]

Continue reading on LinkedIn

Turn it off!

"Have you tried turning it off and on again?"

We've all heard this classic IT support line and probably rolled our eyes. It sounds too simple, even lazy. But there's actual wisdom hidden in this advice.

Designing for recovery is one of the most overlooked aspects of building resilient systems. While many organizations focus on preventing failures, the truth is that failures will happen, no matter how much money you invest in prevention.

[…]

Continue reading on LinkedIn

Flying blind

Your monitoring and communication channels will fail at the worst possible moment. Are you ready?

In 20 years, I've had to "fly blind" without monitoring three times. Lost all communications twice. Always during critical incidents.

I recently worked with a customer who lost their entire monitoring stack during an outage. While many would have panicked, they calmly switched to manual checks they'd practiced. They recovered in 20 minutes.

[…]

Continue reading on LinkedIn

Start with what works, not what's broken.

When I start working with organizations on improving their resilience, I never start by asking "What's failing?"

I ask "What's working really well?" And that always surprises them.

Most organizations expect me to audit their problems and fix them. Instead, I want to understand their successes.

There's a simple reason for that. People resist being told what's wrong. They embrace being told what's right.

[…]

Continue reading on LinkedIn

Don’t let poor metrics mislead your organization

Here's a simple math problem that breaks many organizations.

You have 10 incidents. 9 of them resolve in 5 minutes each. 1 takes 6 hours.

Your MTTR says 40.5 minutes.

Do you think it tells a good story?
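The arithmetic can be checked with a short sketch. The median and maximum shown next to the mean are added here only for comparison; they are an assumption for illustration, not necessarily the metrics the full post recommends.

```python
from statistics import mean, median

# Durations in minutes: nine 5-minute incidents and one 6-hour incident.
durations_min = [5] * 9 + [6 * 60]

mttr = mean(durations_min)       # arithmetic mean, the usual MTTR
typical = median(durations_min)  # what a typical incident looked like
longest = max(durations_min)     # the outlier that the mean smears out

print(f"mean (MTTR): {mttr} min")
print(f"median:      {typical} min")
print(f"max:         {longest} min")
```

The mean comes out at 40.5 minutes, yet no incident in the sample lasted anywhere near that long: nine took 5 minutes and one took 360.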

[…]

Continue reading on LinkedIn

The prevention paradox

I recently spoke with a Staff Engineer whose world-class resilience team was being "rightsized" after two years without major outages. Their crime? Being too successful at preventing disasters.

This is the Prevention Paradox: effective failure prevention makes itself appear unnecessary.

Continue reading on Resilium Labs Blog

Tags: resilience, chaos engineering, engineering, cloud computing, aws, software, distributed systems, resilience engineering, software systems, SRE, DevOps

Adrian Hornsby

adhorn.me