Resilience Bites #17 - LinkedIn Rewind (week 27-28)
Hi everyone,
Welcome to the week 27-28 edition of LinkedIn Rewind!
This edition covers topics ranging from MTTR and cell-based architecture to perfectionism.
I hope you find these perspectives interesting and enjoy this rewind.
The LinkedIn Rewind
Your MTTR dashboard is lying to you.
Here's a simple math problem: You have 10 incidents. 9 resolve in 5 minutes, 1 takes 6 hours. Your MTTR says 40.5 minutes.
Does that tell the real story?
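The arithmetic is easy to check in a few lines. This is a minimal sketch using only the numbers from the example above; the median and maximum are shown purely for contrast and are not necessarily the alternatives proposed in the blog post.

```python
import statistics

# Incident durations in minutes, from the example above:
# nine incidents resolved in 5 minutes, one took 6 hours (360 minutes).
durations = [5] * 9 + [360]

mean = statistics.mean(durations)      # the number an MTTR dashboard reports
median = statistics.median(durations)  # the duration of a typical incident
worst = max(durations)                 # the single long outage

print(f"mean (MTTR): {mean} min")
print(f"median:      {median} min")
print(f"max:         {worst} min")
```

The mean comes out at 40.5 minutes, a duration that matches none of the ten incidents: nine were far shorter and one was far longer.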
I just published a new blog post that breaks down why MTTR misleads engineering teams and shares practical alternatives you can implement immediately.
I hope you enjoy it! Let me know :)
[…]
I've been holding off on commenting about the Google Cloud outage from June 13th, wanting to let the dust settle and see how the conversation evolved.
After reading through different threads and industry takes, I'm surprised by how much of the discourse revolves around persistent myths about failure and reliability.
[…]
Cell-based architecture is one of my favorite ways to contain failure and prevent its propagation.
In a cell-based architecture, resources and requests are partitioned into cells. Cells are multiple instantiations of the same service isolated from each other.
However, these service structures are invisible to customers. Each customer gets assigned a cell or a set of cells; this is also called sharding customers.
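As a rough illustration of sharding customers across cells, here is a minimal sketch. The cell names, the `cell_for` helper, and the hash-and-modulo placement are all assumptions made for this example; a real cell router might well use an explicit mapping table instead, since modulo placement moves customers around whenever the number of cells changes.

```python
import hashlib

# Hypothetical cell names: each cell is a full, isolated copy of the service.
CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to one cell using a stable hash of the customer ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


# Place 1,000 hypothetical customers, then simulate the failure of one cell.
customers = [f"customer-{i}" for i in range(1000)]
placement = {c: cell_for(c) for c in customers}

failed_cell = "cell-2"
affected = [c for c, cell in placement.items() if cell == failed_cell]
print(f"{len(affected)} of {len(customers)} customers affected")
```

Because each customer always lands in the same cell, a failure in one cell touches only the customers assigned to it, and the rest keep being served by the other cells.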
You can actually see when teams display resilient behavior.
It's most obvious when things go wrong. While the team might be under pressure, they still turn failures into learning moments instead of blame sessions.
Their first reaction is "that's interesting," not "who screwed up?" It's honestly beautiful to watch.
Continue reading on LinkedIn
My favorite way to improve resilience is conducting GameDays.
In the early 2000s, Jesse Robbins, who was "Master of Disaster" at Amazon, created and led a program called GameDay, inspired by his experience training as a firefighter.
That program was designed to test, train, and prepare Amazon systems, software, and people to respond to disasters.
One must spend approximately 600 hours training before becoming an active-duty firefighter. And that’s just the beginning. After that, some firefighters spend over 80% of their active-duty time in training, because when they operate under live-fire conditions, they need to rely on intuition to understand the fire they are fighting. To acquire that lifesaving intuition, they must train for hours on end. As the old adage says, "practice makes perfect."
[…]
Using a resilience score to drive resilience improvements sounds like a great idea. But the truth is that it can do more harm than good.
A resilience score is calculated by awarding points to services that adopt proven resilience best practices, such as a resiliency policy, alarms, standard operating procedures, test coverage, and chaos engineering experiments.
That score allows service owners to track their efforts to become resilient. Sounds great?
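To make the mechanics concrete, a checklist-style score of the kind described above might be computed as follows. This is only a sketch: the practice keys, the point values, and the `resilience_score` helper are hypothetical, since no scoring scheme is specified here.

```python
# Hypothetical practices and point values, chosen only for illustration.
PRACTICE_POINTS = {
    "resiliency_policy": 10,
    "alarms": 20,
    "standard_operating_procedures": 20,
    "test_coverage": 20,
    "chaos_experiments": 30,
}


def resilience_score(adopted: set[str]) -> int:
    """Sum the points for every practice the service has adopted."""
    return sum(
        points
        for practice, points in PRACTICE_POINTS.items()
        if practice in adopted
    )


# A service that has alarms and test coverage, and nothing else.
score = resilience_score({"alarms", "test_coverage"})
print(f"resilience score: {score} / {sum(PRACTICE_POINTS.values())}")
```

In this sketch the number reflects only which boxes a service has ticked.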
[…]
I've often found (and still find) myself getting paralyzed by this desire to be perfect. And that's exactly what happened when I started trying to do chaos engineering too.
I had read all the books and blog posts I could find, gone to the conferences, and listened to the experts. The whole thing. But I just couldn't seem to actually start doing chaos engineering at my company. I wanted to, and I really needed to, but I was just so worried about it. I wanted to have the perfect strategy, the perfect tools, and the perfect monitoring setup before even running my first experiment. Like, I thought I needed to have everything dialed in perfectly first.
Continue reading on LinkedIn