Resilience Bites #2 - What the Internet said last month!

Feb 1, 2025


In this January edition of Resilience Bites, I’ve curated a selection of noteworthy discussions and findings related to resilience from this past month.

I hope you find these highlights insightful and that they inspire you to explore the world of resilience engineering further.

My 'coups de cœur' have one or two ❤️ next to them.

Happy reading!

Must-check highlights

  • I stumbled upon a great series of blog posts by Uwe Friedrichsen about Resilience. It's one of the best pieces I've read on this topic in a long time. He covers important basics that everyone needs to understand. Throughout the series, he compares building resilience to climbing a mountain. If you care about Resilience or if your organization is working on it, you really need to read this. ❤️❤️

  • Catchpoint's SRE 2025 report - Some interesting but not surprising points.

    • "Slow is the new down": 53% of organizations agree that poor performance is as bad as downtime.

    • For the first time in five years, toil has increased to 30% from 25% in 2024, despite advancements in automation and AI. SRE teams are still trying to figure out how to use AI.

    • Over two-thirds of respondents feel pressured to prioritize release schedules over reliability. This isn't surprising; I see it all the time. I've talked about the prevention paradox before, and this is it in action. Balancing rapid development against system stability is a difficult tradeoff for organizations. It pits creation against prevention, and the human brain isn't good at weighing the two. See my LinkedIn post on the prevention paradox.

    • 40% of respondents reported handling between 1 and 5 incidents in the past 30 days.

    • There are significant differences in perceptions of reliability practices between individual contributors and higher management. Again, I am not surprised. I often see reliability goals that are not aligned with business goals, making them difficult to justify.


Podcasts

  • Observability: the present and future with Charity Majors ❤️

    • Great, great discussion on the state and evolution of observability. Charity talks about how traditional static dashboards are insufficient for modern software systems, advocating instead for dynamic, interactive tools that let engineers engage with and query their data to understand system behavior. She also talks about the importance of observability in the context of AI-generated code.

    • “SLOs provide a budget for teams to run chaos engineering experiments.”

    • I wrote about Amazon Search using SLOs to run chaos engineering experiments here.

  • Rethinking business continuity before the next big IT outage by Beth Pariseau

    • "In other engineering disciplines -- including aerospace engineering -- resilience engineering and business resilience are better understood than they are in IT, particularly as the focus for technologists has trended toward velocity in the cloud era, Betz said. 'We in the [IT] industry are still babies at doing this,' he said. 'IT needs to start adopting some of these practices from domains that understand this a lot better.'"

  • The “this is fine!” podcast

Video of the month

Try again: The tools and techniques behind resilient systems by Marc Brooker - from re:Invent 2024

Grand architectural theories are nice, but what makes systems resilient is in the details. Marc Brooker, VP and distinguished engineer, looks at some of the resiliency tools and techniques AWS uses in its systems. Marc rethinks retries, breaks open circuit breakers, decodes erasure coding, and tackles the tail. Learn about formal methods and simulation and how these tools help build faster code faster.


Upcoming Conferences

