Resilience Bites #16 - What the Internet said last month!


In this edition of Resilience Bites, I've curated a selection of great reads related to resilience from this past month, June 2025.

I hope you find these interesting and that they inspire you to explore the world of resilience further.

You can also follow the #ResilienceBites hashtag on LinkedIn to see all the posts I've tagged from the people I follow throughout the month.

Happy reading!


Shoutout

“This was /not/ a fun outage for me. Even though I was not directly in the full path of this outage, and the bulk of the systems I’m responsible for saw only minimal effects, I’m part of the load balancing SRE team, and was oncall during this time. Paged too many times to recall fully. I’ll be taking some personal learnings from this, and after writing them up I’ll post them. In the meantime, I’ll be taking some time to recover. Any other folks that go through challenging times like these, please take time to recover.”

A very important message from Tobias Weingartner, SRE at Google, reflecting on the aftermath of the recent Google outage. Important because it is a reminder that behind every outage are real people dealing with stress, pages, and pressure. It models healthy incident response behavior by acknowledging that people need recovery time and that burnout prevention is essential for sustainable operations.

Blog highlights

Why Slack outshines Zoom for incident management by Brent Chapman

Good analysis of why text-based channels work better than video conferences for incident response. Brent argues that Slack's asynchronous nature allows responders to focus on specific tasks while staying informed, whereas Zoom calls demand everyone's primary attention and become "inherently single-threaded." The points about persistent, searchable records and better tool integrations are well taken. "People who prefer to work in Zoom often think that they’ll remember to capture important information, discoveries, and decisions in Slack, but they seldom follow through on this effectively. In the heat of the moment, they become so engrossed in the conversation that critical information and insights never get captured in Slack, where they’d be most useful for later responders and post-incident reviewers." This is 100% true. Don't rely on people remembering to do the right thing under stress. Instead, design systems that capture information by default (like using Slack as the primary communication channel) rather than requiring extra cognitive overhead to transfer information between tools.
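To make "capture by default" a bit more concrete, here is a minimal sketch (my own illustration, not from Brent's post) of a small helper that responders, or a bot reacting to events, can call so every significant action lands in the incident channel as a timestamped timeline entry. It assumes the slack_sdk package and a bot token; the channel ID and the log_event helper are hypothetical.

```python
# Minimal sketch of "capture by default": a helper responders call (or a bot
# triggers automatically) so every significant action lands in the incident
# channel as it happens, rather than relying on memory after the fact.
# Assumes slack_sdk and a bot token; channel ID and log_event are illustrative.
import os
from datetime import datetime, timezone

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
INCIDENT_CHANNEL = "C0INCIDENT1"  # hypothetical incident channel ID


def log_event(kind: str, detail: str) -> None:
    """Post a timestamped, structured timeline entry to the incident channel."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%SZ")
    client.chat_postMessage(
        channel=INCIDENT_CHANNEL,
        text=f":pushpin: [{stamp}] *{kind}*: {detail}",
    )


# Example usage during response:
log_event("observation", "error rate on checkout-api up 40% since 14:02 UTC")
log_event("decision", "rolling back checkout-api to v2.31.0")
```

The point isn't the specific tooling: it's that the record gets written as a side effect of how people already communicate, so nothing depends on someone remembering to transcribe a Zoom call afterwards.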

Big Enough to Fail by Will Gallego

Interesting piece on why we're more forgiving when major cloud providers fail versus smaller services. Will’s theory: when external dependencies become "so tightly coupled, large, and fundamental," blame actually decreases during failures because the scale makes it feel exceptional and unavoidable. The recent GCP and Cloudflare outages are a good example of this. Widespread failures become more "understandable" than mid-tier SaaS outages. Smart reframe: Instead of "pick a service that never goes down," use these incidents to "highlight gaps in the system and produce insights" about your own resilience capabilities. Good reminder that "no one's perfect" applies to even the biggest tech companies.

From Crisis to Catalyst: How HR Leaders Can Reinvent Culture and Continuity by Ayme Zemke

Important perspective about the role of HR in helping build organizational resilience. The article suggests moving beyond wellness perks to embed support into workload expectations and leadership behavior. Key insight: "Organizations build resilience when they prioritize the physical, emotional, and mental well-being of employees." The focus on training leaders to spot stress early and creating psychological safety for discussing capacity limits is particularly important for sustainable performance (see the shoutout above).

SLA vs SLO by Alex Ewerlof

Good breakdown of often-confused terms in software engineering. Alex shared his simple test: "An easy way to tell the difference between an SLO and an SLA is to ask, 'What happens if the SLOs aren't met?' If there is no explicit consequence, then you are almost certainly looking at an SLO." Particularly important point: SLAs should promise less reliability than internal SLOs - "if internally we aim for 99.99%, the SLA we commit externally may be 99.5%." This gives you space to learn, adapt, and improve without the pressure of legal consequences. Once you truly understand your system's capabilities and have improved its reliability, you can tighten that gap.
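To make that gap concrete, here is a quick back-of-the-envelope calculation (mine, not from the post) comparing the downtime budget a 99.99% internal SLO allows with a 99.5% external SLA over a 30-day month.

```python
# Worked example of the SLO/SLA gap: downtime allowed per 30-day month at the
# two availability targets quoted in the post. The targets come from the post;
# the rest is just arithmetic.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime(target: float) -> float:
    """Minutes of downtime permitted per month at a given availability target."""
    return MINUTES_PER_MONTH * (1 - target)


internal_slo = 0.9999  # what the team aims for internally
external_sla = 0.995   # what is promised (and penalized) contractually

print(f"99.99% SLO: {allowed_downtime(internal_slo):6.1f} min/month")  # ~4.3 min
print(f"99.5%  SLA: {allowed_downtime(external_sla):6.1f} min/month")  # ~216 min
```

Roughly 4 minutes versus 3.5 hours of budget: that difference is exactly the room to learn and improve before a miss becomes a contractual problem.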

Can You Smell the Next Sev-1? by Hamed Silatani

A practical post on developing intuition for detecting incoming severe incidents before they fully manifest. Hamed explores the "early warning signals" that experienced engineers develop through pattern recognition and environmental awareness. Good discussion of how seasoned responders learn to "smell" trouble through subtle system behavior changes, unusual metric patterns, and environmental factors. Valuable for teams looking to build proactive incident detection capabilities.

What Went Well is More Than Just a Pat on the Back by Lorin Hochstein

Excellent post on the deeper value of "what went well" discussions in retrospectives. Lorin argues that celebrating successes isn't just about morale but about understanding and reinforcing the adaptive behaviors that prevent worse outcomes, and if you know me, you will know that I couldn’t agree more. He focuses on how teams successfully navigate complexity and uncertainty, which is often invisible in traditional incident analysis. Important reminder that resilience comes from understanding what works, not just what breaks.

Resilience is Part of the Product, Not an Afterthought by Martin Hinshelwood

Strong argument for embedding resilience thinking into product development from the start. Martin emphasizes that "Failures were not exceptional. Failures were normal. Resilience was not improvised. It was engineered." Good discussion of building resilience as a core design principle rather than bolting it on later. I want to point out that, in my opinion, the post incorrectly frames this as "designing resilience" rather than "designing for conditions that support resilient behavior." This distinction is important because it influences how we approach building systems. Do you focus on rigid engineered solutions, or do you focus on creating adaptive capacity and supporting human expertise? The concrete capabilities framework and rejection of hero culture are particularly valuable. Still worth sharing because the mindset is practical for implementation, even if the framing could be more nuanced.


Podcasts

Stress Test Your Strategy Before It Fails with Arjan Singh (HBR Podcast)

Excellent deep dive into corporate wargaming as a form of competitive simulation and stress testing. Arjan explains how these "dress rehearsals" help companies test strategies before deployment, moving beyond traditional scenario planning by adding layers of likely competitor actions and appropriate responses. The pharmaceutical case study is particularly interesting: a company practiced the "FDA rejection" scenario in their wargame, then executed their prepared playbook when it actually happened. The key insight: successful wargaming requires dedicated time (8-12 hours minimum), senior leadership involvement, and most importantly, turning insights into actionable playbooks with named owners. A great listen for anyone interested in gamedays and simulations.

The Modern Observability Roundtable: AI, Rising Costs, and OpenTelemetry by The New Stack Podcast

Timely discussion of current observability challenges, particularly around cost management in the age of AI and high-cardinality data. The roundtable covers practical strategies for managing telemetry costs while maintaining visibility, the role of OpenTelemetry in standardization, and how AI is both helping and complicating observability practices. Good insights on balancing comprehensive monitoring with budget constraints.

Google SRE Prodcast: The One With SLOs and Sal Furino

A discussion of Service Level Objectives and practical SRE implementation. Good insights from practitioners on how SLOs work in practice at scale, common pitfalls, and lessons learned from real implementations.

Google SRE Prodcast: The One with STPA

Deep dive into Systems-Theoretic Process Analysis (STPA) for hazard identification and safety analysis. STPA provides a systematic approach to identifying potential failure modes in complex systems by focusing on control structure and constraints rather than traditional component failure analysis. Great for anyone working on safety-critical systems or wanting to understand modern hazard analysis techniques.
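As a rough illustration of that shift in perspective, here is a toy sketch (my own, not from the episode) that models a single control action and enumerates the four standard ways STPA says a control action can become unsafe. The autoscaler example and the contexts are made up for illustration.

```python
# Toy sketch of the STPA framing (not the full method): instead of asking
# "which component fails?", model control actions and enumerate the standard
# ways each one can become hazardous in context.
from dataclasses import dataclass


@dataclass
class UnsafeControlAction:
    control_action: str
    guide_phrase: str  # one of STPA's four standard ways an action goes wrong
    context: str       # the condition that makes it hazardous


GUIDE_PHRASES = [
    "not provided",
    "provided when it should not be",
    "provided too early / too late / out of order",
    "stopped too soon / applied too long",
]

# Hypothetical controller: an autoscaler; control action: "remove instance".
ucas = [
    UnsafeControlAction("remove instance", GUIDE_PHRASES[1],
                        "remaining capacity cannot absorb current traffic"),
    UnsafeControlAction("remove instance", GUIDE_PHRASES[0],
                        "the instance is serving corrupt responses"),
    UnsafeControlAction("remove instance", GUIDE_PHRASES[3],
                        "draining continues past the deploy window"),
]

for uca in ucas:
    print(f"Hazardous: '{uca.control_action}' {uca.guide_phrase} "
          f"when {uca.context}")
```

The value of the exercise is that none of these hazards requires any component to "fail" in the traditional sense; the control action itself is unsafe in a particular context.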


