Resilience Bites #4 - LinkedIn Rewind (week 9)

March 9th, 2025


Hi everyone,

I've received several messages from you asking for a recap of my LinkedIn content, and I thought it would be a great idea to give it a try!

With LinkedIn's algorithm becoming increasingly unpredictable, many of you have mentioned not seeing my posts consistently in your feed anymore. This newsletter is my attempt to ensure you don't miss any valuable insights or updates I've shared throughout the week.

Think of this as your one-stop summary of my LinkedIn activity - the key discussions, resources, and thoughts that might have slipped past your feed.

I'm excited to experiment with this format and would love your feedback on whether you find it useful.

Let's dive into this week's highlights!


The LinkedIn Roundup

Scheduled tasks and Jitter

Systems commonly fail when scheduled tasks like reporting jobs, marketing campaigns, and backup operations all begin simultaneously, creating resource contention that can crash critical services. Adding randomization or "jitter" to these scheduled tasks distributes the workload over time, preventing the aggressive competition for CPU, memory, and network resources that occurs when everything starts at fixed intervals. This simple technique of staggering execution times creates natural resilience against self-inflicted denial of service, improving overall system stability.
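
As a rough illustration of the idea, here is a minimal sketch in Python (the interval, jitter window, and function name are mine, not from the original post) of adding a random offset to a periodic job:

import random
import time

BASE_INTERVAL = 3600   # nominal schedule: run roughly once an hour
MAX_JITTER = 300       # spread start times across a 5-minute window

def run_with_jitter(job):
    """Run a periodic job with a random offset so that many instances
    scheduled 'at the top of the hour' don't all start at the same second."""
    while True:
        # Sleep the base interval plus a random jitter before each run.
        time.sleep(BASE_INTERVAL + random.uniform(0, MAX_JITTER))
        job()

The exact numbers matter far less than the principle: any spread at all turns a synchronized spike into a gentle ramp.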

Retry storms

Systems that collapse under high load often do so because of "retry storms," where well-intentioned retry logic creates an exponential cascade of requests throughout the service chain. When multiple services in a chain each retry failed calls, a single failure can multiply dramatically: three services each making three attempts turn one failed call at the bottom into 27 requests against the struggling dependency. The problem is particularly insidious because developers typically see only their local part of the chain, so retry strategies go uncoordinated. During normal operations they mask underlying problems, while quietly creating the conditions for catastrophic failure when the system comes under stress.
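
A minimal sketch of a bounded-retry policy with exponential backoff and jitter, assuming a generic call() function; the attempt cap and delay values are illustrative, not from the post:

import random
import time

MAX_ATTEMPTS = 3    # hard cap so one failure can't fan out indefinitely
BASE_DELAY = 0.2    # seconds
MAX_DELAY = 5.0     # ceiling on the backoff

def call_with_backoff(call):
    """Retry a remote call a bounded number of times, with exponential
    backoff and full jitter, so callers back off instead of retrying in
    lockstep against an already-struggling dependency."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # give up and surface the failure to the layer above
            # Full jitter: sleep a random amount up to the capped backoff.
            delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

Backoff alone doesn't solve the coordination problem, though; the real fix is agreeing across the chain on who retries and how much.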

Chaos engineering vs code reviews and tests

Software testing and code reviews validate known scenarios and expected behaviors following a predetermined script, while chaos engineering deliberately introduces unexpected conditions to reveal how systems actually respond under stress. Chaos engineering exposes hidden interdependencies, non-linear interactions between components, and emergent behaviors that only appear during real-world failures—similar to how fire drills reveal practical weaknesses that theoretical planning misses. Perhaps most valuably, the planning process for chaos experiments often reveals that team members have fundamentally different assumptions about how their system works, allowing these misconceptions to be addressed before they contribute to a major outage.
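
For a flavour of what "deliberately introducing unexpected conditions" can look like in practice, here is a small, hypothetical fault-injection wrapper (the rates, names, and behaviour are mine for illustration; real experiments should be scoped, observable, and easy to roll back):

import random
import time
from functools import wraps

def inject_faults(failure_rate=0.05, extra_latency=2.0):
    """Wrap a dependency call and, with a small probability, add latency
    or raise an error so the team can observe how the surrounding system
    actually behaves when the dependency misbehaves."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                # Half of the injected faults are slow responses,
                # the other half are outright errors.
                if random.random() < 0.5:
                    time.sleep(extra_latency)
                else:
                    raise RuntimeError("chaos: injected dependency failure")
            return fn(*args, **kwargs)
        return wrapper
    return decorator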

High-latency degradation

High-latency degradation is more damaging than outright failure because systems get trapped in a resource-consuming limbo instead of cleanly failing and recovering. These "gray failures" occur when services operate with overly optimistic assumptions—generous timeouts, excessive retries, and connection pools that don't release resources quickly enough—leading to cascading problems as delayed requests accumulate and block new ones from being processed. The solution requires aggressive timeouts, limited retries, latency-based circuit breakers, and prioritizing resource release over recovery attempts, essentially redesigning systems to "fail fast" rather than struggling to maintain degraded service that ultimately consumes more resources without delivering value.
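
As a rough sketch of the latency-based circuit breaker idea (the class name, thresholds, and "consecutive slow calls" rule are simplifications of mine; production breakers usually track a sliding window of calls):

import time

class LatencyCircuitBreaker:
    """Fail fast when a dependency is consistently slow, instead of letting
    delayed requests pile up and exhaust threads and connections."""

    def __init__(self, latency_threshold=0.5, max_slow_calls=5, cooldown=30):
        self.latency_threshold = latency_threshold  # seconds considered "slow"
        self.max_slow_calls = max_slow_calls        # consecutive slow calls before opening
        self.cooldown = cooldown                    # seconds to stay open
        self.slow_calls = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: close the circuit and try the dependency again.
            self.opened_at = None
            self.slow_calls = 0
        start = time.monotonic()
        try:
            return fn()
        finally:
            if time.monotonic() - start > self.latency_threshold:
                self.slow_calls += 1
                if self.slow_calls >= self.max_slow_calls:
                    self.opened_at = time.monotonic()
            else:
                self.slow_calls = 0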

Heterogeneous redundancy

Monzo Bank's "Stand-in" platform exemplifies the principle of heterogeneous redundancy by implementing a minimal backup system that maintains only critical functions like transactions and balance checks rather than mirroring their entire infrastructure. What makes their approach uniquely successful is their continuous testing—they always have some customers using the Stand-in platform during normal operations, ensuring their redundancy is a living part of their system rather than an untested theoretical safety net. This practice transforms disaster preparedness from a static "build and forget" insurance policy into an active learning opportunity that continuously improves system resilience.

Critical System Bias

Major system outages typically don't originate in heavily protected "critical" components but rather in overlooked, supposedly "low-risk" support systems that receive minimal attention. This "Critical System Bias" leads organizations to overprotect perceived high-value systems while neglecting interconnected support systems that can trigger cascading failures, as demonstrated by the 2017 AWS S3 outage (which began in a routine billing system operation) and the 2019 Google Cloud outage (triggered by a simple configuration change). Chaos engineering helps counter this bias by democratizing resilience testing across all system components, revealing the true interconnected nature of complex systems and showing how seemingly unimportant services can bring down entire platforms.

Fail-open vs fail-closed

If authentication services fail, organizations face a critical choice: fail-closed (block all access) or fail-open (allow access without authentication). This decision reveals organizational priorities—choosing between availability and security. A real-world example involved a video streaming service deciding to fail-open during authentication failures, prioritizing continuous service for millions of paying customers over the risk of some unauthorized access. High-availability services like streaming and e-commerce often implement fail-open strategies, while security-critical systems like finance and healthcare typically implement fail-closed approaches. The most resilient systems emerge when these decisions are made collaboratively across engineering, security, product, and business teams rather than being treated as purely technical choices.
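
To show how explicit that choice can be made, here is a hypothetical sketch (the system names, exception, and verify_token helper are all made up for illustration) that encodes the policy per system instead of leaving it implicit in error handling:

# Which systems fail open or closed when the auth service is unreachable
# is a product and security decision, captured here as explicit policy.
FAIL_OPEN_SYSTEMS = {"video_streaming"}                 # availability wins
FAIL_CLOSED_SYSTEMS = {"payments", "health_records"}    # security wins

class AuthServiceUnavailable(Exception):
    """Raised when the authentication service itself is down or unreachable."""

def is_request_allowed(system, token, verify_token):
    """Check a token normally, falling back to the system's declared
    policy when the authentication service cannot be reached."""
    try:
        return verify_token(token)
    except AuthServiceUnavailable:
        if system in FAIL_OPEN_SYSTEMS:
            return True   # fail-open: keep serving, accept some unauthorized risk
        return False      # fail-closed: block access until auth recovers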

Chaos Engineering and AI

I've been thinking about how chaos engineering becomes even more critical as AI-generated systems grow in popularity. While AI helps us build complex systems faster than ever, it also creates new challenges in understanding how those systems actually work.

In this post, I am sharing some ideas from Bainbridge's "Ironies of Automation," the "70% problem" in AI development, and why chaos engineering might be our best tool for helping us understand these systems as AI abstractions multiply. I would love to hear about your experiences with AI-generated systems and how you're managing or thinking about complexity.

