Resilience Bites #6 - LinkedIn Rewind (week 12)
March 22nd, 2025
Hi everyone,
Welcome to my weekly LinkedIn recap!
I'm delighted to see the growing community around these weekly summaries. Thank you for your continued engagement and thoughtful comments on last week's edition.
Let's dive right into this week's highlights!
The LinkedIn Roundup
🚀 I am thrilled to share some big news! I'm officially open for business as an independent consultant! 🎉
If you're looking to improve your systems' resilience or know someone who is, let's connect!
A new Agentic DevOps solution promises AI-native, self-operating infrastructure management, but it raises important questions about non-deterministic behavior and hidden complexity in autonomous systems. When AI agents make increasingly independent and potentially opaque decisions across infrastructure, recovering from inevitable failures becomes more challenging. The future likely isn't fully autonomous systems but the right balance between AI capabilities and human oversight. This inflection point requires new frameworks for responsibility, observability, and control as the industry works out what AI-human collaboration in infrastructure management should look like.
Read the post on LinkedIn
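To make the "human oversight" point a bit more tangible, here is a hypothetical Python sketch of an approval gate for agent-proposed changes. It is purely my own illustration, not any vendor's implementation: the ProposedChange and ChangeGate names, the risk scores, and the 0.3 threshold are all assumptions.

```python
# A hypothetical sketch of human oversight for an AI agent (my own illustration,
# not any vendor's API): low-risk changes are applied automatically, riskier ones
# need an explicit human sign-off, and every decision lands in an audit log.
# ProposedChange, ChangeGate, and the 0.3 threshold are illustrative assumptions.
from collections.abc import Callable
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProposedChange:
    description: str
    risk: float  # 0.0 (safe) to 1.0 (dangerous), as scored by the agent


@dataclass
class ChangeGate:
    auto_approve_below: float = 0.3
    audit_log: list[str] = field(default_factory=list)

    def review(
        self,
        change: ProposedChange,
        human_approves: Callable[[ProposedChange], bool],
    ) -> bool:
        """Approve the change only if it is low-risk or a human signs off."""
        if change.risk < self.auto_approve_below:
            decision, approved = "auto-approved", True
        else:
            approved = human_approves(change)
            decision = "human-approved" if approved else "human-rejected"
        self.audit_log.append(
            f"{datetime.now(timezone.utc).isoformat()} {decision}: {change.description}"
        )
        return approved


if __name__ == "__main__":
    gate = ChangeGate()

    def ask_a_human(change: ProposedChange) -> bool:
        # Stand-in for a real approval workflow (ticket, ChatOps prompt, etc.).
        return change.risk < 0.9

    print(gate.review(ProposedChange("restart one unhealthy pod", risk=0.1), ask_a_human))
    print(gate.review(ProposedChange("resize the production database", risk=0.8), ask_a_human))
    print(*gate.audit_log, sep="\n")
```

The point of the audit log is the "responsibility and observability" part: even auto-approved actions leave a trail a human can review after the fact.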
Bi-modal behavior in software systems occurs when identical requests can follow completely different execution paths depending on the system state, creating invisible complexity that only becomes apparent during crises. Common examples include cache hits versus misses, circuit breaker states, and failover systems—each effectively doubling the possible system states and creating interaction patterns that emerge only under specific conditions. This hidden complexity remains invisible during normal operations but becomes critically important during incidents when requests suddenly behave differently. To manage bi-modal behavior effectively, teams should document both paths explicitly, test both modes during chaos experiments, ensure clear observability of operating modes, design for consistent performance when possible, and regularly exercise alternative paths during normal operations rather than leaving them as untested emergency routes.
Read the post on LinkedIn
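To make bi-modal behavior concrete, here is a minimal Python sketch (my own illustration, not code from the post). The same get_profile() call can come back through a cache hit, a cache miss that reaches the backend, or a circuit-breaker fallback, and returning the active mode keeps each path observable. CircuitBreaker, flaky_backend, and get_profile are hypothetical names.

```python
# A minimal sketch of bi-modal behavior (my own illustration, not code from the
# post). The same call to get_profile() can return via three different paths:
# a cache hit, a cache miss that reaches the backend, or a circuit-breaker
# fallback. Emitting the active "mode" keeps each path observable.
import random
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, forcing the fallback path."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.reset_after:
            # Half-open: let one request probe the backend again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()


cache: dict[str, str] = {}
breaker = CircuitBreaker()


def flaky_backend(user_id: str) -> str:
    if random.random() < 0.3:  # simulate intermittent backend failures
        raise ConnectionError("backend unavailable")
    return f"profile-for-{user_id}"


def get_profile(user_id: str) -> tuple[str, str]:
    """Return (mode, value); logging `mode` makes the operating mode visible."""
    if user_id in cache:
        return "cache-hit", cache[user_id]              # fast path
    if not breaker.allow():
        return "fallback", "stale-or-default-profile"   # breaker open
    try:
        value = flaky_backend(user_id)                   # slow path
        breaker.record(success=True)
        cache[user_id] = value
        return "cache-miss", value
    except ConnectionError:
        breaker.record(success=False)
        return "fallback", "stale-or-default-profile"


if __name__ == "__main__":
    for _ in range(10):
        print(get_profile("alice"))  # identical requests, different modes
```

A chaos experiment that clears the cache or forces the breaker open exercises the alternative paths deliberately, instead of leaving them as untested emergency routes.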
On the similarities between “Glue work” and resilience engineering
"Glue work" in engineering describes vital activities beyond coding—helping junior developers, updating roadmaps, talking with users, tracking tasks, and ensuring alignment—that often go unrecognized despite being critical to project success. This parallels resilience engineering, as both involve invisible work that prevents failures rather than builds new features, both suffer from the "prevention paradox" (when done well, nothing bad happens), and both defy traditional measurement methods. These roles operate at the intersection of human and technical systems, requiring what Laura Maguire calls "Adaptive Choreography"—fluid, dynamic coordination rather than rigid procedures. These activities represent investments in future collaboration and system resilience, creating expertise in both technical and human domains that prevents system failures before they occur.
Read the post on LinkedIn
The breakdown in organizational resilience
A senior engineer stopped raising concerns about potential system failures because "nobody would have listened anyway," representing a critical breakdown in organizational resilience. This wasn't burnout but the collapse of adaptive capacity—the ability to respond to unexpected events and continuously adjust. Management had metrics for everything except their team's willingness to adapt. The engineering environment systematically undermined resilience through unexplained feature changes, accumulating technical debt while "efficiency metrics" looked great, rigid planning processes that prevented addressing early warning signs, and decision-makers isolated from technical realities. This illustrates the fundamental tension between efficiency (which demands standardization) and resilience (which requires flexibility and overcapacity), with organizations unconsciously sacrificing the latter until something breaks catastrophically.
Read the post on LinkedIn
The prevention paradox creates a challenging cycle: when preventive measures successfully stop disasters, their very success makes them appear unnecessary. This was evident with the Y2K bug—after billions spent on prevention resulted in no major incidents, many dismissed the original concerns as overblown rather than recognizing that prevention worked. Our brains struggle with valuing "non-events" and suffer from hindsight bias, making us more likely to conclude "it wasn't really a problem" than "our precautions worked." Organizations can overcome this paradox by documenting the "alternate reality" through simulations, sharing near-miss stories, learning from others' failures, framing prevention costs against potential losses, creating visible milestones, building institutional memory about why protective measures exist, and regularly educating stakeholders on resilience value. The key challenge is making invisible prevention work visible and valued throughout the organization.
Read the post on LinkedIn
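Since our brains struggle to value non-events, a quick expected-value calculation can help frame prevention costs against potential losses. The numbers below are invented purely for illustration:

```python
# A toy expected-value comparison (numbers invented purely for illustration):
# framing a prevention budget against the expected cost of the outage it avoids.
prevention_cost = 50_000        # e.g. a year of failover testing and game days
outage_probability = 0.10       # estimated chance of a major outage this year
outage_cost = 2_000_000         # revenue loss + recovery cost of that outage

expected_loss = outage_probability * outage_cost
print(f"Expected annual loss without prevention: ${expected_loss:>9,.0f}")
print(f"Annual prevention cost:                  ${prevention_cost:>9,.0f}")
print("Prevention pays for itself" if prevention_cost < expected_loss
      else "Re-evaluate the investment")
```

Even rough estimates like these make the "alternate reality" visible enough to defend the budget.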
🎙️ Podcast Episode with Steadybit
Steadybit co-founder Benjamin Wilms sat down with Adrian Hornsby to discuss why some teams struggle to get started with chaos engineering, common social challenges, and the impact that AI will have on reliability testing.
Link to Podcast