Resilience Bites #5 - LinkedIn Rewind (week 10)

Resilience Bites

March 16th, 2025

Hi everyone,

Welcome to my weekly LinkedIn recap!

I'm thrilled by the positive feedback from last week's LinkedIn Rewind newsletter. It seems many of you found it helpful to catch up on content you might have missed in your feed.

Let's dive right into week 10's highlights!

The LinkedIn Roundup

On trade-offs

With Kubernetes approaching its 10-year anniversary, many organizations are reconsidering its use. This mirrors the microservices pendulum swing, where teams rushed to break up monoliths only to realize later that they had traded one set of problems for another. The cycle persists because engineers consistently overestimate the benefits of new solutions and underestimate their future drawbacks: current frustrations feel immediate and concrete, while potential problems with alternatives remain abstract and easy to discount. Breaking the pattern requires three things: acknowledging that every architectural choice involves trade-offs instead of seeking perfection, documenting not just decisions but also their context and reasoning, and embracing balanced approaches that combine strengths from different patterns rather than treating architectural choices as binary.

Synthetic monitoring

Organizations should proactively detect system issues rather than learning about them from user complaints or social media. Synthetic monitoring achieves this by continuously running automated scripts that simulate real user journeys through production environments, actively validating system health from the outside in. Unlike integration tests, which run once per deployment, synthetics operate continuously under changing real-world conditions. They should be aligned with the system's fault isolation boundaries (typically availability zones or regions in cloud environments). For maximum effectiveness, synthetic tests should run from outside the cloud provider itself, use appropriate test frequencies, and implement progressive alerting to prevent false alarms. They should also be combined with real user monitoring, chaos engineering, and load testing to create a comprehensive observability strategy.
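To make the idea concrete, here is a minimal Python sketch of a synthetic check with progressive alerting. It is an illustration under stated assumptions, not a reference implementation: the names `ProbeResult`, `http_probe`, `SyntheticCheck`, and `run_forever`, the alert levels, and the thresholds are all hypothetical choices made for this example. Only the standard library is used, and a real user journey would chain several steps rather than fetch a single URL.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    detail: str = ""


def http_probe(url: str, timeout_s: float = 5.0) -> ProbeResult:
    """One step of a simulated user journey: fetch a URL and expect a 2xx."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
            detail = f"HTTP {resp.status}"
    except Exception as exc:  # timeouts, DNS/TLS errors, HTTP 4xx/5xx
        ok, detail = False, repr(exc)
    return ProbeResult(ok, time.monotonic() - start, detail)


@dataclass
class SyntheticCheck:
    # One check per fault isolation boundary, e.g. one per zone or region.
    name: str
    probe: Callable[[], ProbeResult]
    warn_after: int = 2   # assumed threshold: consecutive failures before a warning
    page_after: int = 4   # assumed threshold: consecutive failures before paging
    consecutive_failures: int = 0

    def run_once(self) -> str:
        """Run the probe and return 'ok', 'observe', 'warn', or 'page'."""
        if self.probe().ok:
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.page_after:
            return "page"
        if self.consecutive_failures >= self.warn_after:
            return "warn"
        return "observe"  # a single blip is recorded but does not alert


def run_forever(checks: List[SyntheticCheck], interval_s: float = 60.0,
                notify: Callable[[str], None] = print) -> None:
    """Run every check continuously, escalating only on repeated failures."""
    while True:
        for check in checks:
            level = check.run_once()
            if level in ("warn", "page"):
                notify(f"[{level}] {check.name} is failing")
        time.sleep(interval_s)
```

A check could be created with something like `SyntheticCheck("checkout@zone-a", lambda: http_probe("https://example.com/health"))`, where both the name and the URL are placeholders. Requiring several consecutive failures before escalating is one simple way to approximate progressive alerting. To follow the advice about running outside the cloud provider, the script itself would need to be hosted somewhere independent of the system it probes.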

On maintainability

Resilient systems aren't defined by their ability to prevent failures but by how quickly and predictably they recover when things go wrong. Like the Jeep—beloved for its maintainability rather than perfection—software systems should be designed with recovery in mind from the beginning. This requires simplifying recovery paths with minimal dependencies, deploying with easy rollback capabilities, minimizing startup time for components, regularly practicing incident response procedures, and optimizing for observability so teams can quickly identify abnormal conditions. The key insight is that maintainability isn't optional but essential—it's what enables systems to recover rapidly from inevitable failures, making it perhaps the most overlooked yet critical aspect of building truly resilient systems.

The cost of outages

Major UK banks experienced 803 hours of technology outages over two years, costing them millions in customer compensation and damaging their reputation. Despite these consequences, many organizations still take a reactive approach to system resilience, waiting for failures before applying hasty fixes. Proactive resilience strategies—including chaos experiments, operational readiness reviews, comprehensive incident response plans, and designing systems with built-in redundancy—are far more effective and ultimately less expensive than the reactive model. The true cost of outages extends beyond direct compensation to include lost business, damaged customer trust, and increased regulatory scrutiny, making resilience investment not just beneficial but essential, particularly for financial institutions with complex interconnected systems.

Security chaos engineering

Security and resilience are increasingly recognized as interconnected disciplines rather than separate concerns, particularly in financial services. Security Chaos Engineering represents this shift from purely preventative cybersecurity to comprehensive cyber resilience: designing systems that can withstand, recover from, and adapt to both accidental and malicious disruptions. Many resilience principles apply directly to security: emergent system behaviors matter more than component-level protections, default configurations create vulnerabilities in both domains, and the normalization of deviance affects security practices just as it does operational ones. Organizations face the same barriers when implementing cyber resilience as they do with general resilience. This suggests that these artificially separated disciplines should be integrated, since both ultimately ensure that systems maintain functionality and protect value when things go wrong.

"I don't think we need chaos engineering"

Saying "I don't think we need chaos engineering" likely means missing out on reduced downtime costs, better incident response capabilities, and improved system resilience. Without it, organizations forfeit opportunities to increase customer satisfaction, build confidence in their systems, and recover faster during incidents. Chaos engineering reveals unknown issues, tests monitoring and alerting systems, exposes hidden dependencies and single points of failure, and helps teams develop adaptive capacity. It improves system design, enhances post-incident analysis, strengthens communication during crises, and reduces cognitive biases in planning. Most critically, it helps prevent catastrophic failures that can severely damage both finances and reputation. These benefits often become apparent only after an organization has experienced the consequences of going without it.
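For readers new to the practice, a chaos experiment can be sketched in a few lines of Python: state a steady-state hypothesis, inject a fault into a dependency, and check whether the hypothesis still holds. Everything below is a toy illustration with hypothetical names (`InjectedFault`, `with_fault`, `get_recommendations`, `run_experiment`). Real experiments target real infrastructure with a controlled blast radius, which this sketch does not attempt to model.

```python
import random
from typing import Callable, List


class InjectedFault(Exception):
    """Stands in for a real dependency failure during the experiment."""


def with_fault(call: Callable[[], List[str]], failure_rate: float,
               rng: random.Random) -> Callable[[], List[str]]:
    """Wrap a dependency call so that it fails with the given probability."""
    def wrapped() -> List[str]:
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call()
    return wrapped


def get_recommendations(fetch: Callable[[], List[str]]) -> List[str]:
    """Hypothetical code under test: degrade to an empty list on failure."""
    try:
        return fetch()
    except Exception:
        return []  # fallback keeps the page rendering without recommendations


def run_experiment(trials: int, failure_rate: float, seed: int = 0) -> float:
    """Return the fraction of requests served while the fault is injected."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_dependency = with_fault(lambda: ["a", "b"], failure_rate, rng)
    served = 0
    for _ in range(trials):
        try:
            get_recommendations(flaky_dependency)
            served += 1
        except Exception:
            pass  # an unhandled failure counts against the hypothesis
    return served / trials
```

The steady-state hypothesis here is that every request is still served even when the dependency fails every time, so `run_experiment(100, 1.0)` should return `1.0`. If the fallback in `get_recommendations` were missing, the same experiment would expose the hidden dependency as a single point of failure.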

🎙️ New Amer's Podcast Episode

The episode explores how system resilience requires more than just technical solutions, focusing on three critical pillars: culture, tools, and processes. It examines how AI is creating new resilience challenges, the relationship between security and resilience, and why the "prevention paradox" makes justifying resilience investments difficult. The key insight is that most organizations mistakenly attempt to solve resilience problems through tools alone, overlooking the essential cultural and procedural elements that create truly resilient systems.

Adrian Hornsby

adhorn.me