Resilience Bites #1 - What the Internet said last month!


Welcome to the first edition of Resilience Bites!

For years, I’ve been fascinated by resilience engineering. I spend a good chunk of my time exploring the vast corners of the internet, learning, and staying updated on the latest in the field. I thought I’d share some of the best things I find with you.

In this issue of Resilience Bites, I’ve pulled together some of the most interesting things the internet had to say about resilience last month.

I’m also planning to dive deeper into specific topics and answer questions you send my way. This is a work in progress, so feel free to help steer the ship in a direction that’s most helpful for you.

There’s a lot to explore, from launches to lessons learned from outages, and I’d like to share a few personal favorites.

First up is the launch of the Resilience in Software Foundation. It’s interesting to see a dedicated space for the community to grow and share knowledge about resilience engineering. I’m curious to see how it evolves and where it goes, though I do wonder how the price tag might affect participation.

On a more practical note, Simon Hanmer’s blog post about chaos engineering with AWS Fault Injection Service (FIS) is a great starting point if you’re looking to get hands-on with FIS.

The ChatGPT outage postmortem is another highlight. It’s a good reminder that even the best systems can have bad days, and what really matters is how we learn from those moments. I also enjoyed reading some personal reflections from folks like Adam Rogers and Joe Fabisevich. They bring a human perspective to the challenges we face in software development.

For those working with AWS, there are some helpful posts about monitoring Amazon EBS volume latency with Amazon CloudWatch and improving reliability with Amazon DynamoDB. If you’re curious about what’s new, Amazon Aurora DSQL is an exciting development in the world of databases, offering a serverless option that’s designed for high availability.

The deep dive into Celery tasks resilience by Mathias Millet was a standout for me, as it brought back memories of honing my skills with that technology years ago. It’s always exciting to see how the tools I’ve worked with continue to evolve.

Finally, there are plenty of upcoming conferences and events that you won’t want to miss. Whether you’re into chaos engineering, site reliability, or just want to keep up with the latest in resilience, there’s something for everyone.

I hope you find these highlights useful and maybe even a bit inspiring.

Happy reading!

Must-check highlights

  • Podcast: Episode 5 of This is Fine! - Curating Your Resilience Engineering 101 - Getting started with resilience engineering.

  • Launch of the Resilience in Software Foundation - News post

  • Postmortem: ChatGPT outage - High error rates for ChatGPT, APIs, and Sora

  • Blog: Chaos in the Cloud - An Introduction to Chaos Engineering and Amazon's Fault Injection Service - from Simon Hanmer.

  • “Right. The years and billions of dollars spent preparing are why Y2K didn’t “live up to the hype.” They *fixed* it. Before it happened. Which is good. Yes.” - Adam Rogers

  • “I really question how good the software we're writing is every time I watch my mother in law have to stop using an app because of a bad bug. Software is hard, but if the best we've got is fixing problems by force quitting an app or rebooting a device, then we're really not doing our jobs well.” - Joe Fabisevich

  • “In 1994, a math professor discovered that Intel's Pentium chip sometimes gave the wrong answer when dividing. Fixing this "FDIV" bug cost Intel $475 million. I analyzed the Pentium chip and found the bug. 1/N” - Ken Shirriff

  • AWS Blog: Understanding and monitoring latency for Amazon EBS volumes using Amazon CloudWatch

  • AWS Blog: Enhance the reliability of airlines’ mission-critical baggage handling using Amazon DynamoDB

  • Blog: Quick takes on the recent OpenAI public incident write-up by Lorin Hochstein

  • AWS Builder's Library: Resilience lessons from the lunch rush by Mike Haken

  • InfoQ - Key Trends from 2024: Cell-based Architecture, DORA & SPACE, LLM & SLM, Cloud Databases and Portals

  • InfoQ - Presentation - How to Architect Software for a Greener Future by Sara Bergman at QCon London

  • InfoQ - Presentation - Empirical Observations on the Future of Scalable UI Architecture by Willian Martins at InfoQ Dev Summit Boston

  • Blog: A Deep Dive into Celery Task Resilience, Beyond Basic Retries by Mathias Millet

  • Blog: TDD and the Zero-Defects Myth by Marco Cecconi

  • Report: 2024 Accelerate State of DevOps Report Shows Pros and Cons of AI

  • Blog: Snapshot Isolation vs Serializability by Marc Brooker

  • Blog: Whither dashboard design? by Lorin Hochstein


Video of the week

Amazon Aurora DSQL is the new serverless distributed SQL database from AWS, launched at re:Invent 2024. It uses an active-active distributed architecture with strong data consistency and is designed for 99.99% single-Region and 99.999% multi-Region availability.

If you want to learn more about Amazon Aurora DSQL, these accompanying blog posts from Marc Brooker are well worth the read.
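
To put the availability targets quoted above in perspective, here is a minimal sketch that converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration; this is plain arithmetic and not how AWS measures or reports availability.

```python
# Assumption: a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the year, expressed in minutes.
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# The two design targets mentioned in the text.
for target in (99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Under these assumptions, 99.99% leaves a budget of a little under an hour of downtime per year, while 99.999% leaves only a few minutes.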


Upcoming Conferences


Interesting jobs

  • AWS - Sr. Technical Program Manager, AWS Reliability Services (my team). Job link

  • AWS - Senior Research Scientist, AWS Incident Tooling & Response. Job link

  • AWS - Software Development Manager, AWS Incident Tooling & Response. Job link

  • Capital One - Director, Chief of Staff - Resilience Engineering. Job link

    Please send your job posting for inclusion in the next newsletter.