Resilience Bites #1 - What the Internet said last month!
Welcome to the first edition of Resilience Bites!
For years, I’ve been fascinated by resilience engineering. I spend a good chunk of my time exploring the vast corners of the internet, learning, and staying updated on the latest in the field. I thought I’d share some of the best things I find with you.
In this issue of Resilience Bites, I’ve pulled together some of the most interesting things the internet had to say about resilience last month.
I’m also planning to dive deeper into specific topics and answer questions you send my way. This is a work in progress, so feel free to help steer the ship in a direction that’s most helpful for you.
There’s a lot to explore, from launches to lessons learned from outages, and I’d like to share a few personal favorites.
First up is the launch of the Resilience in Software Foundation. It’s interesting to see a dedicated space for the community to grow and share knowledge about resilience engineering. I’m curious to see how it evolves and where it goes, though I do wonder how the price tag might affect participation.
On a more practical note, Simon Hanmer’s blog about chaos engineering with AWS Fault Injection Service (FIS) is a great starting point if you’re looking to get hands-on with FIS.
The ChatGPT outage postmortem is another highlight. It’s a good reminder that even the best systems can have bad days, and what really matters is how we learn from those moments. I also enjoyed reading some personal reflections from folks like Adam Rogers and Joe Fabisevich. They bring a human perspective to the challenges we face in software development.
For those working with AWS, there are some helpful posts about monitoring latency with Amazon CloudWatch and improving reliability with Amazon DynamoDB. If you’re curious about what’s new, Amazon Aurora DSQL is an exciting development in the world of databases, offering a serverless option that’s designed for high availability.
The deep dive into Celery task resilience by Mathias Millet was a standout for me, as it brought back memories of honing my skills with that technology years ago. It’s always exciting to see how the tools I’ve worked with continue to evolve.
Finally, there are plenty of upcoming conferences and events that you won’t want to miss. Whether you’re into chaos engineering, site reliability, or just want to keep up with the latest in resilience, there’s something for everyone.
I hope you find these highlights useful and maybe even a bit inspiring.
Happy reading!
Must-check highlights
Podcast: Episode 5 of This is Fine! - Curating Your Resilience Engineering 101 - Getting started with resilience engineering.
Launch of the Resilience in Software Foundation - News post
Postmortem: ChatGPT outage - High error rates for ChatGPT, APIs, and Sora
Blog: Chaos in the Cloud - An Introduction to Chaos Engineering and Amazon's Fault Injection Service - from Simon Hanmer.
“Right. The years and billions of dollars spent preparing are why Y2K didn’t “live up to the hype.” They *fixed* it. Before it happened. Which is good. Yes.” - Adam Rogers
“I really question how good the software we're writing is every time I watch my mother in law have to stop using an app because of a bad bug. Software is hard, but if the best we've got is fixing problems by force quitting an app or rebooting a device, then we're really not doing our jobs well.” - Joe Fabisevich
“In 1994, a math professor discovered that Intel's Pentium chip sometimes gave the wrong answer when dividing. Fixing this "FDIV" bug cost Intel $475 million. I analyzed the Pentium chip and found the bug. 1/N” - Ken Shirriff
AWS Blog: Understanding and monitoring latency for Amazon EBS volumes using Amazon CloudWatch
AWS Blog: Enhance the reliability of airlines’ mission-critical baggage handling using Amazon DynamoDB
Blog: Quick takes on the recent OpenAI public incident write-up by Lorin Hochstein
AWS Builder's Library: Resilience lessons from the lunch rush by Mike Haken
InfoQ - Key Trends from 2024: Cell-based Architecture, DORA & SPACE, LLM & SLM, Cloud Databases and Portals
InfoQ - Presentation - How to Architect Software for a Greener Future by Sara Bergman at QCon London
InfoQ - Presentation - Empirical Observations on the Future of Scalable UI Architecture by Willian Martins at InfoQ Dev Summit Boston
Blog: A Deep Dive into Celery Task Resilience, Beyond Basic Retries by Mathias Millet
Blog: TDD and the Zero-Defects Myth by Marco Cecconi
Report: 2024 Accelerate State of DevOps Report Shows Pros and Cons of AI
Blog: Snapshot Isolation vs Serializability by Marc Brooker
Blog: Whither dashboard design? by Lorin Hochstein
Video of the week
Amazon Aurora DSQL is the new serverless distributed SQL database from AWS, launched at re:Invent 2024. It uses an active-active distributed architecture with strong data consistency, and is designed for 99.99% single-Region and 99.999% multi-Region availability.
If you want to learn more about Amazon Aurora DSQL, these accompanying blog posts from Marc Brooker are well worth the read.
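To put those availability figures in perspective, here is a rough back-of-the-envelope sketch that converts each design target into a yearly downtime budget. It assumes a 365-day year and treats the percentages purely as arithmetic targets; the function name is my own, and nothing here comes from the Aurora DSQL documentation.

```python
# Illustrative arithmetic only: converts an availability target into the
# maximum downtime it would allow over a 365-day year (an assumption).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


# The two design targets quoted above.
for label, target in [("single-Region", 99.99), ("multi-Region", 99.999)]:
    minutes = allowed_downtime_minutes(target)
    print(f"{label}: {target}% -> about {minutes:.1f} minutes per year")
```

Under these assumptions, the single-Region target works out to roughly 53 minutes of downtime per year and the multi-Region target to a little over 5 minutes, which shows how much each extra nine tightens the budget.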
Upcoming Conferences
Chaos Carnival 2025 https://www.papercall.io/chaoscarnival2025
Conf42.com Chaos Engineering 2025 https://www.papercall.io/conf42-chaos-engineering-2025
SREday New York City 2025 Q1 https://www.papercall.io/sreday-2025-nyc-q1
SREday London 2025 Q1 https://www.papercall.io/sreday-2025-london-q1
SREday San Francisco 2025 Q2 https://www.papercall.io/sreday-2025-san-francisco-q2
Conf42.com Site Reliability Engineering (SRE) 2025 https://www.papercall.io/conf42-site-reliability-engineering-sre-2025
Regional Scrum Gathering Dhaka 2025 https://www.papercall.io/dhakaregional
DevOpsDays Medellin 2025 https://www.papercall.io/dodmde2025
SREday Cologne 2025 Q2 https://www.papercall.io/sreday-2025-cologne-q2
Conf42.com Incident Management 2025 https://www.papercall.io/conf42-incident-management-2025
Interesting jobs
AWS - Sr. Technical Program Manager, AWS Reliability Services (my team). Job link
AWS - Senior Research Scientist, AWS Incident Tooling & Response. Job link
AWS - Software Development Manager, AWS Incident Tooling & Response. Job link
Capital One - Director, Chief of Staff - Resilience Engineering. Job link
Please send your job posting for inclusion in the next newsletter.