Building Cloud Resilience: Lessons from the AWS Outage

Date 30 June 2026

Author Team Capacitas

When AWS stumbled this week, much of the internet stumbled with it. Apps froze. Payments failed. Services timed out. For most people, it was an inconvenience. For the businesses that run on AWS, it was a high-stakes reminder of how fragile digital scale can be.

Even the most mature cloud platforms are not immune to failure. When something as fundamental as DNS falters, it can ripple across thousands of systems in seconds. The question is not whether it will happen again — it is how prepared you are when it does.

Complexity and Concentration

The modern cloud ecosystem is a feat of engineering and efficiency — but it is also built on shared dependencies. And that comes with both challenges and benefits.

A single region, a single misconfiguration, or a single provider can suddenly become everyone’s problem.

On the other hand, depending on a technology provider with vast expertise at its disposal can mean outages are less common and are resolved faster. During this outage, AWS’s engineers identified and fixed the issue quickly, and fixed it for everyone.

As Thomas Barns, Head of Service Design at Capacitas, put it:

“It is a lot easier when it is AWS’s engineers fixing it for you. If it’s your own data centre, that is a full-scale panic. But when half the world goes down together and comes back up quickly, the perception is different.”

The event is a reminder that the cloud gives resilience — but also centralises risk. Outsourcing infrastructure does not outsource responsibility. Every organisation still needs to understand how much downtime they can tolerate, what “good enough” looks like during disruption, and what they will do when the upstream fails.
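
One way to make "how much downtime can we tolerate" concrete is to turn an availability target into a downtime budget. The sketch below is a minimal illustration; the function name and the example targets are assumptions chosen for this example, not figures from the outage or from any provider's SLA.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime permitted in a period at a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is the downtime budget.
    return total_minutes * (100 - availability_percent) / 100


# Illustrative targets only; choose the ones that match your own risk appetite.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes, so a single multi-hour upstream incident can use up several months of budget at once.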

Resilience Is an Investment, Not a Cost 

True optimisation is not about stripping systems down to the bare minimum. It is about balancing cost, performance, and resilience so that efficiency never comes at the expense of continuity.
That balance takes planning — modelling demand patterns, testing failover capacity, and validating performance under load.

As Barns explains:

“You can’t prepare for every scenario, but you can plan for the ones that matter. Sometimes the answer is not to keep everything running — it is knowing which parts must stay alive, even if that means scaling back functionality for a few hours.”

For example, he notes how Monzo Bank keeps basic payment functionality live even if other services fail — proof that resilience isn’t binary. It’s about clarity, not perfection.
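
The idea of deciding in advance which parts must stay alive can be sketched as a simple degradation plan. This is a hypothetical illustration, not a description of how Monzo or any other company implements it; the feature names, dependency names, and the `plan_degraded_service` helper are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    critical: bool          # must stay alive during a disruption
    depends_on: frozenset   # names of upstream dependencies (hypothetical)


def plan_degraded_service(features, unhealthy):
    """Sort features into those to keep serving, those to switch off, and those at risk."""
    plan = {"serve": [], "shed": [], "at_risk": []}
    for feature in features:
        if feature.depends_on.isdisjoint(unhealthy):
            # No failed dependency: keep serving as normal.
            plan["serve"].append(feature.name)
        elif feature.critical:
            # A critical feature has lost a dependency: this needs a fallback or a human.
            plan["at_risk"].append(feature.name)
        else:
            # A non-critical feature has lost a dependency: scale it back deliberately.
            plan["shed"].append(feature.name)
    return plan


features = [
    Feature("payments", critical=True, depends_on=frozenset({"core-ledger"})),
    Feature("spending-insights", critical=False, depends_on=frozenset({"analytics"})),
    Feature("statement-export", critical=False, depends_on=frozenset({"object-storage"})),
]

print(plan_degraded_service(features, unhealthy={"analytics"}))
```

Anything that lands in the "at_risk" list is where planning effort pays off most, because it marks a critical feature that currently has no way to survive the loss of a dependency.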

When you treat resilience as part of your cost model, not as a bolt-on, you move from firefighting to foresight.

A Built-In Test of Your Systems

While most teams breathed a sigh of relief once AWS recovered, Barns calls the incident a “free resilience test”:

“This is the perfect time to look at your systems and ask what broke and why. You have just done some chaos testing by accident — so use the data. Did your alerts fire? Did your communication tools still work? Did you know what to do when they didn’t?”

For some, the only change might be ensuring alerts come from a separate region. For others, it might trigger a deeper rethink of architecture, risk appetite, and recovery time objectives.
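
Checking whether alerts share fate with the workload they watch can be as simple as comparing two lists of regions. The sketch below assumes you can enumerate where your workloads and your alerting run; the helper name `blind_spot_regions` and the region choices in the example are hypothetical and do not refer to the regions involved in the outage.

```python
def blind_spot_regions(workload_regions, alerting_regions):
    """Return workload regions whose failure alone would also silence all alerting."""
    workload = set(workload_regions)
    alerting = set(alerting_regions)
    # A region is a blind spot if no alerting would remain once that region is lost.
    # With no alerting regions at all, every workload region is a blind spot.
    return sorted(region for region in workload if not (alerting - {region}))


# Alerting runs only where the workload runs: losing that region hides its own failure.
print(blind_spot_regions({"us-east-1", "eu-west-1"}, {"us-east-1"}))

# Alerting also runs in a second region: no single region failure silences it.
print(blind_spot_regions({"us-east-1", "eu-west-1"}, {"us-east-1", "eu-west-2"}))
```

The same comparison extends to other shared dependencies, such as the chat or paging tool the on-call team relies on.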

The lesson? Do not waste the outage. The next few days of reflection can make the difference between a future crisis and a controlled recovery.

The Human Side of Recovery

Behind every outage are engineers triaging alerts, analysts piecing together root causes, and customer teams fielding questions they cannot yet answer. Resilience isn’t just about architecture; it’s also cultural.

Teams that plan, practise, and communicate recover faster because they already know what “good” looks like before things go wrong. Preparation turns chaos into process.

As Barns adds:

“Most of what your people have to do in a situation like this is not fixing the system — AWS’s engineers are already doing that. It is communication, reassurance, and calm coordination. That is what keeps users confident and businesses steady.”

Three Takeaways for the Week After an Outage

  • Be prepared, not immune.
    Outages happen — even to AWS. The goal is not elimination; it’s preparedness. Define what continuity means for your business.
  • Use this as a learning opportunity.
    Review what worked and what didn’t. Your alerting, communication, and recovery processes just had a live-fire test. Capture those lessons while they’re fresh.
  • Match risk to architecture.
    Know which regions you rely on, what your failover capabilities are, and whether they align with your risk appetite. The cheapest or default option might not be the safest one.

Learning from the Latest Outage

Events like this week’s AWS disruption highlight one clear truth: resilience must be designed, not assumed.

At Capacitas, we help enterprises engineer that resilience — building systems that perform predictably under pressure, scale efficiently, and recover fast when things go wrong.

Outages will always happen. But with the right data, design, and discipline, they do not have to become disasters.

