
Dealing with global internet outages


Date 29 June 2026

Author Team Capacitas

Estimated read time: 3 Minutes

Author: Dr. Manzoor Mohammed


Global cloud outages are hard to deal with because the cause is often outside your control. What makes the difference is how you plan for these events and how you react to them.

On Tuesday, 8 June 2021, Fastly (one of the top CDN providers worldwide) suffered a systems fault. It affected websites all over the internet for longer than an hour, during which users were unable to access their favourite websites. The issue was reportedly caused by an undiscovered software bug triggered by a valid customer configuration change.

According to Nick Rockwell, SVP of Engineering and Infrastructure, Fastly recovered quickly (within 49 minutes). Unfortunately, its customers took a little longer to recover fully.

PayPal, for example, took a further 40 minutes to recover after Fastly restored the service. This extra recovery time would have cost approximately $106M. The BBC, by contrast, recovered more quickly, and not just because it had an alternative CDN provider.

Here’s what they probably did and what the other affected parts of the internet could have done to prepare:

  1. Detect the problem quickly
  2. Have an alternative cloud provider (if possible)
  3. Prepare a resilience plan and execute it
  4. Provision sufficient capacity to deal with the recovery (e.g. a flood of requests for non-cached content)

1. Detect the problem quickly

This is where real-time monitoring and alerting are critical. In this case the outage would have been obvious, but in many others the early warning signs that precede an outage can be missed by threshold-based alerts.

One client I worked with had 250,000+ dashboards. Despite this extensive coverage, they often missed the early warning signs of incidents. Many tools now offer more sophisticated alerting to mitigate this, but even these carry a risk: if they are not tuned appropriately, the result is false alerts, missed early warning signs, or both.
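To illustrate the difference between a fixed threshold and a baseline-aware alert, here is a minimal Python sketch. It is an illustration only, not how any particular monitoring tool works: the names `static_alert` and `BaselineDetector`, the 5% threshold, the window size and the three-sigma rule are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def static_alert(error_rate: float, threshold: float = 0.05) -> bool:
    """Fixed-threshold alert: fires only once the error rate is already high."""
    return error_rate >= threshold


class BaselineDetector:
    """Flags samples that deviate sharply from a rolling baseline (hypothetical helper)."""

    def __init__(self, window: int = 30, sigmas: float = 3.0, min_samples: int = 10):
        self.history = deque(maxlen=window)  # most recent samples only
        self.sigmas = sigmas
        self.min_samples = min_samples

    def observe(self, error_rate: float) -> bool:
        """Return True if this sample looks anomalous, then record it."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            # Floor the spread so a perfectly flat history does not alert on noise.
            spread = max(stdev(self.history), 1e-4)
            anomalous = error_rate > baseline + self.sigmas * spread
        self.history.append(error_rate)
        return anomalous


if __name__ == "__main__":
    detector = BaselineDetector()
    for sample in [0.0020, 0.0022] * 10:  # steady error rate of about 0.2%
        detector.observe(sample)
    early_sign = 0.01  # error rate jumps to 1%, still well below the 5% threshold
    print("static alert fires:  ", static_alert(early_sign))
    print("baseline alert fires:", detector.observe(early_sign))
```

In this sketch a jump from roughly 0.2% to 1% errors is a fivefold increase that the fixed threshold ignores but the baseline check catches. The tuning risk mentioned above still applies: set `sigmas` too low and you get false alerts, too high and you miss the early signs. A real detector would also need to handle seasonality and avoid folding anomalous samples into its own baseline.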


2. Have an alternative cloud provider

Don't put all your eggs in one basket.

- Old proverb


This proverb is especially true when you are running business-critical systems. Companies such as the BBC had an alternative CDN provider to which they could switch.

Having said this, it is not the most practical or cost-effective option for everyone. Adding this level of complexity to your tech stack should be a deliberate business decision.


3. Prepare a resilience plan and execute it

Simply having an alternative cloud provider doesn't mean you will automatically switch to it in the event of the primary provider's failure. If it is not automatic, you need a tried and tested plan for how to switch to your alternative provider.

Even if it is automatic, you still need to test it and make sure it will work as expected. There is nothing like a bit of chaos engineering to give you confidence.
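As a rough illustration of what "tried and tested" can mean, the Python sketch below models an automatic switch and a small failure drill. Everything in it is hypothetical: the `CdnFailover` class, the provider names, the three-failure rule and the immediate fail-back are assumptions made for the example, and a real setup would act on DNS or load balancer configuration rather than an in-memory flag.

```python
from typing import Callable, Dict, List


class CdnFailover:
    """Switches to a secondary provider after repeated primary health-check failures."""

    def __init__(self, probes: Dict[str, Callable[[], bool]],
                 primary: str, secondary: str, max_failures: int = 3):
        self.probes = probes              # provider name -> health-check function
        self.primary = primary
        self.secondary = secondary
        self.max_failures = max_failures  # consecutive failures before switching
        self.failures = 0
        self.active = primary

    def tick(self) -> str:
        """Run one health-check cycle and return the provider that should serve traffic."""
        if self.probes[self.primary]():
            self.failures = 0
            self.active = self.primary    # simplification: fail back immediately
        else:
            self.failures += 1
            # Only switch if the secondary is itself healthy.
            if self.failures >= self.max_failures and self.probes[self.secondary]():
                self.active = self.secondary
        return self.active


def run_drill() -> List[str]:
    """Inject a simulated primary outage and record which provider is active."""
    state = {"primary_up": True}
    probes = {
        "primary-cdn": lambda: state["primary_up"],
        "secondary-cdn": lambda: True,
    }
    failover = CdnFailover(probes, "primary-cdn", "secondary-cdn")
    timeline = [failover.tick()]
    state["primary_up"] = False           # inject the fault
    timeline += [failover.tick() for _ in range(3)]
    state["primary_up"] = True            # the primary recovers
    timeline.append(failover.tick())
    return timeline


if __name__ == "__main__":
    print(run_drill())
```

The drill is the important part. It turns "we think failover works" into a repeatable check that the switch happens after the expected number of failed probes and that traffic returns once the primary is healthy. In practice you would probably delay the fail-back to avoid flapping between providers.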


4. Leave sufficient capacity to deal with the recovery

Whether or not you have an alternative provider, you still need to consider the capacity of your platform during a CDN failure and the increased demand after recovery. During the failure and recovery period, requests that would normally have been served from the edge go to the origin instead.

Recovery only goes smoothly if the origin has sufficient capacity to support this non-cached traffic. That may be one explanation for why companies such as PayPal had recovery times lasting far longer than the outage window.
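A back-of-the-envelope calculation shows how large this effect can be. The Python sketch below assumes a simple model in which the origin only sees cache misses; the function names and all of the figures (10,000 requests per second, a 95% cache hit ratio, 2,000 requests per second of origin capacity) are invented for illustration.

```python
def origin_rps(total_rps: float, cache_hit_ratio: float) -> float:
    """Requests per second reaching the origin, assuming it only serves cache misses."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_rps * (1.0 - cache_hit_ratio)


def recovery_capacity_check(total_rps: float, cache_hit_ratio: float,
                            origin_capacity_rps: float) -> dict:
    """Compare normal origin load with the worst case of a cold or bypassed cache."""
    normal = origin_rps(total_rps, cache_hit_ratio)
    worst_case = origin_rps(total_rps, 0.0)  # nothing is served from the edge
    return {
        "normal_origin_rps": normal,
        "worst_case_origin_rps": worst_case,
        "shortfall_rps": max(0.0, worst_case - origin_capacity_rps),
    }


if __name__ == "__main__":
    # Hypothetical figures, not measurements from any real site.
    result = recovery_capacity_check(total_rps=10_000,
                                     cache_hit_ratio=0.95,
                                     origin_capacity_rps=2_000)
    for name, value in result.items():
        print(f"{name}: {value:,.0f}")
```

With these made-up numbers the origin normally handles about 500 requests per second, but faces the full 10,000 when the cache is cold or bypassed: a twentyfold increase against capacity sized at four times the normal load. Retries from users and clients can push real demand higher still, so treat this as a lower bound.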

About the author

Team Capacitas

Capacitas is a cloud and AI value partner. We translate rapid technological change into enduring commercial advantage by converting every unit of compute into enterprise value.
