New Webinar: Modernising Without Destabilising: How Bread Financial Is Building Confidence Through Change

Learn more

New webinar with Bread Financial

Learn more
Contact us

Blogs

Improve the visibility of your systems in 3 steps

<span id="hs_cos_wrapper_name" class="hs_cos_wrapper hs_cos_wrapper_meta_field hs_cos_wrapper_type_text" style="" data-hs-cos-general-type="meta_field" data-hs-cos-type="text" >Improve the visibility of your systems in 3 steps</span>

Date 29 June 2026

Author Team Capacitas

LI Button-1 Twitter Facebook-share-icon

Teams are always looking to get better visibility of their systems. The first step on this journey is to invest in APM tools and/or build in-house monitoring tools to gather data. But we often find that despite making significant investments, CIOs and their teams feel they didn’t get the visibility they expected.

How should they fix this? We recommend this simple three-step plan:

  1. Focus on key metrics
  2. Move away from thresholds
  3. Don't explain away

 

1. Focus on key metrics

In one major Telco organisation, they had 250,000 dashboards, yet service incidents continued and costs kept on rising. The teams were being drowned in data and alerts. It was difficult to see what the problems were and really understand the system.

The key is to focus on key metrics. For a start, many of your metrics will simply be telling you the same story. Why look at hundreds of metrics when you can just look at one and it tells you everything you need to know?

Other metrics may be useful for deep dive root cause analysis, but you don’t need them in order to understand the problem.

 

2. Move away from thresholds

Thresholds are reactive, they don't give enough of an early warning sign before incidents happen. To counteract this ops teams have to lower the thresholds, in order to give more time to react. The challenge here is that the number of alerts starts increasing leading to alerts overload, as it’s impossible to tell which alerts really matter.

There’s an increasing availability of Analytic techniques to look for changes from normal behaviour. These have evolved as AI/ML techniques. AWS, new relic provide these features. They are a huge improvement from threshold alerting but still suffer from too many alerts due to the number of metrics and getting the tuning of the sensitivity of these alerts right.

 

3. Don’t explain away

Even where you have a focus on key metrics and use of advanced analytic techniques, early warning signs can still be ignored or missed. While working with a major SAAS provider analytics showed that the response times were increasing in variance over time.

The engineers saw it as only being a few milliseconds and thought it was due to changes in the user behaviour or workload. This seemed to explain the problem and it was quickly forgotten. Several months down the line the service experienced an outage and prolonged service degradation that lasted nearly a week. The early warning signs had been missed.

Observability is to interpret what is happening underneath the covers. If the behaviour isn’t as expected don’t jump to conclusions too quickly. Instead, use simple models to test out your theory (see next).

 

Predict – what will happen next?

By this stage you are collecting the right metrics, you have moved away from simplistic thresholds, and are no longer explaining away anomalies. But you can't just investigate every anomaly. This is where you need to use smart modelling techniques to predict what will happen next and assess its seriousness.

Even using a simple M/M/N response time chart and comparing your response time metrics to that can be incredibly powerful in assessing the level of risk associated with the anomaly. If it's deviated very far from normal behaviour you can start doing a deeper dive using more metrics and traces from your APM/monitoring tools.

After the deep dive you can start asking insightful questions on how can you improve both systems and teams' working practices. You’re now using data to give you early warning signs to prevent incidents, and help improve the organisation. You can now see the full picture and have true visibility.

 


Schedule a Cloud Opportunity and Risk Assessment Call

View Free, Relevant Capacitas Insights

Whether you’re looking to optimise costs, improve agility or drive value creation, our expert insights can help you. Ready to start?

Explore Capacitas Insights

 

Team Capacitas
About the author

Team Capacitas

Capacitas is a cloud and AI value partner. We translate rapid technological change into enduring commercial advantage by converting every unit of compute into enterprise value.

FinOps and AI: Building the Financial Discipline for the Next Wave of Enterprise Intelligence

AI FinOps represents an evolution rather than a replacement of traditional FinOps. It extends the model into a domain where financial, technical, and product decisions are tightly interconnected.

Read insight

Confidence Under Load: How We Verified AKS Readiness for Peak

How Capacitas verified AKS readiness for peak demand by validating workload performance, autoscaling, cluster capacity, monitoring, and incident response.

Read insight

Building Cloud Resilience: Lessons from the AWS Outage

Learning from the Latest Outage. Events like this week’s AWS disruption highlight one clear truth: resilience must be designed, not assumed.

Read insight

Bringing Order to Chaos: A Practical Guide to Chaos Testing in the Cloud

In today’s cloud-native environments, resilience is not optional—it’s critical. Chaos testing has emerged as a key practice for validating system behaviour under failure conditions.

Read insight