In conversations with customers and network peers, we hear that many companies are considering setting up a dedicated SRE team or realigning existing responsibilities. According to a report from Catchpoint, 50% of organisations have dedicated SRE teams or roles, and the number of vacancies for Site Reliability Engineers has increased dramatically.
This supports the belief that system reliability, performance, and availability remain among the top drivers for establishing a stronger foundation of SRE practices.
Key drivers for an SRE practice
- The scale and complexity of IT systems are key determinants. As scale and complexity increase, so does exposure to risk.
- Operational risks are not proactively mitigated during development and tend to be resolved reactively.
- The impact of operational failure on the business is substantial in terms of revenue loss and reputation.
- The frequency and severity of production incidents are high. Development teams are spending too much time firefighting. Incident management is not fixing issues properly.
- Service-Level Objectives (SLOs) for high-priority systems either do not exist or are not measured. Actionable insights are not being generated and operational issues are not exposed proactively. SLOs are not actively managed.
- Production monitoring and alerting are poorly configured, which leads to poor insight into performance, availability, and reliability risk. Reporting is very weak. There is little or no observability in test environments.
- Development teams miss chances to improve time to market and are not taking advantage of transformative practices such as automation frameworks, testing frameworks, deployment, and Infrastructure as Code. Releases often overrun and the release cycle is slow.
- Non-functional testing (performance/scalability/efficiency, resilience/recovery, security) is executed poorly if at all, and is not underpinned by testing frameworks.
- Cross-functional collaboration between Service Management, Operations, and Development teams is poor and the benefits of close cooperation are not realised.
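To make the SLO point above concrete, here is a minimal sketch of how an availability SLO and its error budget might be evaluated from request counts. The function name `evaluate_slo`, the `SloReport` fields, and the example figures are illustrative assumptions, not taken from any particular monitoring tool; in practice the counts would come from your own monitoring data.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Hypothetical result container for one SLO evaluation window."""
    sli: float              # measured fraction of good requests in the window
    target: float           # SLO target, e.g. 0.999 for "three nines"
    budget_consumed: float  # share of the error budget used (1.0 = all of it)
    met: bool               # True if the measured SLI meets the target


def evaluate_slo(good_requests: int, total_requests: int, target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    # The error budget is the number of failures the target tolerates.
    allowed_failures = (1 - target) * total_requests
    actual_failures = total_requests - good_requests
    budget_consumed = actual_failures / allowed_failures
    return SloReport(sli, target, budget_consumed, sli >= target)


if __name__ == "__main__":
    # Assumed example window: 1,000,000 requests, 500 of them failed.
    report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, target=0.999)
    print(f"SLI={report.sli:.4%} met={report.met} budget used={report.budget_consumed:.0%}")
```

Tracking the share of error budget consumed, rather than only a pass or fail result, is one way to surface reliability risk before the objective is actually breached.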
If any of these factors describes the operational challenges you are experiencing, it might be time to examine your organisational capability and implement a remediation plan to plug key gaps.
About the Author
Frank Warren
Frank is a Principal Consultant specialising in capacity planning, performance engineering, and cloud cost optimisation. Frank leads numerous high-profile ecommerce clients, helping them achieve their business peaks while saving on cloud costs and improving performance.
It is also worth looking at some of our recent case studies, where we have saved our clients millions of pounds in cloud spend.


