Signs of a Good SRE Practice

<span id="hs_cos_wrapper_name" class="hs_cos_wrapper hs_cos_wrapper_meta_field hs_cos_wrapper_type_text" style="" data-hs-cos-general-type="meta_field" data-hs-cos-type="text" >Signs of a Good SRE Practice</span>

Date 24 August 2023

Author Frank Warren

Site Reliability Engineering (SRE) is an integrated process or practice in Agile development that strives to ensure that operational integrity is maintained by the DevSecOps development process in a scalable way.

SRE is primarily focused on ensuring that Business SLA’s are sustainable and exploit Service Level Objective (SLO) achievement as a meter of targeting effort and defining success. When working well SRE empowers Dev teams to deliver the speed/quality/cost requirements effectively and efficiently through the DevSecOps process. 

But what does a good SRE practice look like?

SRE Practice is not fully mature

Adoption of SRE in organisations is at best patchy. A Dynatrace study in 2022 assessed that over 80% or organisations classed as not fully mature. The opportunities area to promote better investment are rich. 

Dynatrace State of SRE Report

Source: Dynatrace State of SRE Report 2022

Working with the SRE teams of our clients we identified a few common behaviours from not fully mature teams:

Performance, Reliability, Recoverability and Capacity get no focus. These are key areas of SLA/SLO that need to be taken into consideration. 
Integration with cloud services and observability are limited. Core data metrics are not captured nor visible therefore, making operational risk difficult to identify and act upon. Examples of this include cloud metrics for Azure and AWS services which have a high dependency.
Cloud costs are uncontrolled and increasing rapidly.
No automation frameworks are in place to assist Dev teams with the increasing pace of change objectives. Releases frequently overrun impacting time to market.

In our experience, these behaviours can be turned around in a manner of months into a good internal SRE practice that the business will benefit from.

Signs of a good SRE practice

SRE engineers collaborate with developers to deliver stable, resilient, performant, and secure architectures that meet non-requirements set out in SLO’s and SLA’s. Development teams invariably prioritise business change and often spend insufficient time planning for these failure scenarios. SRE bridges this gap from an architectural and testing perspective and exploits lessons learnt from production incidents to fix problems. 
SRE improves time-to-market by eliminating manual tasks. Examples of this include development, deployment, testing and monitoring frameworks. This ensures that consistent practices are delivered repeatedly through common toolsets & automation thus avoiding duplication of error-prone human intervention.
SRE believe that observability is key and provides the capability to visualise operational and SLA performance. Decisions taken are justified by data and use the analysis of this to drive out how the transformation should be implemented. Examples of this include DORA metrics, Incident data, SLO/SLI metrics (Resource and Performance). Immature development teams who tend not to conduct this analysis are encouraged to adopt improved observability practices across all environments. 
SRE often provide testing specialisms & skills that development teams cannot easily acquire. Examples include Performance, Resilience & recoverability, and penetration testing.  
Investment in SRE reduces the organisational silos that have historically arisen between development and Ops teams. The shared responsibility and highly collaborative approach between SRE & development supports a more effective and agile approach to the delivery of change.

About the Author

Frank Warren

Frank is a Principal Consultant specialising in capacity planning, performance engineering and cloud cost optimisation. Frank leads numerous high profile ecommerce clients, helping them achieve their business peaks while savings on cloud costs and improving performance.

If you would like to have a chat about optimising your cloud bill, feel free to reach out for a no commitment chat. You can contact us via the website at https://www.capacitas.co.uk/book-a-diagnostic-session or reach out via email at contact@capacitas.co.uk

Also worth having a look at some of our recent case studies where we have saved our clients Millions of pounds in cloud spend.

About the author

Frank Warren

Frank is a highly experienced Capacity Management and Performance Assurance practitioner who has delivered services to companies across many sectors, including Royal Bank of Scotland, National Westminster Bank, J Sainsburys PLC, London Electricity PLC, and Data Networks PLC.

FinOps and AI: Building the Financial Discipline for the Next Wave of Enterprise Intelligence

AI FinOps represents an evolution rather than a replacement of traditional FinOps. It extends the model into a domain where financial, technical, and product decisions are tightly interconnected.

Read insight

Confidence Under Load: How We Verified AKS Readiness for Peak

How Capacitas verified AKS readiness for peak demand by validating workload performance, autoscaling, cluster capacity, monitoring, and incident response.

Read insight

Building Cloud Resilience: Lessons from the AWS Outage

Learning from the Latest Outage. Events like this week’s AWS disruption highlight one clear truth: resilience must be designed, not assumed.

Read insight

Bringing Order to Chaos: A Practical Guide to Chaos Testing in the Cloud

In today’s cloud-native environments, resilience is not optional—it’s critical. Chaos testing has emerged as a key practice for validating system behaviour under failure conditions.

Read insight

Contact us

Blogs

Signs of a Good SRE Practice

SRE Practice is not fully mature

Signs of a good SRE practice

About the author

Frank Warren

FinOps and AI: Building the Financial Discipline for the Next Wave of Enterprise Intelligence

Confidence Under Load: How We Verified AKS Readiness for Peak

Building Cloud Resilience: Lessons from the AWS Outage

Bringing Order to Chaos: A Practical Guide to Chaos Testing in the Cloud