Why FinOps treats the symptoms, not the cause

Written by Jason Cross | Aug 8, 2024 11:00:00 PM

When it comes to your cloud estate, building and maintaining a cost-efficient culture that continues to deliver is not always easy. It requires ongoing effort and a culture-wide acceptance of your chosen approach. And it is worth it. Just look at the benefits: streamlined operations, available budget for re-investment or development of new services, satisfied customers, and ongoing reliable performance and service delivery.

FinOps has emerged as a guiding light for organisations navigating the complex financial landscape of the public cloud. It is a methodology that promises cost optimisation, improved accountability, and better cloud resource management. And it delivers on those promises – to a degree.

But if you are deeply entrenched in the cloud, you may have noticed something unsettling: FinOps often focuses on treating the symptoms of cloud cost overruns, rather than addressing the root causes.

The FinOps Band-Aid

I almost think of FinOps as a skilled doctor. It is adept at diagnosing the ailment – identifying which services are draining your budget, where inefficiencies lie, and which teams are overspending. It can even prescribe effective treatments: rightsizing instances, reserving capacity, or leveraging spot instances. But just like a doctor who only focuses on managing symptoms, FinOps can miss the underlying disease.

The Missing Pieces

Architectural issues

Typically, issues stem from product performance and resource bloat, both of which affect the confidence of the operations team. However, the real culprit behind many cloud cost overruns is often architectural. Your cloud infrastructure may be inherently costly due to design choices made early on. Perhaps you are relying heavily on expensive managed services. Or perhaps your application architecture is not well suited to the cloud's elasticity, because it came from a lift-and-shift migration out of an on-prem data centre or is simply a poorly designed application.

FinOps tools are fantastic at identifying these issues. But they rarely provide the solutions. That is because architectural optimisation is a different beast altogether. It requires a deeper understanding of architecture, including applications, workloads and the intricacies of cloud architecture. These are skills that are not typically found in solution architects. It is also about leveraging the flexibility of the cloud to your advantage, making cloud service provider (CSP) pricing models work for you to deliver maximum value.

Lack of observability

The only way to know if you have the right level of resources is by having the correct visibility of your systems. How are they being utilised? Do you see underutilisation or is it about right? Do they have a regular usage pattern? Is this usage pattern changing over time? Is it impacted by seasonality?

All these questions can be answered by having the right observability in place. It is not just about considering technical metrics either. A common mistake I see over and over is engineers sizing resources for spikes in CPU utilisation or IOPS, assuming that the spike is a valid workload or that it is driven by business demand. Often, though, there is no correlation between business demand and usage.
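One way to sanity-check whether a spike is business-driven is to compare utilisation against a business metric over the same period. The sketch below is illustrative only: the hourly figures and the names `orders_per_hour` and `cpu_percent` are made up, and a plain Pearson correlation is just one possible test, not a prescribed method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical hourly samples (assumed data, not taken from a real system).
orders_per_hour = [120, 135, 150, 160, 155, 140, 130, 125]  # business demand
cpu_percent = [80, 26, 25, 24, 27, 25, 24, 26]              # CPU utilisation

r = pearson(orders_per_hour, cpu_percent)
print(f"correlation between demand and CPU: {r:.2f}")
```

A coefficient that is not clearly positive suggests the spike is not following demand, which is a prompt to look for another cause before sizing for it.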

For example, you have a workload that generally runs at 25% CPU utilisation, but there is a daily spike to 80% utilisation that lasts for 30 minutes. That daily CPU spike turns out to be a backup job, and your system is currently sized for it. The agreed backup window is three hours, yet the job completes in 30 minutes. That is an opportunity to downsize: average utilisation goes up, the backup job takes longer, and costs come down. The longer backup is OK as long as it still finishes within the window, because we are matching deployed resources to the business requirement. Refusing to downsize because you see 80% utilisation, without understanding the cause of that spike and the business value it adds, is a mistake that leads to increased costs.
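The arithmetic behind that example can be sketched as follows. This is a rough model, not a sizing tool. It assumes CPU is the bottleneck, that the backup's work spreads evenly if given more time, and that the 25% baseline keeps running alongside it. The function name and the 80% target peak are my own assumptions for illustration.

```python
def required_capacity_fraction(baseline_util, spike_util, spike_hours,
                               window_hours, target_peak_util):
    """Fraction of today's capacity needed if the batch job uses its full window.

    Utilisation values are fractions of CURRENT capacity (0.25 means 25%).
    Assumes the job's CPU work can be spread evenly across the window.
    """
    if not 0 <= baseline_util <= spike_util <= 1:
        raise ValueError("expected 0 <= baseline_util <= spike_util <= 1")
    if not 0 < spike_hours <= window_hours:
        raise ValueError("expected 0 < spike_hours <= window_hours")
    if not 0 < target_peak_util <= 1:
        raise ValueError("expected 0 < target_peak_util <= 1")

    # CPU attributable to the batch job, and the total work it performs.
    batch_util = spike_util - baseline_util
    batch_work = batch_util * spike_hours  # in "capacity-hours"

    # The same work spread across the whole permitted window.
    stretched_batch_util = batch_work / window_hours

    # Peak demand after stretching, then scaled so it sits at the target peak.
    peak_demand = baseline_util + stretched_batch_util
    return peak_demand / target_peak_util


# Figures from the example: 25% baseline, 80% spike for 30 minutes,
# a three-hour backup window, and an assumed target peak of 80%.
fraction = required_capacity_fraction(0.25, 0.80, 0.5, 3.0, 0.80)
print(f"capacity needed: {fraction:.0%} of current")  # roughly 43% under these assumptions
```

Under these assumptions the workload would fit on well under half of today's capacity. In practice instance sizes come in fixed steps and backups are often limited by I/O rather than CPU, so treat the result as a starting point for a test, not a final answer.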

The frustrations of a symptom-focused approach

When organisations rely solely on FinOps to manage cloud costs, they often find themselves in a frustrating cycle:

Surprise bill: A monthly cloud bill arrives that is much higher than expected.
FinOps investigation: The FinOps team analyses the bill, identifies the culprits, and implements cost-saving measures.
Temporary relief: Costs decrease for a brief period.
Repeat: The next month brings another surprise bill, and the cycle begins anew.

This reactive approach is exhausting and ultimately unsustainable. It is like constantly putting out fires without ever addressing the faulty wiring that is causing them. Surely, it is better to fix the faulty wiring instead of firefighting all the time. It is also a lot safer.

Moving beyond FinOps: A deeper & more thoughtful approach?

To truly conquer cloud costs, you need a more thoughtful approach that combines the strengths of FinOps with observability and business metrics to provide a complete picture.

Here is what that looks like:

Embrace FinOps: Continue using FinOps tools and methodologies to gain visibility into your cloud spending and identify areas for improvement.
Invest in cloud expertise: Build a team (or partner with experts) who deeply understand cloud architecture and can assess your infrastructure's cost-efficiency. Many organisations have very capable cloud engineers who don’t have time to care about costs. Their focus is on keeping the show on the road and developing new features. They may also have FinOps experts who understand the theory of FinOps and are good with numbers, but who struggle to make an impact because their focus is on reporting symptoms. I believe it is key either to find engineers who have great cloud awareness, understanding how the services work and what they cost, or to have the two groups (Cloud Engineers and FinOps Practitioners) work closely together.
Leverage observability to understand what is happening and act: Conduct periodic reviews of your cloud architecture, looking for opportunities to optimise costs through refactoring, redesign, or better service selection. Remember, the cloud is constantly evolving, and so are the workloads running on it. What was optimal six months ago may not be now.
Prioritise cost efficiency in design: Make cost efficiency a primary design consideration from the outset, at the requirements stage. Choose architectures and services that align with your budget and performance goals. Do not choose the gold service if the bronze service meets the non-functional requirements. Think carefully about your resilience and high availability requirements. What you build should be based on a sensible RPO and RTO. For example, if your RTO is four hours, do you really require a primary node and two secondary nodes in the same region, plus an additional secondary node in another region, in an always-on configuration?
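The last point, building to the requirement instead of to the gold standard, can be expressed as a simple selection rule. The sketch below is a minimal illustration: the topology names, monthly costs and recovery figures are invented placeholders, not real provider pricing or guaranteed recovery times, and `cheapest_meeting` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Topology:
    name: str
    monthly_cost: float  # hypothetical figure, any currency
    rto_hours: float     # time needed to restore service
    rpo_hours: float     # maximum data loss, expressed as time


def cheapest_meeting(options: Sequence[Topology],
                     required_rto_hours: float,
                     required_rpo_hours: float) -> Optional[Topology]:
    """Return the lowest-cost option that satisfies both RTO and RPO, if any."""
    candidates = [
        o for o in options
        if o.rto_hours <= required_rto_hours and o.rpo_hours <= required_rpo_hours
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.monthly_cost)


# Placeholder options; all numbers are assumptions for illustration only.
OPTIONS = [
    Topology("always-on multi-region", 10_000, rto_hours=0.1, rpo_hours=0.0),
    Topology("single region with warm standby", 4_000, rto_hours=1.0, rpo_hours=0.25),
    Topology("backup and restore", 1_500, rto_hours=3.0, rpo_hours=1.0),
]

# With an RTO of four hours and an RPO of one hour, the cheapest option qualifies.
choice = cheapest_meeting(OPTIONS, required_rto_hours=4, required_rpo_hours=1)
print(choice.name if choice else "no option meets the requirement")
```

The point of the rule is the direction of the decision: start from the agreed RTO and RPO and pick the least expensive design that meets them, instead of starting from the most resilient design and paying for headroom nobody asked for.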

The Path to Long-Term Cloud Cost Control

FinOps is undeniably valuable. It provides essential visibility and control over your cloud spending. But it is not a silver bullet. To achieve long-term, sustainable cloud cost control, you need to go deeper. By addressing the root causes of cloud cost overruns through architectural optimisation, you can break the cycle of reactive cost management and build a cloud environment that is both efficient and affordable.

To find out more about our approach to cloud cost optimisation, or to take a deeper dive into how to make it work for your organisation, download our latest whitepaper.
