Why Your Cluster Is Bleeding Money
The dirty secret of Kubernetes economics is that most clusters run at 15-25% real utilization while being billed as though they were pegged. We've audited dozens of production clusters for clients in the last year and the pattern is identical: requests set to double or triple what the pod ever actually uses, a fleet of large on-demand nodes, and a bill that grows faster than headcount. A typical engagement cuts spend by 35-45% in the first month without changing a single line of application code.
This post is the playbook we use. It's ordered the way you should execute it — from the changes that land the biggest wins to the long-tail optimizations that matter once you're past the low-hanging fruit.
Measure Before You Cut
You cannot optimize what you cannot attribute. Before touching a single request, install something that gives you per-namespace and per-workload cost visibility. In 2026 the two options that matter are:
- OpenCost — the CNCF project that grew out of Kubecost. Open source, vendor neutral, integrates with Prometheus.
- Kubecost — the commercial build of the same codebase with hosted options, alerts, and team features.
Either one takes about an hour to deploy with Helm:
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm upgrade --install opencost opencost/opencost \
  --namespace opencost --create-namespace \
  --set opencost.exporter.defaultClusterId=prod-eu-west-1 \
  --set opencost.prometheus.internal.enabled=true
Point it at your Prometheus, wait 24 hours for data, then build a Grafana dashboard with cost per namespace, idle cost, and pod efficiency (actual / requested). If a namespace is under 30% efficient, it's a target.
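If you want the efficiency ratio straight from Prometheus before the dashboard exists, a query along these lines works — this is a sketch that assumes the standard cAdvisor and kube-state-metrics metric names, so adjust if your setup relabels them:

# CPU actually used divided by CPU requested, per namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  /
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})

Anything sitting well below 0.3 on that ratio goes straight onto the right-sizing list.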
Step 1: Right-Size Requests And Limits
Over-requested CPU and memory is the single largest cost driver we see. Engineers copy-paste requests from an old chart, multiply by a safety factor "just in case," and ship it. The cluster dutifully provisions nodes to satisfy those requests.
Use VPA In Recommendation Mode
Do not enable VPA in Auto mode on stateful workloads — it will restart pods to resize them. Use Off mode to get recommendations without mutation:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi
After a week, query the VPA recommendations with kubectl get vpa api-vpa -n prod -o yaml and compare against current requests. The delta is your savings.
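If you only want the numbers, the recommendation lives under the object's status; something like this pulls the per-container targets directly (assuming the recommender has had time to populate them):

kubectl get vpa api-vpa -n prod \
  -o jsonpath='{.status.recommendation.containerRecommendations[*].target}'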
Rules Of Thumb
- CPU requests should target the p95 of real usage. CPU is compressible — under contention a pod that exceeds its request simply gets less CPU time; it is never killed for it.
- Memory requests should target the p99 or the max observed. Memory is not compressible — hitting the limit means an OOMKill.
- CPU limits should usually be unset. A CPU limit throttles the container whenever it bursts past the limit, even if the node has idle cycles to spare. Only set one for workloads you want to cap for billing or fairness reasons.
- Memory limits should equal requests so a leaking container dies at a predictable point. Note that this alone does not give Guaranteed QoS — that also requires CPU limits equal to requests — but Burstable with a matched memory request and limit is usually the right trade-off.
A Pod With Sensible Resource Config
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/api:1.12.0
          resources:
            requests:
              cpu: 150m
              memory: 256Mi
            limits:
              memory: 256Mi
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }
          livenessProbe:
            httpGet: { path: /live, port: 8080 }
Step 2: Replace Cluster Autoscaler With Karpenter
If you're on EKS and still using Cluster Autoscaler with pre-created node groups, you're leaving money on the table. Karpenter looks at pending pods, picks the cheapest instance type that fits them, and provisions it in ~30 seconds. It also consolidates: when a node becomes underutilized, Karpenter spins up cheaper replacement capacity if needed, drains the node, and terminates it, letting the scheduler re-place the pods.
Minimal NodePool with mixed spot and on-demand, capacity-optimized allocation, and consolidation enabled:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m7i", "m7a", "c7i", "r7i"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "2000"
    memory: 4000Gi
Two details that matter. First, allowing both amd64 and arm64 lets Karpenter pick Graviton nodes when a workload supports them — Graviton is typically 15-20% cheaper per vCPU. Second, WhenEmptyOrUnderutilized is the new consolidation mode that actively shrinks the cluster, not just when a node is fully empty.
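The NodePool references an EC2NodeClass named default, which is where the AWS-specific plumbing lives. A minimal sketch — the role name and the karpenter.sh/discovery tag values here are placeholders you would swap for your own:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest            # latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-prod        # placeholder: the IAM role your nodes assume
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-1   # placeholder: tag used to discover subnets
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-1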
Step 3: Put The Right Workloads On Spot
Spot instances are 60-90% cheaper than on-demand. The fear is interruption: AWS gives you a two-minute notice and reclaims the node. That fear is usually overblown in 2026 — interruption rates on the common families (m7i, c7i, r7i) often sit under 5% per month.
What Goes On Spot
| Workload type | Spot safe? | Notes |
|---|---|---|
| Stateless HTTP APIs | Yes | Run enough replicas and set PDBs |
| Batch jobs | Yes | Use checkpoints; restart on interruption |
| CI runners | Yes | Job-level retry handles eviction |
| Dev/staging | Yes | Obvious |
| Stateful databases | No | Use on-demand or managed RDS/Aurora |
| Leader-elected controllers | Caution | Make sure multiple replicas span zones |
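To nudge a specific Deployment toward spot capacity without hard-requiring it, a preferred node affinity on Karpenter's capacity-type label is enough — a sketch of the pod spec fragment:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                    # prefer spot, fall back to on-demand if none fits
        preference:
          matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]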
Protect Availability With PodDisruptionBudgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: api
Combine with topology spread so your replicas don't all land on the same spot node:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
Step 4: Autoscale The Workloads, Not Just The Nodes
Node autoscaling is necessary but not sufficient. If a Deployment has 20 replicas at 3am when it could have 4, you're paying for idle pods. HPA based on CPU is the default, but custom metrics are where the real savings live.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
For event-driven workloads (queue consumers, Kafka processors), KEDA is the right answer. It scales on external metrics like SQS depth, Kafka lag, or Postgres row count — including scale-to-zero.
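A sketch of what that looks like for an SQS consumer — the Deployment name, queue URL, target queue length, and credential reference are placeholders for whatever your setup actually uses:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
  namespace: prod
spec:
  scaleTargetRef:
    name: orders-consumer              # placeholder: the Deployment to scale
  minReplicaCount: 0                   # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/orders   # placeholder
        queueLength: "100"             # target messages per replica
        awsRegion: eu-west-1
      authenticationRef:
        name: keda-aws-credentials     # placeholder: a TriggerAuthentication for AWS access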
Step 5: Namespace Quotas And LimitRanges
Give every team a budget. Without quotas, one misbehaving team can eat the whole cluster and your whole bill.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-growth-quota
  namespace: team-growth
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: team-growth
spec:
  limits:
    - type: Container
      default:
        cpu: 200m
        memory: 256Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
The LimitRange is the quiet hero here — it gives sane defaults to any pod that forgets to specify them, so you don't get "unlimited" sneaking in.
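Checking how close a team is to its budget is a one-liner — the describe output lists used versus hard for every tracked resource, which also makes it easy to alert on:

kubectl describe resourcequota team-growth-quota -n team-growth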
Step 6: Showback And Chargeback
The last 10% of savings comes from cultural pressure, not configuration. Once you have OpenCost pulling real numbers, send every engineering team a weekly cost email. Not an alert, not a dashboard — a calm summary saying "your namespace cost $4,213 last week, here are the top 5 expensive workloads." Teams will optimize their own code when the number is visible and tied to them.
A simple PromQL query to pull per-namespace cost:
sum by (namespace) (
kubecost_namespace_monthly_cost{cluster="prod-eu-west-1"}
)
Wire that into a scheduled Grafana report or a small Python script and email it every Monday.
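A bare-bones sketch of that wiring — it assumes Prometheus is reachable at the in-cluster address below, runs the query above, and prints one namespace and cost per line, ready to drop into whatever mailer you already have:

PROM_URL=http://prometheus-server.monitoring.svc:9090    # placeholder: your Prometheus service
QUERY='sum by (namespace) (kubecost_namespace_monthly_cost{cluster="prod-eu-west-1"})'

curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[] | "\(.metric.namespace)\t\(.value[1])"'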
A Real Before/After
Here is a redacted summary from a recent client engagement — a Series B SaaS on EKS running about 400 pods across prod, staging, and dev clusters.
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly EKS spend | $48,200 | $28,900 | -40% |
| Node count (prod) | 32 | 18 | -44% |
| Avg CPU utilization | 14% | 46% | +3.3x |
| Avg memory utilization | 22% | 61% | +2.8x |
| Spot ratio | 0% | 65% | - |
The entire project took six weeks. Two of those were Karpenter rollout, two were right-sizing the 40 highest-cost workloads, and two were quota policy and showback setup.
The Short Checklist
If you do nothing else this quarter:
- Install OpenCost. Get per-namespace numbers.
- Right-size the top 10 most expensive workloads using VPA recommendations.
- Deploy Karpenter with consolidation enabled.
- Move stateless and batch workloads to spot.
- Set ResourceQuotas and LimitRanges on every namespace.
- Set up weekly cost emails to team leads.
Next Steps
Cost optimization is not a one-time project — cluster drift is real and a clean cluster becomes a messy one again in about a quarter unless you keep it honest. The teams that stay efficient treat it like security: automated checks in CI, regular audits, a named owner. If you'd like help putting this playbook into practice against your own cluster, get in touch.