Kubernetes · 8 min read

Kubernetes Cost Optimization: Reduce Your Cloud Bill by 40%

A systematic approach to cutting Kubernetes spend: right-sizing with VPA, Karpenter consolidation, spot workloads, namespace quotas, and showback with OpenCost.

Why Your Cluster Is Bleeding Money

The dirty secret of Kubernetes economics is that most clusters run at 15-25% real utilization while being billed as though they were pegged. We've audited dozens of production clusters for clients in the last year and the pattern is identical: requests set to double or triple what the pod ever actually uses, a fleet of large on-demand nodes, and a bill that grows faster than headcount. A typical engagement cuts spend by 35-45% in the first month without changing a single line of application code.

This post is the playbook we use. It's ordered the way you should execute it — from the changes that land the biggest wins to the long-tail optimizations that matter once you're past the low-hanging fruit.

Measure Before You Cut

You cannot optimize what you cannot attribute. Before touching a single request, install something that gives you per-namespace and per-workload cost visibility. In 2026 the two options that matter are:

  • OpenCost — the CNCF project that grew out of Kubecost. Open source, vendor neutral, integrates with Prometheus.
  • Kubecost — the commercial build of the same codebase with hosted options, alerts, and team features.

Either one takes about an hour to deploy with Helm:

helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm upgrade --install opencost opencost/opencost \
  --namespace opencost --create-namespace \
  --set opencost.exporter.defaultClusterId=prod-eu-west-1 \
  --set opencost.prometheus.internal.enabled=true

Point it at your Prometheus, wait 24 hours for data, then build a Grafana dashboard with cost per namespace, idle cost, and pod efficiency (actual / requested). If a namespace is under 30% efficient, it's a target.
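
If you want a rougher signal before the dashboards exist, you can eyeball efficiency straight from kubectl. A quick sketch, assuming metrics-server is installed; the checkout namespace is just an example, substitute your own:

# live usage (metrics-server) next to the declared requests
kubectl top pods -n checkout --no-headers
kubectl get pods -n checkout -o custom-columns=\
"NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory"

Pods whose live usage sits far below their requests are the first candidates for Step 1.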

Step 1: Right-Size Requests And Limits

Over-requested CPU and memory is the single largest cost driver we see. Engineers copy-paste requests from an old chart, multiply by a safety factor "just in case," and ship it. The cluster dutifully provisions nodes to satisfy those requests.

Use VPA In Recommendation Mode

Do not enable VPA in Auto mode on stateful workloads — it will restart pods to resize them. Use Off mode to get recommendations without mutation:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

After a week, query the VPA recommendations with kubectl get vpa api-vpa -n prod -o yaml and compare against current requests. The delta is your savings.
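
If you only want the numbers, a jsonpath query keeps it short; the target values are what VPA would apply if it were allowed to mutate the pod:

# pulls just the recommendation block from the VPA status
kubectl get vpa api-vpa -n prod \
  -o jsonpath='{.status.recommendation.containerRecommendations}'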

Rules Of Thumb

  • CPU requests should target the p95 of real usage. CPU is compressible: under contention the pod gets throttled, never killed, for using more than it requested.
  • Memory requests should target the p99 or the max observed. Memory is not compressible: hitting the limit means an OOMKill.
  • CPU limits should usually be unset. A CPU limit triggers CFS throttling whenever the container hits it, even when the node is otherwise idle. Set one only for workloads you deliberately want to cap for billing or fairness reasons.
  • Memory limits should equal requests so a pod cannot balloon past what was scheduled for it. (Note that this alone does not make the pod Guaranteed QoS; that also requires CPU limits equal to CPU requests, which conflicts with the previous rule. Burstable with memory limit = request is the usual compromise.)

A Pod With Sensible Resource Config

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/api:1.12.0
          resources:
            requests:
              cpu: 150m
              memory: 256Mi
            limits:
              memory: 256Mi
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }
          livenessProbe:
            httpGet: { path: /live, port: 8080 }
Step 2: Replace Cluster Autoscaler With Karpenter

If you're on EKS and still using Cluster Autoscaler with pre-created node groups, you're leaving money on the table. Karpenter looks at pending pods, picks the cheapest instance type that fits them, and provisions it in ~30 seconds. It also consolidates: when a node becomes underutilized, Karpenter moves the pods to a smaller instance and terminates the old one.

A minimal NodePool with mixed spot and on-demand capacity, a broad instance range so Karpenter's price-capacity-optimized spot allocation has room to work, and consolidation enabled:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m7i", "m7a", "c7i", "r7i"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "2000"
    memory: 4000Gi

Two details that matter. First, allowing both amd64 and arm64 lets Karpenter pick Graviton nodes when a workload supports them — Graviton is typically 15-20% cheaper per vCPU. Second, WhenEmptyOrUnderutilized is the new consolidation mode that actively shrinks the cluster, not just when a node is fully empty.
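
The NodePool above points at an EC2NodeClass named default, which carries the AWS-specific wiring. A minimal sketch, assuming an Amazon Linux 2023 AMI and the usual karpenter.sh/discovery tags; the role name and tag value are placeholders for your environment:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-prod          # placeholder node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-1
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-1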

Step 3: Put The Right Workloads On Spot

Spot instances are 60-90% cheaper than on-demand. The fear is interruption: AWS gives you a two-minute notice and reclaims the node. That fear is usually overblown in 2026 — interruption rates on the common families (m7i, c7i, r7i) often sit under 5% per month.
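
The risk shrinks further if Karpenter handles the warning itself: with an interruption queue configured, it cordons and drains the node as soon as the two-minute notice arrives. A sketch of the Helm setting, assuming the SQS queue and EventBridge rules from the Karpenter docs already exist; the queue name is a placeholder:

# queue name is a placeholder; the queue + EventBridge rules must already exist
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --reuse-values \
  --set settings.interruptionQueue=karpenter-prod-eu-west-1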

What Goes On Spot

Workload type                Spot safe?   Notes
Stateless HTTP APIs          Yes          Run enough replicas and set PDBs
Batch jobs                   Yes          Use checkpoints; restart on interruption (sketch below)
CI runners                   Yes          Job-level retry handles eviction
Dev/staging                  Yes          Obvious
Stateful databases           No           Use on-demand or managed RDS/Aurora
Leader-elected controllers   Caution      Make sure multiple replicas span zones
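
For the batch row, the cheapest pattern is a Job whose retry budget simply absorbs a reclaim. A minimal sketch, assuming Karpenter's capacity-type node label; the name and image are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  backoffLimit: 4                       # retries absorb a spot reclaim mid-run
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: report
          image: ghcr.io/example/report:1.4.0   # placeholder image
          resources:
            requests:
              cpu: 500m
              memory: 512Mi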

Protect Availability With PodDisruptionBudgets

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: api

Combine with topology spread so your replicas land across zones instead of piling into a single spot capacity pool (add a second constraint on kubernetes.io/hostname if you also want them on separate nodes):

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api

Step 4: Autoscale The Workloads, Not Just The Nodes

Node autoscaling is necessary but not sufficient. If a Deployment has 20 replicas at 3am when it could have 4, you're paying for idle pods. HPA based on CPU is the default, but custom metrics are where the real savings live. The Pods metric in the example below assumes a custom metrics adapter (such as prometheus-adapter) is exposing http_requests_per_second to the cluster.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60

For event-driven workloads (queue consumers, Kafka processors), KEDA is the right answer. It scales on external metrics like SQS depth, Kafka lag, or Postgres row count — including scale-to-zero.
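
A sketch of what that looks like for an SQS-driven consumer; the names, queue URL, and TriggerAuthentication are placeholders, and the IAM/IRSA setup is out of scope here:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
  namespace: prod
spec:
  scaleTargetRef:
    name: orders-consumer              # Deployment to scale
  minReplicaCount: 0                   # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/orders   # placeholder
        queueLength: "100"             # target messages per replica
        awsRegion: eu-west-1
      authenticationRef:
        name: aws-credentials          # TriggerAuthentication, not shown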

Step 5: Namespace Quotas And LimitRanges

Give every team a budget. Without quotas, one misbehaving team can eat the whole cluster and your whole bill.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-growth-quota
  namespace: team-growth
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: team-growth
spec:
  limits:
    - type: Container
      default:
        cpu: 200m
        memory: 256Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi

The LimitRange is the quiet hero here: it gives sane defaults to any pod that forgets to specify them, so you don't get "unlimited" sneaking in. One caveat: the default block also injects a 200m CPU limit into any container that omits one, which works against the earlier advice to leave CPU limits unset, so drop the cpu key from default if that matters to you.

Step 6: Showback And Chargeback

The last 10% of savings comes from cultural pressure, not configuration. Once you have OpenCost pulling real numbers, send every engineering team a weekly cost email. Not an alert, not a dashboard — a calm summary saying "your namespace cost $4,213 last week, here are the top 5 expensive workloads." Teams will optimize their own code when the number is visible and tied to them.

A simple PromQL query to pull per-namespace cost (the exact metric name varies with your OpenCost or Kubecost version, so adjust it to whatever your install exposes):

sum by (namespace) (
  kubecost_namespace_monthly_cost{cluster="prod-eu-west-1"}
)

Wire that into a scheduled Grafana report or a small Python script and email it every Monday.
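
As a sketch of the script half, assuming Prometheus is reachable at a typical in-cluster address (adjust the URL and metric to your setup), curl and jq are enough to pull the numbers and pipe them into whatever sends the email:

# Prometheus URL and metric name are placeholders; point them at your install
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace) (kubecost_namespace_monthly_cost{cluster="prod-eu-west-1"})' \
  | jq -r '.data.result[] | "\(.metric.namespace)\t\(.value[1])"'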

A Real Before/After

Here is a redacted summary from a recent client engagement — a Series B SaaS on EKS running about 400 pods across prod, staging, and dev clusters.

Metric                   Before     After      Change
Monthly EKS spend        $48,200    $28,900    -40%
Node count (prod)        32         18         -44%
Avg CPU utilization      14%        46%        +3.3x
Avg memory utilization   22%        61%        +2.8x
Spot ratio               0%         65%        -

The entire project took six weeks. Two of those were Karpenter rollout, two were right-sizing the 40 highest-cost workloads, and two were quota policy and showback setup.

The Short Checklist

If you do nothing else this quarter:

  1. Install OpenCost. Get per-namespace numbers.
  2. Right-size the top 10 most expensive workloads using VPA recommendations.
  3. Deploy Karpenter with consolidation enabled.
  4. Move stateless and batch workloads to spot.
  5. Set ResourceQuotas and LimitRanges on every namespace.
  6. Set up weekly cost emails to team leads.

Next Steps

Cost optimization is not a one-time project — cluster drift is real and a clean cluster becomes a messy one again in about a quarter unless you keep it honest. The teams that stay efficient treat it like security: automated checks in CI, regular audits, a named owner. If you'd like help putting this playbook into practice against your own cluster, get in touch.
