Why "Rolling Update" Is Not Enough
The default deployment strategy in Kubernetes is RollingUpdate, and most teams stop there. It's better than a full outage, but it's a weak choice for production: no automated health check beyond the readiness probe, no traffic shaping, no automated rollback, and no way to validate the new version under real load before shifting all users to it. If the new pod is broken in a way that only shows up at 10% traffic, you'll find out at 100% traffic.
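For reference, this is roughly the entire knob set a stock Deployment gives you. A minimal sketch: the maxSurge and maxUnavailable values are the Kubernetes defaults, and the image and probe path are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 8
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # extra pods allowed above the desired count during the update
      maxUnavailable: 25%    # pods allowed to be unavailable during the update
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/api:1.14.0   # placeholder image
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
That's it: surge, unavailability, and a readiness gate. Nothing here watches error rates or puts traffic back where it was.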
Real zero-downtime deployments require two things working together: a traffic strategy (blue-green or canary) and a health strategy (automated analysis that can abort the rollout). This post is about how to build both properly with Argo Rollouts, Kubernetes, and a service mesh or ingress you probably already run.
Blue-Green: The Switch
Blue-green runs two complete environments side by side. Blue serves production traffic while Green receives the new version. You deploy, smoke test against an internal service name, then flip a single label or router rule and Green becomes production. The old Blue is kept warm for instant rollback.
When Blue-Green Wins
- Stateful tests are needed before go-live. You can run full smoke/e2e suites against Green using real infrastructure.
- Instant rollback is a hard requirement. A label flip is faster than any canary rewind.
- The cost of a small-percentage bad rollout is unacceptable. Payment flows, auth, critical internal APIs.
When It Hurts
- You cannot afford 2x the capacity for the duration of the switch. Blue-green doubles your pod count during the transition.
- You have database migrations that aren't backward compatible. Both Blue and Green share the same DB. If Green needs a schema change, Blue may break.
- Your service is large. Provisioning a full Green can take minutes of scheduling time.
Blue-Green With Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  replicas: 10
  strategy:
    blueGreen:
      activeService: payments-active
      previewService: payments-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 600
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: payments-preview
      postPromotionAnalysis:
        templates:
          - templateName: error-rate
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: api
          image: ghcr.io/example/payments:1.14.0
          ports:
            - containerPort: 8080
Two properties matter here. autoPromotionEnabled: false means the active service never flips on its own: the prePromotionAnalysis has to pass and the Rollout still has to be promoted explicitly, by a human or by your CD pipeline. scaleDownDelaySeconds: 600 keeps the old ReplicaSet alive for 10 minutes after promotion, which is your instant-rollback window.
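The Rollout also assumes that payments-active, payments-preview, and the smoke-tests AnalysisTemplate already exist. A minimal sketch of what they might look like follows; the smoke-test image and its arguments are hypothetical, and the Rollouts controller injects a rollouts-pod-template-hash selector into each Service so they track the right ReplicaSet.
apiVersion: v1
kind: Service
metadata:
  name: payments-active
spec:
  selector:
    app: payments          # controller adds a rollouts-pod-template-hash selector
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payments-preview
spec:
  selector:
    app: payments
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-tests
spec:
  args:
    - name: service-name
  metrics:
    - name: smoke-tests
      count: 1
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: ghcr.io/example/payments-smoke:latest   # hypothetical test-runner image
                    args: ["--target", "http://{{args.service-name}}"]
With the job provider, a failed Job marks the metric as failed, which is enough to block promotion.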
Canary: The Gradient
Canary releases send a small slice of traffic to the new version, watch the metrics, and expand gradually. Done right, it's the safest strategy in most production environments. Done wrong (no metrics, no automated abort, manual weight bumps), it's worse than rolling update because the engineer babysitting it just clicks "promote" at every step.
When Canary Wins
- High-traffic services. 10% of a million requests per minute is plenty of signal to detect a regression.
- Cost sensitivity. You run 110% capacity during the rollout, not 200%.
- Gradual risk exposure. Great for user-facing changes where "looks fine internally" isn't enough.
When It Hurts
- Low-traffic services. 10% of 50 rpm is 5 rpm. That's not enough signal to detect anything in five minutes. For those, blue-green or plain rolling update is fine.
- Hard-to-isolate metrics. If you can't compute per-version error rate, canary analysis is a lie.
Canary With Argo Rollouts And Istio
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 8
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api
            routes: [primary]
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            args:
              - name: service-name
                value: api-canary
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            args:
              - name: service-name
                value: api-canary
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
  # selector and template omitted for brevity; a real Rollout needs them,
  # same shape as the blue-green example above
Notice how the pauses are short at the start and longer later. The first minutes of a canary are the most informative — if the new version is going to crash or throw 500s, you want to see it immediately.
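One piece the Rollout references but doesn't contain is the Istio VirtualService named api with a route named primary. A minimal sketch, assuming host-level traffic splitting between the stable and canary Services; Argo Rollouts rewrites the two weights at every setWeight step.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api
  http:
    - name: primary
      route:
        - destination:
            host: api-stable
          weight: 100      # rewritten by the Rollouts controller at each step
        - destination:
            host: api-canary
          weight: 0
If you prefer subset-level splitting, the same idea applies with a DestinationRule and the destinationRule field under trafficRouting.istio.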
Automated Analysis: The Part That Actually Matters
Neither strategy is safe without automated, metric-driven abort. Argo Rollouts uses AnalysisTemplate to query Prometheus (or Datadog, New Relic, CloudWatch, etc.) and fail the rollout if the signal is bad.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 6
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))
And a latency template using the same pattern:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99
spec:
  args:
    - name: service-name
  metrics:
    - name: latency-p99
      interval: 30s
      count: 6
      successCondition: result[0] < 0.4
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(
              0.99,
              sum by (le) (
                rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[1m])
              )
            )
If either fails, the Rollout aborts and restores stable traffic without a human in the loop. This is the entire point.
Database Migrations: The Silent Killer
The single most common reason "zero-downtime" turns into "an outage" is a database migration that assumed the old code was already gone. Both blue-green and canary serve old and new application code simultaneously. Your schema must work for both at the same time.
The Expand-And-Contract Pattern
Never ship a breaking migration in a single release. Split it into:
- Expand. Add the new column/table/index. Old code ignores it. New code writes to both old and new.
- Backfill. Copy data from old to new. Runs as a background job.
- Migrate reads. New code reads from the new column. Old code still reads the old one.
- Contract. Remove the old column in a later release, once no running pod uses it.
Yes, this is four releases instead of one. It is also the difference between a planned change and a 2am incident.
-- Release N: expand
ALTER TABLE users ADD COLUMN email_normalized TEXT;
CREATE INDEX CONCURRENTLY idx_users_email_norm ON users(email_normalized);
-- Release N (app code): dual-write
INSERT INTO users (email, email_normalized) VALUES ($1, LOWER($1));
-- Release N+1: backfill
UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL;
-- Release N+2: read from new column
SELECT id FROM users WHERE email_normalized = $1;
-- Release N+3: contract
ALTER TABLE users DROP COLUMN email;
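The pattern calls the backfill a background job; if it runs inside the cluster, a throwaway Kubernetes Job is one way to do it. A sketch under a few assumptions: the postgres image is used only for its psql client, payments-db is a hypothetical Secret holding the connection string, and for large tables you would batch the UPDATE rather than run it in one statement.
apiVersion: batch/v1
kind: Job
metadata:
  name: backfill-email-normalized
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backfill
          image: postgres:16               # any image with psql available
          command: ["psql"]
          args:
            - "$(DATABASE_URL)"            # expanded by Kubernetes from the env var below
            - "-c"
            - "UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL;"
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: payments-db        # hypothetical Secret with the connection string
                  key: url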
Connection Draining And Graceful Shutdown
Even with perfect traffic shifting, a pod that dies mid-request drops that request. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds, then sends SIGKILL. Your app needs to:
- Fail its readiness probe immediately (so it's removed from endpoints).
- Continue serving in-flight requests.
- Close the HTTP listener and drain connections.
- Exit cleanly before the grace period ends.
Minimal Node.js shutdown handler:
import http from "node:http";
import express from "express";

const app = express();
let ready = true;

app.get("/ready", (_, res) => (ready ? res.sendStatus(200) : res.sendStatus(503)));
app.get("/live", (_, res) => res.sendStatus(200));

const server = http.createServer(app);
server.listen(8080);

const shutdown = async () => {
  console.log("SIGTERM received, draining");
  ready = false;
  // Give load balancers time to observe unreadiness
  await new Promise((r) => setTimeout(r, 5000));
  server.close((err) => {
    if (err) {
      console.error("Shutdown error", err);
      process.exit(1);
    }
    process.exit(0);
  });
  // Hard timeout
  setTimeout(() => process.exit(1), 25000).unref();
};

process.on("SIGTERM", shutdown);
process.on("SIGINT", shutdown);
And the matching pod spec:
spec:
terminationGracePeriodSeconds: 30
containers:
- name: api
lifecycle:
preStop:
exec:
command: ["sleep", "5"]
readinessProbe:
httpGet: { path: /ready, port: 8080 }
periodSeconds: 2
failureThreshold: 2
The preStop hook gives the service mesh a moment to propagate the removal before the process starts shutting down. Small thing, huge difference in production.
Choosing Between The Two
| Criterion | Blue-Green | Canary |
|---|---|---|
| Traffic model | All-or-nothing | Gradual percentage |
| Rollback speed | Instant (label flip) | Seconds (traffic reset) |
| Extra capacity | 2x during switch | 1.1x during rollout |
| Real user validation | Only after switch | During rollout |
| Minimum useful traffic | Any | ~100 rpm+ |
| DB migration tolerance | Needs backward-compatible schema (Blue and Green share the DB) | Same constraint (stable and canary run together) |
| Complexity | Lower | Higher (needs metrics) |
In practice, we recommend canary by default for any service with enough traffic for the analysis to be meaningful, and blue-green for low-traffic critical services (auth, payments, webhooks) where full pre-promotion smoke tests are more valuable than gradual exposure.
Next Steps
Shipping without fear requires three things: a traffic strategy, automated metric-driven abort, and a database discipline that assumes two code versions will run concurrently. The YAML is easy. The discipline is the hard part. Start by adopting expand-and-contract on your next schema change, add an AnalysisTemplate that queries your existing Prometheus, and run your first canary on a non-critical service. If you want help putting a progressive delivery platform in place across multiple services, get in touch.