Why "Rolling Update" Is Not Enough
The default deployment strategy in Kubernetes is RollingUpdate, and most teams stop there. It's better than a full outage, but it's a weak choice for production: no automated health check beyond the readiness probe, no traffic shaping, no automated rollback, and no way to validate the new version under real load before shifting all users to it. If the new pod is broken in a way that only shows up at 10% traffic, you'll find out at 100% traffic.
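For reference, this is roughly the entire knob set a stock Deployment gives you. A minimal sketch: the maxSurge and maxUnavailable values are the Kubernetes defaults, and the image and probe path are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 8
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # extra pods allowed above the desired count during the update
      maxUnavailable: 25%    # pods allowed to be unavailable during the update
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/api:1.14.0   # placeholder image
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
That's it: surge, unavailability, and a readiness gate. Nothing here watches error rates or puts traffic back where it was.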
Real zero-downtime deployments require two things working together: a traffic strategy (blue-green or canary) and a health strategy (automated analysis that can abort the rollout). This post is about how to build both properly with Argo Rollouts, Kubernetes, and a service mesh or ingress you probably already run.
Blue-Green: The Switch
Blue-green runs two complete environments side by side. Blue serves production traffic while Green receives the new version. You deploy, smoke test against an internal service name, then flip a single label or router rule and Green becomes production. The old Blue is kept warm for instant rollback.
When Blue-Green Wins
- Stateful tests are needed before go-live. You can run full smoke/e2e suites against Green using real infrastructure.
- Instant rollback is a hard requirement. A label flip is faster than any canary rewind.
- The cost of a small-percentage bad rollout is unacceptable. Payment flows, auth, critical internal APIs.
When It Hurts
- You cannot afford 2x the capacity for the duration of the switch. Blue-green doubles your pod count during the transition.
- You have database migrations that aren't backward compatible. Both Blue and Green share the same DB. If Green needs a schema change, Blue may break.
- Your service is large. Provisioning a full Green can take minutes of scheduling time.
Blue-Green With Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  replicas: 10
  strategy:
    blueGreen:
      activeService: payments-active
      previewService: payments-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 600
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: payments-preview
      postPromotionAnalysis:
        templates:
          - templateName: error-rate
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: api
          image: ghcr.io/example/payments:1.14.0
          ports:
            - containerPort: 8080
Two properties matter here. autoPromotionEnabled: false means the active service never flips on its own: the prePromotionAnalysis has to pass and the Rollout still has to be promoted explicitly, by a human or by your CD pipeline. scaleDownDelaySeconds: 600 keeps the old ReplicaSet alive for 10 minutes after promotion, which is your instant-rollback window.
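The Rollout also assumes that payments-active, payments-preview, and the smoke-tests AnalysisTemplate already exist. A minimal sketch of what they might look like follows; the smoke-test image and its arguments are hypothetical, and the Rollouts controller injects a rollouts-pod-template-hash selector into each Service so they track the right ReplicaSet.
apiVersion: v1
kind: Service
metadata:
  name: payments-active
spec:
  selector:
    app: payments          # controller adds a rollouts-pod-template-hash selector
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payments-preview
spec:
  selector:
    app: payments
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-tests
spec:
  args:
    - name: service-name
  metrics:
    - name: smoke-tests
      count: 1
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: ghcr.io/example/payments-smoke:latest   # hypothetical test-runner image
                    args: ["--target", "http://{{args.service-name}}"]
With the job provider, a failed Job marks the metric as failed, which is enough to block promotion.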
Canary: The Gradient
Canary releases send a small slice of traffic to the new version, watch the metrics, and expand gradually. Done right, it's the safest strategy in most production environments. Done wrong (no metrics, no automated abort, manual weight bumps), it's worse than rolling update because the engineer babysitting it just clicks "promote" at every step.
When Canary Wins
- High-traffic services. 10% of a million requests per minute is plenty of signal to detect a regression.
- Cost sensitivity. You run 110% capacity during the rollout, not 200%.
- Gradual risk exposure. Great for user-facing changes where "looks fine internally" isn't enough.
When It Hurts
- Low-traffic services. 10% of 50 rpm is 5 rpm. That's not enough signal to detect anything in five minutes. For those, blue-green or plain rolling update is fine.
- Hard-to-isolate metrics. If you can't compute per-version error rate, canary analysis is a lie.
Canary With Argo Rollouts And Istio
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 8
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api
            routes: [primary]
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            args:
              - name: service-name
                value: api-canary
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            args:
              - name: service-name
                value: api-canary
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
  # selector and template omitted for brevity; a real Rollout needs them,
  # same shape as the blue-green example above
Notice how the pauses are short at the start and longer later. The first minutes of a canary are the most informative — if the new version is going to crash or throw 500s, you want to see it immediately.
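One piece the Rollout references but doesn't contain is the Istio VirtualService named api with a route named primary. A minimal sketch, assuming host-level traffic splitting between the stable and canary Services; Argo Rollouts rewrites the two weights at every setWeight step.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api
  http:
    - name: primary
      route:
        - destination:
            host: api-stable
          weight: 100      # rewritten by the Rollouts controller at each step
        - destination:
            host: api-canary
          weight: 0
If you prefer subset-level splitting, the same idea applies with a DestinationRule and the destinationRule field under trafficRouting.istio.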
Automated Analysis: The Part That Actually Matters
Neither strategy is safe without automated, metric-driven abort. Argo Rollouts uses AnalysisTemplate to query Prometheus (or Datadog, New Relic, CloudWatch, etc.) and fail the rollout if the signal is bad.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 6
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))
And a latency template using the same pattern:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99
spec:
  args:
    - name: service-name
  metrics:
    - name: latency-p99
      interval: 30s
      count: 6
      successCondition: result[0] < 0.4
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(
              0.99,
              sum by (le) (
                rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[1m])
              )
            )
If either fails, the Rollout aborts and restores stable traffic without a human in the loop. This is the entire point.
Database Migrations: The Silent Killer
The single most common reason "zero-downtime" turns into "an outage" is a database migration that assumed the old code was already gone. Both blue-green and canary serve old and new application code simultaneously. Your schema must work for both at the same time.
The Expand-And-Contract Pattern
Never ship a breaking migration in a single release. Split it into:
- Expand. Add the new column/table/index. Old code ignores it. New code writes to both old and new.
- Backfill. Copy data from old to new. Runs as a background job.
- Migrate reads. New code reads from the new column. Old code still reads the old one.
- Contract. Remove the old column in a later release, once no running pod uses it.
Yes, this is four releases instead of one. It is also the difference between a planned change and a 2am incident.
-- Release N: expand
ALTER TABLE users ADD COLUMN email_normalized TEXT;
CREATE INDEX CONCURRENTLY idx_users_email_norm ON users(email_normalized);
-- Release N (app code): dual-write
INSERT INTO users (email, email_normalized) VALUES ($1, LOWER($1));
-- Release N+1: backfill
UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL;
-- Release N+2: read from new column
SELECT id FROM users WHERE email_normalized = $1;
-- Release N+3: contract
ALTER TABLE users DROP COLUMN email;
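The pattern calls the backfill a background job; if it runs inside the cluster, a throwaway Kubernetes Job is one way to do it. A sketch under a few assumptions: the postgres image is used only for its psql client, payments-db is a hypothetical Secret holding the connection string, and for large tables you would batch the UPDATE rather than run it in one statement.
apiVersion: batch/v1
kind: Job
metadata:
  name: backfill-email-normalized
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backfill
          image: postgres:16               # any image with psql available
          command: ["psql"]
          args:
            - "$(DATABASE_URL)"            # expanded by Kubernetes from the env var below
            - "-c"
            - "UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL;"
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: payments-db        # hypothetical Secret with the connection string
                  key: url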
Connection Draining And Graceful Shutdown
Even with perfect traffic shifting, a pod that dies mid-request drops that request. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds, then sends SIGKILL. Your app needs to:
- Fail its readiness probe immediately (so it's removed from endpoints).
- Continue serving in-flight requests.
- Close the HTTP listener and drain connections.
- Exit cleanly before the grace period ends.
Minimal Node.js shutdown handler:
import http from "node:http";
import express from "express";

const app = express();
let ready = true;

app.get("/ready", (_, res) => (ready ? res.sendStatus(200) : res.sendStatus(503)));
app.get("/live", (_, res) => res.sendStatus(200));

const server = http.createServer(app);
server.listen(8080);

const shutdown = async () => {
  console.log("SIGTERM received, draining");
  ready = false;
  // Give load balancers time to observe unreadiness
  await new Promise((r) => setTimeout(r, 5000));
  server.close((err) => {
    if (err) {
      console.error("Shutdown error", err);
      process.exit(1);
    }
    process.exit(0);
  });
  // Hard timeout
  setTimeout(() => process.exit(1), 25000).unref();
};

process.on("SIGTERM", shutdown);
process.on("SIGINT", shutdown);
And the matching pod spec:
spec:
terminationGracePeriodSeconds: 30
containers:
- name: api
lifecycle:
preStop:
exec:
command: ["sleep", "5"]
readinessProbe:
httpGet: { path: /ready, port: 8080 }
periodSeconds: 2
failureThreshold: 2
The preStop hook gives the service mesh a moment to propagate the removal before the process starts shutting down. Small thing, huge difference in production.
Choosing Between The Two
| Criterion | Blue-Green | Canary |
|---|---|---|
| Traffic model | All-or-nothing | Gradual percentage |
| Rollback speed | Instant (label flip) | Seconds (traffic reset) |
| Extra capacity | 2x during switch | 1.1x during rollout |
| Real user validation | Only after switch | During rollout |
| Minimum useful traffic | Any | ~100 rpm+ |
| DB migration tolerance | Needs backward-compatible schema (Blue and Green share the DB) | Same constraint (stable and canary run together) |
| Complexity | Lower | Higher (needs metrics) |
In practice, we recommend canary by default for any service with enough traffic for the analysis to be meaningful, and blue-green for low-traffic critical services (auth, payments, webhooks) where full pre-promotion smoke tests are more valuable than gradual exposure.
Next Steps
Shipping without fear requires three things: a traffic strategy, automated metric-driven abort, and a database discipline that assumes two code versions will run concurrently. The YAML is easy. The discipline is the hard part. Start by adopting expand-and-contract on your next schema change, add an AnalysisTemplate that queries your existing Prometheus, and run your first canary on a non-critical service. If you want help putting a progressive delivery platform in place across multiple services, get in touch.