The Problem With "Doing DevOps"
Every engineering org we walk into says the same thing: "we do DevOps." Then we look at the pipeline and find 14 manual approval gates, a Jenkins box nobody owns, and a `deploy.sh` script with a TODO from 2022. DevOps is not a team, a tool, or a job title. It is a set of engineering practices that shorten the feedback loop between writing code and running it in production — safely, repeatedly, and without heroics.
This guide is for teams that have outgrown the script-and-pray stage but haven't yet built a delivery pipeline they trust. We'll skip the manifesto and get to what you actually need to build, in what order, and with which tools. The stack we reference is current as of early 2026.
Why It Still Matters In 2026
The 2025 DORA report once again showed the same pattern: elite performers deploy multiple times per day, recover from incidents in under an hour, and have change failure rates below 5%. Low performers deploy monthly, recover in days, and fail a third of the time. The gap between the two has nothing to do with team size or technology choice. It is captured almost entirely by four metrics:
- Deployment frequency — how often you ship code to production
- Lead time for changes — commit to production duration
- Change failure rate — percentage of deployments that cause incidents
- Mean time to recovery (MTTR) — how long to get back to green
If you measure nothing else, measure these four. They are the north star. Everything in this post either moves one of them or gets out of the way.
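None of the four needs fancy tooling to track. Here is a minimal sketch in TypeScript, computed from hypothetical deploy and incident records — the record shapes are ours, so adapt them to whatever your pipeline actually logs:

```ts
// Hypothetical record shapes; adapt to whatever your deploy tooling logs.
interface Deploy {
  commitAt: Date;          // when the change was committed
  deployedAt: Date;        // when it reached production
  causedIncident: boolean;
}
interface Incident {
  openedAt: Date;
  resolvedAt: Date;
}

const HOUR_MS = 3_600_000;

function doraMetrics(deploys: Deploy[], incidents: Incident[], windowDays: number) {
  const leadTimesHours = deploys
    .map((d) => (d.deployedAt.getTime() - d.commitAt.getTime()) / HOUR_MS)
    .sort((a, b) => a - b);
  return {
    deploysPerDay: deploys.length / windowDays,
    medianLeadTimeHours: leadTimesHours[Math.floor(leadTimesHours.length / 2)] ?? 0,
    changeFailureRate:
      deploys.filter((d) => d.causedIncident).length / Math.max(deploys.length, 1),
    meanTimeToRecoveryHours:
      incidents.reduce((sum, i) => sum + (i.resolvedAt.getTime() - i.openedAt.getTime()), 0) /
      Math.max(incidents.length, 1) / HOUR_MS,
  };
}
```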
The Four Capability Loops
Think of DevOps as four nested loops, each short enough that a feedback signal returns before you forget why you started.
| Loop | Duration | What you learn |
|---|---|---|
| Local dev | seconds | Does my code compile and pass unit tests? |
| CI | minutes | Does it integrate with main? Any regression? |
| Pre-prod | tens of minutes | Does it pass integration and security tests? |
| Production | hours to days | Does it hold up under real traffic? |
A healthy org invests in all four. A broken org has a ten-minute local loop, a forty-minute CI, no pre-prod, and a terrifying production deploy. Fix the worst one first.
Capability 1: Source Control And Trunk-Based Development
Start with the boring stuff. If you have long-lived feature branches that sit for two weeks before merging, nothing downstream will save you. Trunk-based development with short-lived branches (< 24 hours) and feature flags is the only pattern we recommend for teams above five people.
Branching rules we enforce on every client:
- `main` is always deployable. If it isn't, stop the line.
- PRs merge in under a day. If a PR sits for a week, split it.
- No direct pushes to `main`. Everything goes through review and CI.
- Feature flags (LaunchDarkly, Unleash, or OpenFeature) decouple deploy from release; see the sketch after this list.
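What the last rule looks like in code: a minimal sketch using the OpenFeature Node SDK. The flag name and the two checkout handlers are hypothetical, and a vendor provider (LaunchDarkly, Unleash, or anything else) is assumed to have been registered at startup:

```ts
import { OpenFeature } from "@openfeature/server-sdk";

// Hypothetical handlers for the old and new code paths.
declare function legacyCheckout(userId: string): Promise<void>;
declare function newCheckout(userId: string): Promise<void>;

// Assumes a provider was registered at startup, e.g.
// await OpenFeature.setProviderAndWait(new SomeVendorProvider(...));
const flags = OpenFeature.getClient();

export async function handleCheckout(userId: string): Promise<void> {
  // The new path is merged and deployed but dark; the flag controls release.
  const useNewFlow = await flags.getBooleanValue("new-checkout-flow", false, {
    targetingKey: userId,
  });
  return useNewFlow ? newCheckout(userId) : legacyCheckout(userId);
}
```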
Capability 2: Continuous Integration
CI is where most teams first go wrong. They either bolt on a single "run tests" job or build a 2000-line workflow that nobody understands. Aim for the middle.
Here is a minimal but production-ready GitHub Actions pipeline for a Node.js service. It caches dependencies, runs type checks, unit tests, security scans, builds a container, and pushes to a registry:
```yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:

# Cancel superseded runs on the same ref to save runner minutes.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # corepack must enable pnpm before setup-node tries to resolve the pnpm cache,
      # otherwise the cache step fails with "Unable to locate executable file: pnpm".
      - run: corepack enable
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm run lint
      - run: pnpm run typecheck
      - run: pnpm run test --coverage
      - uses: codecov/codecov-action@v4

  security:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # Fail the build on HIGH/CRITICAL findings in the filesystem scan.
      - uses: aquasecurity/[email protected]
        with:
          scan-type: fs
          severity: HIGH,CRITICAL
          exit-code: "1"

  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-24.04
    permissions:
      id-token: write   # required for OIDC auth to AWS
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci
          aws-region: eu-west-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: |
            ${{ steps.ecr.outputs.registry }}/api:${{ github.sha }}
            ${{ steps.ecr.outputs.registry }}/api:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
```
Two non-negotiables here. First, OIDC-based AWS auth — no long-lived access keys in secrets. Second, concurrency cancels superseded runs so a busy PR doesn't eat your runner budget.
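For reference, the role behind that OIDC auth is a few lines of Terraform. A sketch, assuming the GitHub OIDC identity provider already exists in the account; the account ID and the org/repo in the `sub` condition are placeholders:

```hcl
data "aws_iam_policy_document" "github_oidc" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Restrict the role to CI runs from main on one repo (placeholder org/repo).
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:your-org/your-repo:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "github_actions_ci" {
  name               = "github-actions-ci"
  assume_role_policy = data.aws_iam_policy_document.github_oidc.json
}
```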
Capability 3: Continuous Delivery And Deployment
CD is where "we automated the build" becomes "we automated the business risk." The distinction matters. Continuous delivery means every green commit could go to production. Continuous deployment means every green commit does go to production. Start with the former and graduate to the latter once you trust the signal.
Pick A Deployment Model
Three models dominate in 2026:
- GitOps with ArgoCD or Flux — declarative state in Git, reconciled by a controller. Our default for Kubernetes.
- Push-based CI/CD — the pipeline calls `kubectl apply` or similar. Simpler, but auditability suffers.
- PaaS abstractions — Cloud Run, Fly.io, Render. Great for small teams who want to stay out of the YAML business.
GitOps wins when you have multiple clusters, strict change control, or an auditor who wants to see who changed what and when.
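Concretely, the unit of GitOps in ArgoCD is an Application: a pointer from a cluster to a path in a manifests repo, reconciled continuously. A sketch with placeholder repo URL, path, and namespaces:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/deploy-manifests
    targetRevision: main
    path: apps/api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```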
Progressive Delivery By Default
Ship behind a canary or blue-green strategy from day one. Rolling update is the default in Kubernetes, but it gives you almost no safety. With Argo Rollouts you get a declarative canary in a few lines:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  # selector and pod template omitted -- they take the same shape as a Deployment
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api
            routes: [primary]
      steps:
        - setWeight: 10            # send 10% of traffic to the canary
        - pause: { duration: 2m }
        - analysis:                # auto-abort if the template's query fails
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: { duration: 5m }
```
The analysis step queries Prometheus and auto-aborts the rollout if error rate climbs. This is the single highest-leverage change most teams can make: stop treating rollback as a manual escalation and let the pipeline do it for you.
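The success-rate template it references is an ordinary AnalysisTemplate. A sketch, assuming an in-cluster Prometheus and RED-style request metrics; the address, metric names, and threshold are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 3                     # three bad samples abort the rollout
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="api-canary",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="api-canary"}[2m]))
```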
Capability 4: Infrastructure As Code
If your production environment cannot be recreated from a Git repo, you do not have infrastructure — you have an artifact. IaC is the line between the two. We use Terraform for 90% of engagements, OpenTofu for clients who care about the fork, and Pulumi for shops that truly prefer programming languages.
Minimum viable IaC layout for a small team:
```
infra/
├── modules/
│   ├── network/
│   ├── eks/
│   └── rds/
├── envs/
│   ├── dev/
│   ├── staging/
│   └── prod/
└── .github/workflows/terraform.yaml
```
And here is a small module that creates an S3 bucket with sane defaults. Notice the validation, the versioning, the public-access block, and the tags — none of these are optional for a production bucket in 2026:
variable "name" {
type = string
description = "Bucket name suffix. Will be prefixed with environment."
validation {
condition = can(regex("^[a-z0-9-]{3,40}$", var.name))
error_message = "Name must be lowercase alphanumeric with dashes, 3-40 chars."
}
}
variable "environment" {
type = string
}
resource "aws_s3_bucket" "this" {
bucket = "${var.environment}-${var.name}"
force_destroy = var.environment != "prod"
tags = {
Environment = var.environment
ManagedBy = "terraform"
Module = "s3-bucket"
}
}
resource "aws_s3_bucket_versioning" "this" {
bucket = aws_s3_bucket.this.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_public_access_block" "this" {
bucket = aws_s3_bucket.this.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
bucket = aws_s3_bucket.this.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
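Consuming the module from an environment stack is then a single block. A sketch, assuming the module lives under `modules/s3-bucket` and is called from `envs/staging`:

```hcl
module "assets_bucket" {
  source      = "../../modules/s3-bucket"
  name        = "assets"
  environment = "staging"
}
```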
Plan In PRs, Apply On Merge
Run `terraform plan` on every pull request and post the output as a comment. Only run `terraform apply` when the PR merges to `main`. This gives reviewers the same "what will actually change" view they get from a code diff.
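A sketch of that split as a single GitHub Actions workflow, assuming the infra layout above, a staging stack, and the OIDC role from the CI section; the step that posts the plan as a PR comment is left to your bot of choice. (Atlantis or Terraform Cloud give you this flow off the shelf.)

```yaml
name: terraform
on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: [main]
    paths: ["infra/**"]
permissions:
  id-token: write   # OIDC auth to the cloud provider
  contents: read
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci
          aws-region: eu-west-1
      - run: terraform -chdir=infra/envs/staging init -input=false
      # Post this output as a PR comment with the commenting action of your choice.
      - run: terraform -chdir=infra/envs/staging plan -no-color -input=false
  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci
          aws-region: eu-west-1
      - run: terraform -chdir=infra/envs/staging init -input=false
      - run: terraform -chdir=infra/envs/staging apply -auto-approve -input=false
```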
Capability 5: Observability
You cannot operate what you cannot see. The three pillars are metrics, logs, and traces, but the honest priority order for a new team is:
- Structured logs first. If your logs aren't JSON, fix that before anything else (a sketch follows this list).
- RED metrics second. Rate, Errors, Duration per endpoint. Prometheus plus a Grafana dashboard gets you 80% of the value in an afternoon.
- Traces when you have more than three services. OpenTelemetry SDK, OTLP exporter, and a backend of your choice (Tempo, Honeycomb, Datadog).
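On the first point, a minimal sketch with pino; the library choice is ours, and any logger that emits one JSON object per line works:

```ts
import pino from "pino";

// One logger per service, carrying the fields every log line should have.
const log = pino({
  base: { service: process.env.OTEL_SERVICE_NAME ?? "api" },
  timestamp: pino.stdTimeFunctions.isoTime,
  serializers: { err: pino.stdSerializers.err },
});

// Key-value fields instead of string interpolation: machines can parse these.
log.info({ orderId: "o_123", durationMs: 87 }, "order placed");
log.error({ err: new Error("upstream timeout"), route: "/checkout" }, "request failed");
```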
The single piece of telemetry code every service should ship with:
```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  // Tag every span with the service name and the exact build that produced it.
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME ?? "api",
    [ATTR_SERVICE_VERSION]: process.env.GIT_SHA ?? "dev",
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  // Auto-instruments http, express, pg, redis, and friends out of the box.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans before the process exits during a rolling deploy.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```
Once traces flow, set up SLOs. A 99.9% availability SLO on a public API with a 30-day window gives you a 43-minute error budget. If you burn through it, stop shipping features and fix the reliability debt. This is the single most powerful lever for aligning product and SRE priorities.
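To make the budget actionable, page on burn rate rather than raw error counts. A sketch of a fast-burn Prometheus alert for that 99.9% SLO, assuming RED-style request metrics; the metric names are placeholders:

```yaml
groups:
  - name: slo-api-availability
    rules:
      - alert: ErrorBudgetFastBurn
        # A 14.4x burn rate sustained for 1h consumes ~2% of a 30-day budget;
        # this is the standard fast-burn threshold from the Google SRE workbook.
        expr: |
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="api"}[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API is burning its 30-day error budget 14.4x too fast."
```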
Culture: The Part You Cannot `terraform apply`
Every engineering leader we've worked with has underestimated the cultural side of the transition. Tools are the easy part. The hard part is getting a team to agree that:
- Shared on-call is not punishment; it's feedback.
- Postmortems are blameless, written, and read by everyone.
- "It works on my machine" is a bug report, not an excuse.
- Dev owns the production behavior of their code.
You cannot buy this. You can only model it. The fastest way we've seen orgs move is when engineering leadership takes the first on-call shift with the team, owns the first postmortem, and publicly thanks the person who found their own bug.
A Twelve-Week Starter Plan
For a team that is starting from close to zero, here is what we'd sequence.
Weeks 1-2: Baseline. Measure the four DORA metrics. You cannot fix what you don't measure. Also audit every manual step from commit to prod — write it on a whiteboard.
Weeks 3-4: CI. Pick one service. Build a green CI with tests, lint, typecheck, security scan, container build. Nothing else yet.
Weeks 5-6: CD to staging. Automate the deploy to a staging environment. If you don't have staging, create it with IaC.
Weeks 7-8: IaC for one environment. Rebuild staging from Terraform. This will uncover a lot of configuration drift. Fix it.
Weeks 9-10: Observability. Structured logs, RED metrics, one SLO, one alert page. Route alerts to the team's PagerDuty or Opsgenie.
Weeks 11-12: Prod. Promote the pipeline to production. Gate with canary. Run a game day. Celebrate the first automated rollback.
Next Steps
DevOps is not complete in twelve weeks — it is never complete. But in twelve weeks a team can go from "we deploy on Friday and pray" to "we deploy hourly and the pipeline catches its own mistakes." From there, the next frontier is usually platform engineering: turning this pipeline into a golden path that other teams can consume.
If you want help getting from zero to trunk-based, production-grade CD on a real stack, get in touch. We do this work alongside your team and hand it over once you own it.