The Problem With "Doing DevOps"
Every engineering org we walk into says the same thing: "we do DevOps." Then we look at the pipeline and find 14 manual approval gates, a Jenkins box nobody owns, and a `deploy.sh` script with a TODO from 2022. DevOps is not a team, a tool, or a job title. It is a set of engineering practices that shorten the feedback loop between writing code and running it in production — safely, repeatedly, and without heroics.
This guide is for teams that have outgrown the script-and-pray stage but haven't yet built a delivery pipeline they trust. We'll skip the manifesto and get to what you actually need to build, in what order, and with which tools. The stack we reference is current as of early 2026.
Why It Still Matters In 2026
The 2025 DORA report once again showed the same pattern: elite performers deploy multiple times per day, recover from incidents in under an hour, and have change failure rates below 5%. Low performers deploy monthly, recover in days, and fail a third of the time. The gap between the two has nothing to do with team size or technology choice. It is captured almost entirely by four metrics:
- Deployment frequency — how often you ship code to production
- Lead time for changes — commit to production duration
- Change failure rate — percentage of deployments that cause incidents
- Mean time to recovery (MTTR) — how long to get back to green
If you measure nothing else, measure these four. They are the north star. Everything in this post either moves one of them or gets out of the way.
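None of the four needs fancy tooling to track. Here is a minimal sketch in TypeScript, computed from hypothetical deploy and incident records — the record shapes are ours, so adapt them to whatever your pipeline actually logs:

```ts
// Hypothetical record shapes; adapt to whatever your deploy tooling logs.
interface Deploy {
  commitAt: Date;          // when the change was committed
  deployedAt: Date;        // when it reached production
  causedIncident: boolean;
}
interface Incident {
  openedAt: Date;
  resolvedAt: Date;
}

const HOUR_MS = 3_600_000;

function doraMetrics(deploys: Deploy[], incidents: Incident[], windowDays: number) {
  const leadTimesHours = deploys
    .map((d) => (d.deployedAt.getTime() - d.commitAt.getTime()) / HOUR_MS)
    .sort((a, b) => a - b);
  return {
    deploysPerDay: deploys.length / windowDays,
    medianLeadTimeHours: leadTimesHours[Math.floor(leadTimesHours.length / 2)] ?? 0,
    changeFailureRate:
      deploys.filter((d) => d.causedIncident).length / Math.max(deploys.length, 1),
    meanTimeToRecoveryHours:
      incidents.reduce((sum, i) => sum + (i.resolvedAt.getTime() - i.openedAt.getTime()), 0) /
      Math.max(incidents.length, 1) / HOUR_MS,
  };
}
```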
The Four Capability Loops
Think of DevOps as four nested loops, each short enough that a feedback signal returns before you forget why you started.
| Loop | Duration | What you learn |
|---|---|---|
| Local dev | seconds | Does my code compile and pass unit tests? |
| CI | minutes | Does it integrate with main? Any regression? |
| Pre-prod | tens of minutes | Does it pass integration and security tests? |
| Production | hours to days | Does it hold up under real traffic? |
A healthy org invests in all four. A broken org has a ten-minute local loop, a forty-minute CI, no pre-prod, and a terrifying production deploy. Fix the worst one first.
Capability 1: Source Control And Trunk-Based Development
Start with the boring stuff. If you have long-lived feature branches that sit for two weeks before merging, nothing downstream will save you. Trunk-based development with short-lived branches (< 24 hours) and feature flags is the only pattern we recommend for teams above five people.
Branching rules we enforce on every client:
- `main` is always deployable. If it isn't, stop the line.
- PRs merge in under a day. If a PR sits for a week, split it.
- No direct pushes to `main`. Everything goes through review and CI.
- Feature flags (LaunchDarkly, Unleash, or OpenFeature) decouple deploy from release; see the sketch after this list.
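What the last rule looks like in code: a minimal sketch using the OpenFeature Node SDK. The flag name and the two checkout handlers are hypothetical, and a vendor provider (LaunchDarkly, Unleash, or anything else) is assumed to have been registered at startup:

```ts
import { OpenFeature } from "@openfeature/server-sdk";

// Hypothetical handlers for the old and new code paths.
declare function legacyCheckout(userId: string): Promise<void>;
declare function newCheckout(userId: string): Promise<void>;

// Assumes a provider was registered at startup, e.g.
// await OpenFeature.setProviderAndWait(new SomeVendorProvider(...));
const flags = OpenFeature.getClient();

export async function handleCheckout(userId: string): Promise<void> {
  // The new path is merged and deployed but dark; the flag controls release.
  const useNewFlow = await flags.getBooleanValue("new-checkout-flow", false, {
    targetingKey: userId,
  });
  return useNewFlow ? newCheckout(userId) : legacyCheckout(userId);
}
```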
Capability 2: Continuous Integration
CI is where most teams first go wrong. They either bolt on a single "run tests" job or build a 2000-line workflow that nobody understands. Aim for the middle.
Here is a minimal but production-ready GitHub Actions pipeline for a Node.js service. It caches dependencies, runs type checks, unit tests, security scans, builds a container, and pushes to a registry:
```yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:

# Cancel superseded runs on the same ref to save runner minutes.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # corepack must enable pnpm before setup-node tries to resolve the pnpm cache,
      # otherwise the cache step fails with "Unable to locate executable file: pnpm".
      - run: corepack enable
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm run lint
      - run: pnpm run typecheck
      - run: pnpm run test --coverage
      - uses: codecov/codecov-action@v4

  security:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # Fail the build on HIGH/CRITICAL findings in the filesystem scan.
      - uses: aquasecurity/[email protected]
        with:
          scan-type: fs
          severity: HIGH,CRITICAL
          exit-code: "1"

  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-24.04
    permissions:
      id-token: write   # required for OIDC auth to AWS
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci
          aws-region: eu-west-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: |
            ${{ steps.ecr.outputs.registry }}/api:${{ github.sha }}
            ${{ steps.ecr.outputs.registry }}/api:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
```
Two non-negotiables here. First, OIDC-based AWS auth — no long-lived access keys in secrets. Second, concurrency cancels superseded runs so a busy PR doesn't eat your runner budget.
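For reference, the role behind that OIDC auth is a few lines of Terraform. A sketch, assuming the GitHub OIDC identity provider already exists in the account; the account ID and the org/repo in the `sub` condition are placeholders:

```hcl
data "aws_iam_policy_document" "github_oidc" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Restrict the role to CI runs from main on one repo (placeholder org/repo).
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:your-org/your-repo:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "github_actions_ci" {
  name               = "github-actions-ci"
  assume_role_policy = data.aws_iam_policy_document.github_oidc.json
}
```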
Capability 3: Continuous Delivery And Deployment
CD is where "we automated the build" becomes "we automated the business risk." The distinction matters. Continuous delivery means every green commit could go to production. Continuous deployment means every green commit does go to production. Start with the former and graduate to the latter once you trust the signal.
Pick A Deployment Model
Three models dominate in 2026:
- GitOps with ArgoCD or Flux — declarative state in Git, reconciled by a controller. Our default for Kubernetes.
- Push-based CI/CD — the pipeline calls `kubectl apply` or similar. Simpler, but auditability suffers.
- PaaS abstractions — Cloud Run, Fly.io, Render. Great for small teams who want to stay out of the YAML business.
GitOps wins when you have multiple clusters, strict change control, or an auditor who wants to see who changed what and when.
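Concretely, the unit of GitOps in ArgoCD is an Application: a pointer from a cluster to a path in a manifests repo, reconciled continuously. A sketch with placeholder repo URL, path, and namespaces:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/deploy-manifests
    targetRevision: main
    path: apps/api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```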
Progressive Delivery By Default
Ship behind a canary or blue-green strategy from day one. Rolling update is the default in Kubernetes, but it gives you almost no safety. With Argo Rollouts you get a declarative canary in a few lines:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  # selector and pod template omitted -- they take the same shape as a Deployment
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api
            routes: [primary]
      steps:
        - setWeight: 10            # send 10% of traffic to the canary
        - pause: { duration: 2m }
        - analysis:                # auto-abort if the template's query fails
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: { duration: 5m }
```
The analysis step queries Prometheus and auto-aborts the rollout if error rate climbs. This is the single highest-leverage change most teams can make: stop treating rollback as a manual escalation and let the pipeline do it for you.
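The success-rate template it references is an ordinary AnalysisTemplate. A sketch, assuming an in-cluster Prometheus and RED-style request metrics; the address, metric names, and threshold are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 30s
      failureLimit: 3                     # three bad samples abort the rollout
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="api-canary",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="api-canary"}[2m]))
```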
Capability 4: Infrastructure As Code
If your production environment cannot be recreated from a Git repo, you do not have infrastructure — you have an artifact. IaC is the line between the two. We use Terraform for 90% of engagements, OpenTofu for clients who care about the fork, and Pulumi for shops that truly prefer programming languages.
Minimum viable IaC layout for a small team:
```
infra/
├── modules/
│   ├── network/
│   ├── eks/
│   └── rds/
├── envs/
│   ├── dev/
│   ├── staging/
│   └── prod/
└── .github/workflows/terraform.yaml
```
And here is a small module that creates an S3 bucket with sane defaults. Notice the validation, the versioning, the public-access block, and the tags — none of these are optional for a production bucket in 2026:
variable "name" {
type = string
description = "Bucket name suffix. Will be prefixed with environment."
validation {
condition = can(regex("^[a-z0-9-]{3,40}$", var.name))
error_message = "Name must be lowercase alphanumeric with dashes, 3-40 chars."
}
}
variable "environment" {
type = string
}
resource "aws_s3_bucket" "this" {
bucket = "${var.environment}-${var.name}"
force_destroy = var.environment != "prod"
tags = {
Environment = var.environment
ManagedBy = "terraform"
Module = "s3-bucket"
}
}
resource "aws_s3_bucket_versioning" "this" {
bucket = aws_s3_bucket.this.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_public_access_block" "this" {
bucket = aws_s3_bucket.this.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
bucket = aws_s3_bucket.this.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
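Consuming the module from an environment stack is then a single block. A sketch, assuming the module lives under `modules/s3-bucket` and is called from `envs/staging`:

```hcl
module "assets_bucket" {
  source      = "../../modules/s3-bucket"
  name        = "assets"
  environment = "staging"
}
```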
Plan In PRs, Apply On Merge
Run `terraform plan` on every pull request and post the output as a comment. Only run `terraform apply` when the PR merges to `main`. This gives reviewers the same "what will actually change" view they get from a code diff.
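A sketch of that split as a single GitHub Actions workflow, assuming the infra layout above, a staging stack, and the OIDC role from the CI section; the step that posts the plan as a PR comment is left to your bot of choice. (Atlantis or Terraform Cloud give you this flow off the shelf.)

```yaml
name: terraform
on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: [main]
    paths: ["infra/**"]
permissions:
  id-token: write   # OIDC auth to the cloud provider
  contents: read
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci
          aws-region: eu-west-1
      - run: terraform -chdir=infra/envs/staging init -input=false
      # Post this output as a PR comment with the commenting action of your choice.
      - run: terraform -chdir=infra/envs/staging plan -no-color -input=false
  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci
          aws-region: eu-west-1
      - run: terraform -chdir=infra/envs/staging init -input=false
      - run: terraform -chdir=infra/envs/staging apply -auto-approve -input=false
```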
Capability 5: Observability
You cannot operate what you cannot see. The three pillars are metrics, logs, and traces, but the honest priority order for a new team is:
- Structured logs first. If your logs aren't JSON, fix that before anything else (a sketch follows this list).
- RED metrics second. Rate, Errors, Duration per endpoint. Prometheus plus a Grafana dashboard gets you 80% of the value in an afternoon.
- Traces when you have more than three services. OpenTelemetry SDK, OTLP exporter, and a backend of your choice (Tempo, Honeycomb, Datadog).
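On the first point, a minimal sketch with pino; the library choice is ours, and any logger that emits one JSON object per line works:

```ts
import pino from "pino";

// One logger per service, carrying the fields every log line should have.
const log = pino({
  base: { service: process.env.OTEL_SERVICE_NAME ?? "api" },
  timestamp: pino.stdTimeFunctions.isoTime,
  serializers: { err: pino.stdSerializers.err },
});

// Key-value fields instead of string interpolation: machines can parse these.
log.info({ orderId: "o_123", durationMs: 87 }, "order placed");
log.error({ err: new Error("upstream timeout"), route: "/checkout" }, "request failed");
```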
The single piece of telemetry code every service should ship with:
```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  // Tag every span with the service name and the exact build that produced it.
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME ?? "api",
    [ATTR_SERVICE_VERSION]: process.env.GIT_SHA ?? "dev",
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  // Auto-instruments http, express, pg, redis, and friends out of the box.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans before the process exits during a rolling deploy.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```
Once traces flow, set up SLOs. A 99.9% availability SLO on a public API with a 30-day window gives you a 43-minute error budget. If you burn through it, stop shipping features and fix the reliability debt. This is the single most powerful lever for aligning product and SRE priorities.
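To make the budget actionable, page on burn rate rather than raw error counts. A sketch of a fast-burn Prometheus alert for that 99.9% SLO, assuming RED-style request metrics; the metric names are placeholders:

```yaml
groups:
  - name: slo-api-availability
    rules:
      - alert: ErrorBudgetFastBurn
        # A 14.4x burn rate sustained for 1h consumes ~2% of a 30-day budget;
        # this is the standard fast-burn threshold from the Google SRE workbook.
        expr: |
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="api"}[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API is burning its 30-day error budget 14.4x too fast."
```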
Culture: The Part You Cannot `terraform apply`
Every engineering leader we've worked with has underestimated the cultural side of the transition. Tools are the easy part. The hard part is getting a team to agree that:
- Shared on-call is not punishment; it's feedback.
- Postmortems are blameless, written, and read by everyone.
- "It works on my machine" is a bug report, not an excuse.
- Dev owns the production behavior of their code.
You cannot buy this. You can only model it. The fastest way we've seen orgs move is when engineering leadership takes the first on-call shift with the team, owns the first postmortem, and publicly thanks the person who found their own bug.
A Twelve-Week Starter Plan
For a team that is starting from close to zero, here is what we'd sequence.
Weeks 1-2: Baseline. Measure the four DORA metrics. You cannot fix what you don't measure. Also audit every manual step from commit to prod — write it on a whiteboard.
Weeks 3-4: CI. Pick one service. Build a green CI with tests, lint, typecheck, security scan, container build. Nothing else yet.
Weeks 5-6: CD to staging. Automate the deploy to a staging environment. If you don't have staging, create it with IaC.
Weeks 7-8: IaC for one environment. Rebuild staging from Terraform. This will uncover a lot of configuration drift. Fix it.
Weeks 9-10: Observability. Structured logs, RED metrics, one SLO, one alert page. Route alerts to the team's PagerDuty or Opsgenie.
Weeks 11-12: Prod. Promote the pipeline to production. Gate with canary. Run a game day. Celebrate the first automated rollback.
Next Steps
DevOps is not complete in twelve weeks — it is never complete. But in twelve weeks a team can go from "we deploy on Friday and pray" to "we deploy hourly and the pipeline catches its own mistakes." From there, the next frontier is usually platform engineering: turning this pipeline into a golden path that other teams can consume.
If you want help getting from zero to trunk-based, production-grade CD on a real stack, get in touch. We do this work alongside your team and hand it over once you own it.