SRE · 8 min read

Incident Management Done Right: SRE Practices for On-Call Teams

A comprehensive guide to building effective incident management processes, from alert design and on-call rotations to blameless postmortems and SLO-driven prioritization.

Why Incident Management Matters

Every production system will eventually fail. The difference between high-performing organizations and the rest is not whether incidents happen — it is how they respond when they do.

Google's Site Reliability Engineering practices have become the industry standard for managing reliability at scale. But you do not need Google-scale infrastructure to benefit from SRE principles. Even small teams can dramatically improve their incident response by adopting structured practices around alerting, escalation, communication, and learning from failures.

Step 1: Define SLOs Before You Define Alerts

The single biggest mistake teams make is alerting on metrics without connecting them to user impact. Your monitoring should be driven by Service Level Objectives (SLOs), not arbitrary thresholds.

Start by defining what your users care about:

# slo-definition.yaml
service: checkout-api
slos:
  - name: availability
    description: "Checkout API returns successful responses"
    target: 99.95%
    window: 30d
    sli:
      type: ratio
      good: 'http_requests_total{status=~"2..|3.."}'
      total: 'http_requests_total'

  - name: latency
    description: "Checkout completes within acceptable time"
    target: 99.0%
    window: 30d
    sli:
      type: ratio
      good: 'http_request_duration_seconds_bucket{le="0.5"}'
      total: 'http_request_duration_seconds_count'

With SLOs defined, you can calculate your error budget — the amount of unreliability you can tolerate. A 99.95% availability target gives you approximately 21.6 minutes of downtime per month. When your error budget is burning faster than expected, that is when you alert.
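The budget arithmetic above is worth making concrete. A minimal sketch (the 21.6-minute figure assumes a 30-day window):

```python
# Error budget arithmetic: downtime allowed per window for an availability target.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted over the window at the given SLO target."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes

print(error_budget_minutes(0.9995))  # ~21.6 minutes for 99.95% over 30 days
print(error_budget_minutes(0.999))   # ~43.2 minutes for 99.9%
```

Tightening the target by half a nine roughly halves the budget, which is why targets should be negotiated against real user expectations rather than picked for vanity.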

Implementing Error Budget Alerts with Prometheus

Use multi-window, multi-burn-rate alerting as recommended by the Google SRE book:

# prometheus-rules.yaml
groups:
  - name: slo-alerts
    rules:
      # Fast burn: 14.4x burn rate over 1 hour (fires within 1h)
      - alert: CheckoutHighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.0005)
          and
          (
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout API burning error budget 14.4x faster than normal"
          runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"
          dashboard: "https://grafana.internal/d/checkout-slo"

      # Slow burn: 3x burn rate over 3 days (catches gradual degradation)
      - alert: CheckoutSlowErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[3d]))
            /
            sum(rate(http_requests_total{service="checkout"}[3d]))
          ) > (3 * 0.0005)
        for: 1h
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout API slowly burning through error budget"

This approach eliminates noisy alerts that do not reflect actual user impact. You page humans only when the service is violating its promise to users.
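The 14.4x and 3x multipliers are not arbitrary. A burn rate of N consumes the whole budget in (SLO period / N), so you can derive how urgent each alert is. A sketch of that arithmetic, assuming the 30-day SLO period from the definitions above:

```python
# Burn-rate arithmetic behind multi-window alert thresholds.
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO window = 720 hours

def hours_to_exhaust_budget(burn_rate: float) -> float:
    """At a constant burn rate, hours until the entire error budget is gone."""
    return SLO_PERIOD_HOURS / burn_rate

def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the budget consumed during the alert window at that rate."""
    return burn_rate * alert_window_hours / SLO_PERIOD_HOURS

print(hours_to_exhaust_budget(14.4))    # ~50 hours: budget gone in about 2 days
print(budget_consumed(14.4, 1))         # ~0.02: 2% of the budget burned in 1h
print(hours_to_exhaust_budget(3) / 24)  # ~10 days at the slow-burn rate
```

In other words: the fast-burn alert pages a human because the budget would be gone in two days, while the slow-burn alert only warns because there are still ten days of runway.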

Step 2: Build an On-Call Rotation That Does Not Burn People Out

On-call should be sustainable. Teams that burn out their on-call engineers end up with high turnover, low morale, and ironically, worse reliability. Follow these principles:

  • Minimum two people per rotation — Primary and secondary, so no one carries the load alone
  • Regular rotation cadence — Weekly rotations are standard; longer than two weeks leads to fatigue
  • Compensate on-call time — Whether through extra pay, time off, or both
  • Maximum interrupt frequency — If a team gets paged more than twice per on-call shift, invest in reliability improvements before adding more features
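The interrupt-frequency principle is easy to automate as a monthly check. A sketch, assuming you export per-shift page counts from your paging tool (the data shape here is illustrative):

```python
# Hypothetical check for the "at most two pages per shift" principle above.
# Page counts per shift would come from your paging tool's reports or API.
def overloaded_shifts(pages_per_shift: dict[str, int], max_pages: int = 2) -> list[str]:
    """Return the engineers whose shifts exceeded the interrupt budget."""
    return [who for who, pages in pages_per_shift.items() if pages > max_pages]

shifts = {"alice": 1, "bob": 5, "carol": 0, "dave": 3}
print(overloaded_shifts(shifts))  # ['bob', 'dave'] -> time to pay down reliability debt
```

If the list is non-empty two months running, that is a planning signal, not an individual-performance signal.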

Here is a PagerDuty schedule configured via Terraform:

# pagerduty.tf
resource "pagerduty_schedule" "checkout_oncall" {
  name      = "Checkout Team On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Primary"
    start                        = "2025-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2025-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800 # 1 week

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.dave.id,
    ]
  }

  layer {
    name                         = "Secondary"
    start                        = "2025-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2025-01-08T00:00:00-05:00"
    rotation_turn_length_seconds = 604800

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.dave.id,
    ]
  }
}

resource "pagerduty_escalation_policy" "checkout" {
  name      = "Checkout Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.checkout_oncall.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}

Step 3: Establish an Incident Response Framework

When an incident occurs, chaos is the enemy. Define clear roles and a structured process before you need them.

Incident Severity Levels

| Severity | Definition | Response Time | Communication |
|----------|------------|---------------|---------------|
| SEV-1 | Service completely down, all users affected | 5 minutes | War room, status page, exec notification |
| SEV-2 | Major feature degraded, significant user impact | 15 minutes | War room, status page |
| SEV-3 | Minor degradation, limited user impact | 30 minutes | Team channel notification |
| SEV-4 | No user impact, potential future issue | Next business day | Ticket created |

Incident Roles

Every SEV-1 or SEV-2 incident should have these roles explicitly assigned:

  • Incident Commander (IC) — Coordinates response, makes decisions, drives resolution
  • Technical Lead — Diagnoses root cause, implements fixes
  • Communications Lead — Updates stakeholders, status page, and customers
  • Scribe — Documents timeline, actions taken, and key decisions

Automated Incident Creation with Slack and PagerDuty

Use a bot to standardize incident creation. Here is a simplified incident bot using Python:

# incident_bot.py
import slack_sdk
import requests
from datetime import datetime

class IncidentManager:
    def __init__(self, slack_token, pagerduty_token):
        self.slack = slack_sdk.WebClient(token=slack_token)
        self.pd_token = pagerduty_token

    def declare_incident(self, severity, title, reporter):
        # Create dedicated incident channel
        timestamp = datetime.now().strftime("%Y%m%d-%H%M")
        channel_name = f"inc-{timestamp}-{severity}"

        channel = self.slack.conversations_create(
            name=channel_name,
            is_private=False
        )
        channel_id = channel["channel"]["id"]

        # Post incident template
        self.slack.chat_postMessage(
            channel=channel_id,
            blocks=[
                {
                    "type": "header",
                    "text": {"type": "plain_text", "text": f"🚨 {severity.upper()}: {title}"}
                },
                {
                    "type": "section",
                    "fields": [
                        {"type": "mrkdwn", "text": f"*Reporter:* {reporter}"},
                        {"type": "mrkdwn", "text": f"*Status:* Investigating"},
                        {"type": "mrkdwn", "text": f"*IC:* Unassigned"},
                        {"type": "mrkdwn", "text": f"*Started:* {datetime.now().isoformat()}"},
                    ]
                },
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": (
                            "*Checklist:*\n"
                            "☐ Assign Incident Commander\n"
                            "☐ Assess user impact\n"
                            "☐ Update status page\n"
                            "☐ Identify mitigation steps\n"
                            "☐ Implement fix or rollback\n"
                            "☐ Verify resolution\n"
                            "☐ Schedule postmortem"
                        )
                    }
                }
            ]
        )

        # Trigger PagerDuty if SEV-1 or SEV-2
        if severity in ("sev-1", "sev-2"):
            self._trigger_pagerduty(title, severity, channel_name)

        return channel_id

    def _trigger_pagerduty(self, title, severity, channel):
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": "YOUR_INTEGRATION_KEY",
                "event_action": "trigger",
                "payload": {
                    "summary": f"[{severity.upper()}] {title}",
                    "severity": "critical" if severity == "sev-1" else "error",
                    "source": f"slack/{channel}",
                }
            }
        )

Step 4: Write Blameless Postmortems

The postmortem is where organizational learning happens. Without it, you are doomed to repeat the same failures. The key word is blameless — focus on systems and processes, not individuals.

Every postmortem should cover:

Postmortem Template

## Incident Summary
- **Date:** 2025-12-15
- **Duration:** 47 minutes
- **Severity:** SEV-2
- **Impact:** 12% of checkout requests failed for users in EU region

## Timeline (UTC)
- 14:23 — Monitoring detects elevated 5xx rate on checkout-api-eu
- 14:25 — PagerDuty pages on-call engineer (Alice)
- 14:28 — Alice joins incident channel, assumes IC role
- 14:35 — Root cause identified: database connection pool exhausted
- 14:42 — Mitigation applied: increased pool size via config change
- 14:50 — Error rates return to normal
- 15:10 — Incident resolved, monitoring confirmed stable

## Root Cause
A background migration job ran without connection limits,
consuming all available database connections in the EU region pool.

## Contributing Factors
1. Migration jobs share the same connection pool as production traffic
2. No connection limit enforced on batch operations
3. Connection pool exhaustion alert had a 15-minute delay

## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| Separate connection pool for batch jobs | Bob | P1 | 2025-12-22 |
| Add connection pool utilization alert | Carol | P1 | 2025-12-20 |
| Document migration runbook | Alice | P2 | 2026-01-05 |
| Load test batch operations | Dave | P2 | 2026-01-10 |

## Lessons Learned
- **What went well:** Fast detection (2 min), clear escalation, mitigation applied quickly
- **What could improve:** Batch jobs need resource isolation, need better pre-migration checklist

Schedule the postmortem review within 48 hours of the incident while details are fresh. Make attendance mandatory for the team and optional for the rest of the organization. Publish the postmortem widely — transparency builds trust and prevents repeated mistakes.

Step 5: Measure and Improve Over Time

Track these incident management metrics monthly:

  • MTTD (Mean Time to Detect) — How quickly you notice problems
  • MTTA (Mean Time to Acknowledge) — How quickly someone responds to a page
  • MTTR (Mean Time to Resolve) — Total time from detection to resolution
  • Incident frequency by severity — Trending up or down?
  • Postmortem action item completion rate — Are you actually learning?
  • Alert noise ratio — Percentage of alerts that required no human action
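The timing metrics above fall out of a few timestamp differences. A sketch of the computation; the field names (`started_at`, `detected_at`, `acknowledged_at`, `resolved_at`) are illustrative, so map them to whatever your incident tracker exports:

```python
# Sketch: computing MTTD / MTTA / MTTR from exported incident records.
from datetime import datetime
from statistics import mean

def mean_minutes(incidents, start_field, end_field) -> float:
    """Mean elapsed minutes between two timestamp fields across incidents."""
    deltas = [
        (datetime.fromisoformat(i[end_field]) - datetime.fromisoformat(i[start_field]))
        .total_seconds() / 60
        for i in incidents
    ]
    return mean(deltas)

incidents = [
    {"started_at": "2025-12-15T14:21:00", "detected_at": "2025-12-15T14:23:00",
     "acknowledged_at": "2025-12-15T14:25:00", "resolved_at": "2025-12-15T15:10:00"},
]

print(mean_minutes(incidents, "started_at", "detected_at"))      # MTTD: 2.0
print(mean_minutes(incidents, "detected_at", "acknowledged_at"))  # MTTA: 2.0
print(mean_minutes(incidents, "detected_at", "resolved_at"))     # MTTR: 47.0
```

Segment these by severity before trending them; averaging a SEV-1 and a SEV-4 together hides the signal you care about.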

Build a Grafana dashboard that tracks these over time. If MTTR is not improving quarter over quarter, your postmortem process needs attention.

Tools We Recommend

  • Alerting and on-call: PagerDuty, Opsgenie, or Grafana OnCall
  • Incident coordination: Slack with Rootly, incident.io, or FireHydrant
  • Status pages: Atlassian Statuspage, Instatus, or Cachet
  • SLO tracking: Sloth, Pyrra, or Nobl9
  • Postmortem management: Jeli, Blameless, or a simple shared doc template

Conclusion

Effective incident management is not about preventing every outage — it is about building systems and processes that minimize impact and maximize learning. The organizations that handle incidents well are the ones that practice before they need to, invest in tooling, and treat every postmortem as a gift.

At DevOpsVibe, we help teams build world-class incident management processes from the ground up. From designing SLO-driven alerting to implementing automated incident workflows and training your team on postmortem facilitation, we bring real-world SRE experience to your organization. Let us help you sleep better on-call.
