Monitoring · 6 min read

Building a Production Observability Stack with Prometheus and Grafana

A hands-on guide to deploying a full observability stack with Prometheus, Grafana, Alertmanager, and Loki for production Kubernetes environments.

Beyond Basic Monitoring

There is a meaningful difference between monitoring and observability. Monitoring tells you when something is wrong. Observability tells you why. A production-grade observability stack combines metrics, logs, and traces into a unified view that lets your team diagnose issues in minutes instead of hours.

In this guide, we walk through the stack we deploy most frequently at DevOpsVibe: Prometheus for metrics, Grafana for visualization, Alertmanager for notifications, and Loki for log aggregation, all running on Kubernetes.

Architecture Overview

The stack consists of four primary components:

  • Prometheus -- scrapes and stores time-series metrics from your services and infrastructure
  • Grafana -- provides dashboards, exploration tools, and a unified query interface
  • Alertmanager -- handles alert routing, deduplication, grouping, and silencing
  • Loki -- a horizontally-scalable log aggregation system designed to work with Grafana

Together, these tools give you the three pillars of observability: metrics, logs, and (with the addition of Tempo) traces.

Deploying the Stack with Helm

The fastest path to production is the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, Alertmanager, and a set of preconfigured recording rules and dashboards:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values custom-values.yaml

Custom Values

Here is a production-tuned custom-values.yaml:

# custom-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        memory: 4Gi
        cpu: "2"
      limits:
        memory: 8Gi
    scrapeInterval: 30s
    evaluationInterval: 30s

grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: custom
          orgId: 1
          folder: "Custom"
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/custom

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
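
One note on the Grafana admin password above: rather than interpolating it into the values file, the bundled Grafana chart can read credentials from an existing Kubernetes Secret. A minimal sketch, assuming a Secret named grafana-admin with admin-user and admin-password keys (both names are placeholders):

# custom-values.yaml (alternative to adminPassword; Secret name and keys are placeholders)
grafana:
  admin:
    existingSecret: grafana-admin
    userKey: admin-user
    passwordKey: admin-password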

Configuring Service Discovery

Prometheus automatically discovers targets in Kubernetes using ServiceMonitor and PodMonitor custom resources. To monitor a new application, create a ServiceMonitor:

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-api
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: my-api
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      scrapeTimeout: 10s

The key detail many teams miss: the release: monitoring label must match the Helm release name, because by default kube-prometheus-stack only selects ServiceMonitors that carry it. Without that label, Prometheus will not pick up your ServiceMonitor.
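
For workloads that expose a metrics port on pods without a backing Service (a DaemonSet agent, for example), a PodMonitor works the same way. A minimal sketch, where my-agent and its labels are placeholder names:

# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-agent
  namespace: monitoring
  labels:
    release: monitoring   # must match the Helm release name, same as ServiceMonitors
spec:
  selector:
    matchLabels:
      app: my-agent
  namespaceSelector:
    matchNames:
      - production
  podMetricsEndpoints:
    - port: metrics
      interval: 15s
      path: /metrics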

Building Effective Dashboards

A common mistake is creating dashboards with dozens of panels that no one looks at. Instead, follow the RED and USE methods:

The RED Method (for services)

  • Rate -- requests per second
  • Errors -- error rate as a percentage
  • Duration -- latency percentiles (p50, p95, p99)

The USE Method (for resources)

  • Utilization -- percentage of resource capacity in use
  • Saturation -- amount of work queued
  • Errors -- count of error events
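
On busy dashboards, the RED queries are worth precomputing with recording rules so panels load instantly (more on this under operational tips). A sketch of the three signals as a PrometheusRule, assuming the same http_requests_total and http_request_duration_seconds_bucket metrics used in the examples below:

# red-recording-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: red-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: red.rules
      rules:
        # Rate: requests per second per service
        - record: service:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m])) by (service)
        # Errors: share of requests answered with a 5xx
        - record: service:http_requests_errors:ratio_rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
        # Duration: p99 latency per service
        - record: service:http_request_duration_seconds:p99_5m
          expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))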

Here is a Grafana dashboard panel definition for API latency using PromQL:

{
  "title": "API Latency (p99)",
  "type": "timeseries",
  "targets": [
    {
      "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"my-api\"}[5m])) by (le))",
      "legendFormat": "p99"
    },
    {
      "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"my-api\"}[5m])) by (le))",
      "legendFormat": "p95"
    },
    {
      "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"my-api\"}[5m])) by (le))",
      "legendFormat": "p50"
    }
  ]
}

Alerting That Does Not Cause Fatigue

Alert fatigue is the number one reason monitoring investments fail. Engineers start ignoring alerts, and then a real incident goes unnoticed. Here is how we configure Alertmanager to prevent that:

# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: "${SLACK_WEBHOOK_URL}"

route:
  receiver: "default"
  group_by: ["alertname", "namespace", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 4h

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts-default"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical

  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts-warnings"

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "namespace"]

Writing Good Alert Rules

Every alert should be actionable. If an engineer receives an alert and does not know what to do, the alert should not exist.

# alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
spec:
  groups:
    - name: api.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
            > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "{{ $labels.service }} is returning 5xx errors at {{ $value | humanizePercentage }} over the last 5 minutes."
            runbook: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
            > 2.0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High p99 latency on {{ $labels.service }}"
            description: "p99 latency for {{ $labels.service }} is {{ $value }}s, exceeding 2s threshold."

Adding Log Aggregation with Loki

Deploy Loki alongside your Prometheus stack to unify metrics and logs in Grafana:

helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi \
  --set loki.persistence.storageClassName=gp3

Then add Loki as a data source in Grafana. The power of this setup is the ability to jump from a metric spike directly to the corresponding logs using Grafana's split view and label matching.
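
The Loki data source can also be provisioned as code instead of being added by hand in the UI. With kube-prometheus-stack, one option is the bundled Grafana chart's additionalDataSources value; a sketch, assuming the default in-cluster service created by the loki-stack chart:

# custom-values.yaml (addition)
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.monitoring.svc.cluster.local:3100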

A typical LogQL query to find errors:

{namespace="production", app="my-api"} |= "error" | json | level="error" | line_format "{{.timestamp}} {{.message}}"

SLO-Based Monitoring

Rather than alerting on arbitrary thresholds, define Service Level Objectives and alert on burn rates:

  • SLI: The ratio of successful requests to total requests
  • SLO: 99.9% of requests should succeed over a 30-day window
  • Error budget: 0.1% of requests can fail (approximately 43 minutes of downtime per month)

Alert when your error budget is burning too fast. The 14.4 multiplier follows the Google SRE Workbook's burn-rate alerting: at 14.4 times the allowed error rate, a one-hour window consumes roughly 2% of a 30-day budget, which is worth paging on:

- alert: ErrorBudgetBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      /
      sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning fast for {{ $labels.service }}"

Operational Tips

  • Right-size your retention. Thirty days of high-resolution data is a sensible default; offload downsampled long-term storage to Thanos or Cortex.
  • Use recording rules to precompute expensive queries. This dramatically improves dashboard load times.
  • Separate alerting Prometheus from dashboarding Prometheus in large environments to avoid query load affecting alert evaluation.
  • Store dashboards as code in version control using Grafana's provisioning system or tools like Grafonnet.
  • Test your alerts. Use promtool to unit test alert rules before deploying them (a sketch follows this list).
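
Here is a minimal sketch of such a unit test for the HighErrorRate rule defined earlier, assuming its group has been extracted from the PrometheusRule into a plain Prometheus rules file named api-rules.yaml; the input series and values are made up for illustration:

# alert-rules-test.yaml -- run with: promtool test rules alert-rules-test.yaml
rule_files:
  - api-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # roughly 9% of requests fail, well above the 5% threshold
      - series: 'http_requests_total{service="my-api", status="500"}'
        values: '0+10x10'
      - series: 'http_requests_total{service="my-api", status="200"}'
        values: '0+100x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: my-api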

Conclusion

A well-built observability stack is not a luxury -- it is the foundation that makes everything else possible. Without it, incident response is guesswork, capacity planning is a gamble, and performance optimization is shooting in the dark.

At DevOpsVibe, we design and deploy observability platforms tailored to your infrastructure. From initial setup to custom dashboards and SLO frameworks, we make sure your team has the visibility it needs. Get in touch to discuss your monitoring needs.

filed under
monitoring · observability · prometheus · grafana · kubernetes · alerting