Beyond Basic Monitoring
There is a meaningful difference between monitoring and observability. Monitoring tells you when something is wrong. Observability tells you why. A production-grade observability stack combines metrics, logs, and traces into a unified view that lets your team diagnose issues in minutes instead of hours.
In this guide, we walk through the stack we deploy most frequently at DevOpsVibe: Prometheus for metrics, Grafana for visualization, Alertmanager for notifications, and Loki for log aggregation. All running on Kubernetes.
Architecture Overview
The stack consists of four primary components:
- Prometheus -- scrapes and stores time-series metrics from your services and infrastructure
- Grafana -- provides dashboards, exploration tools, and a unified query interface
- Alertmanager -- handles alert routing, deduplication, grouping, and silencing
- Loki -- a horizontally-scalable log aggregation system designed to work with Grafana
Together, these tools give you the three pillars of observability: metrics, logs, and (with the addition of Tempo) traces.
Deploying the Stack with Helm
The fastest path to production is the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, Alertmanager, and a set of preconfigured recording rules and dashboards:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values custom-values.yaml
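Once the release is installed, confirm the pods are healthy and take a first look at Grafana. With a release named monitoring, the Grafana service is typically called monitoring-grafana, though the exact name is a chart convention worth verifying with kubectl get svc:
kubectl get pods -n monitoring
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring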
Custom Values
Here is a production-tuned custom-values.yaml:
# custom-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        memory: 4Gi
        cpu: "2"
      limits:
        memory: 8Gi
    scrapeInterval: 30s
    evaluationInterval: 30s
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: custom
          orgId: 1
          folder: "Custom"
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/custom
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
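One caveat: Helm does not expand shell-style placeholders such as ${GRAFANA_ADMIN_PASSWORD} on its own. A common workaround, an assumption about your workflow rather than a chart feature, is to pipe the values file through envsubst before handing it to Helm on stdin:
export GRAFANA_ADMIN_PASSWORD="a-strong-password"   # hypothetical value
envsubst < custom-values.yaml | helm upgrade --install monitoring \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring --values -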
Configuring Service Discovery
Prometheus automatically discovers targets in Kubernetes using ServiceMonitor and PodMonitor custom resources. To monitor a new application, create a ServiceMonitor:
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-api
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: my-api
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      scrapeTimeout: 10s
The key detail many teams miss: the release: monitoring label must match the Helm release name. Without it, Prometheus will not pick up your ServiceMonitor.
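Another easy miss: port: metrics refers to a named port on the Service, not a port number. A minimal matching Service might look like this (a sketch, assuming the app serves metrics on container port 9090):
apiVersion: v1
kind: Service
metadata:
  name: my-api
  namespace: production
  labels:
    app: my-api          # matched by the ServiceMonitor selector
spec:
  selector:
    app: my-api
  ports:
    - name: metrics      # matched by "port: metrics" in the ServiceMonitor
      port: 9090
      targetPort: 9090
Once both resources are applied, the new target should appear under Status > Targets in the Prometheus UI within one scrape interval.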
Building Effective Dashboards
A common mistake is creating dashboards with dozens of panels that no one looks at. Instead, follow the RED and USE methods:
The RED Method (for services)
- Rate -- requests per second
- Errors -- error rate as a percentage
- Duration -- latency percentiles (p50, p95, p99)
The USE Method (for resources)
- Utilization -- percentage of resource capacity in use
- Saturation -- amount of work queued
- Errors -- count of error events
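Expressed in PromQL, the RED signals look like the queries below, assuming your services expose the common http_requests_total counter and http_request_duration_seconds histogram (adjust the metric names to your instrumentation):
# Rate: requests per second
sum(rate(http_requests_total{service="my-api"}[5m]))
# Errors: 5xx responses as a fraction of all requests
sum(rate(http_requests_total{service="my-api", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="my-api"}[5m]))
# Duration: computed with histogram_quantile, as the panel below shows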
Here is a Grafana dashboard panel definition for API latency using PromQL:
{
  "title": "API Latency (p99)",
  "type": "timeseries",
  "targets": [
    {
      "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"my-api\"}[5m])) by (le))",
      "legendFormat": "p99"
    },
    {
      "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"my-api\"}[5m])) by (le))",
      "legendFormat": "p95"
    },
    {
      "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"my-api\"}[5m])) by (le))",
      "legendFormat": "p50"
    }
  ]
}
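Quantile queries over histogram buckets are among the most expensive expressions Prometheus evaluates, and every dashboard refresh re-runs them. A recording rule precomputes the result on the evaluation interval; a minimal sketch that could live in the PrometheusRule shown later in this guide (the rule name follows the level:metric:operation convention but is our own choice):
groups:
  - name: api.recording.rules
    rules:
      - record: service:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
The panel's expr can then reference service:http_request_duration_seconds:p99_5m directly, turning an expensive query into a cheap lookup.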
Alerting That Does Not Cause Fatigue
Alert fatigue is the number one reason monitoring investments fail. Engineers start ignoring alerts, and then a real incident goes unnoticed. Here is how we configure Alertmanager to prevent that:
# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: "${SLACK_WEBHOOK_URL}"
route:
  receiver: "default"
  group_by: ["alertname", "namespace", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 4h
receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts-default"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical
  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts-warnings"
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "namespace"]
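Before deploying a routing change, validate the configuration and dry-run the routing tree with amtool, the CLI that ships with Alertmanager:
amtool check-config alertmanager-config.yaml
amtool config routes test --config.file=alertmanager-config.yaml severity=critical
The second command prints which receiver a given label set would be routed to, here confirming that severity=critical lands on pagerduty-critical.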
Writing Good Alert Rules
Every alert should be actionable. If an engineer receives an alert and does not know what to do, the alert should not exist.
# alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
  labels:
    release: monitoring  # required so Prometheus picks up the rule, as with ServiceMonitors
spec:
  groups:
    - name: api.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
            > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "{{ $labels.service }} is returning 5xx errors at {{ $value | humanizePercentage }} over the last 5 minutes."
            runbook: "https://wiki.example.com/runbooks/high-error-rate"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
            > 2.0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High p99 latency on {{ $labels.service }}"
            description: "p99 latency for {{ $labels.service }} is {{ $value }}s, exceeding the 2s threshold."
Adding Log Aggregation with Loki
Deploy Loki alongside your Prometheus stack to unify metrics and logs in Grafana:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi \
  --set loki.persistence.storageClassName=gp3
Then add Loki as a data source in Grafana. The power of this setup is the ability to jump from a metric spike directly to the corresponding logs using Grafana's split view and label matching.
A typical LogQL query to find errors:
{namespace="production", app="my-api"} |= "error" | json | level="error" | line_format "{{.timestamp}} {{.message}}"
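LogQL also supports metric queries over log streams, which is useful for quick error counting when no Prometheus counter exists for the condition:
sum by (app) (count_over_time({namespace="production", app="my-api"} |= "error" [5m]))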
SLO-Based Monitoring
Rather than alerting on arbitrary thresholds, define Service Level Objectives and alert on burn rates:
- SLI: The ratio of successful requests to total requests
- SLO: 99.9% of requests should succeed over a 30-day window
- Error budget: 0.1% of requests can fail (approximately 43 minutes of downtime per month)
Alert when your error budget is burning too fast. The 14.4 multiplier is the standard fast-burn factor: at that error rate, roughly 2% of a 30-day budget burns in a single hour, and the entire budget is gone in about two days:
- alert: ErrorBudgetBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      /
      sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning fast for {{ $labels.service }}"
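A refinement worth considering is the multi-window burn-rate pattern from the Google SRE Workbook: require a short window to agree with the long one, so the alert fires quickly but also clears quickly once the bleeding stops. A sketch using the same metrics as above:
- alert: ErrorBudgetBurnRateMultiWindow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      /
      sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
      sum(rate(http_requests_total[5m])) by (service)
    ) > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "Fast burn on {{ $labels.service }} confirmed by both 1h and 5m windows"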
Operational Tips
- Right-size your retention. Keep roughly 30 days of high-resolution data locally, and offload older data to downsampled long-term storage via Thanos or Cortex.
- Use recording rules to precompute expensive queries. This dramatically improves dashboard load times.
- Separate alerting Prometheus from dashboarding Prometheus in large environments to avoid query load affecting alert evaluation.
- Store dashboards as code in version control using Grafana's provisioning system or tools like Grafonnet.
- Test your alerts. Use promtool to unit test alert rules before deploying them, as in the sketch below.
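promtool works on the plain Prometheus rules format, i.e. the groups: block inside the PrometheusRule spec rather than the full custom resource, so extract that section into its own file first. A minimal sketch of a unit test for the HighErrorRate rule above (the file names and series values are our own):
# rules-test.yaml
rule_files:
  - extracted-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Sustained ~9% error rate, well above the 5% threshold
      - series: 'http_requests_total{service="my-api", status="500"}'
        values: '0+10x20'
      - series: 'http_requests_total{service="my-api", status="200"}'
        values: '0+100x20'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: my-api
Run it with:
promtool test rules rules-test.yaml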
Conclusion
A well-built observability stack is not a luxury -- it is the foundation that makes everything else possible. Without it, incident response is guesswork, capacity planning is a gamble, and performance optimization is shooting in the dark.
At DevOpsVibe, we design and deploy observability platforms tailored to your infrastructure. From initial setup to custom dashboards and SLO frameworks, we make sure your team has the visibility it needs. Get in touch to discuss your monitoring needs.