An Observability Stack From Scratch — Prometheus, Grafana, Loki, Tempo
This is the stack I deploy first on every cluster I own. It's boring on purpose — all the components are from the Grafana/Prometheus community charts; the value is in the wiring.
Components
| Signal | Component | Storage |
|---|---|---|
| Metrics | kube-prometheus-stack | Thanos + S3 |
| Logs | Loki (distributed) | S3 |
| Traces | Tempo | S3 |
| Dashboards | Grafana | PostgreSQL for config |
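The metrics row, for example, is wired by pointing the kube-prometheus-stack chart at a Thanos object-storage secret. A sketch of the values fragment — the secret name and key are assumptions, and the exact schema varies by chart version:

```yaml
# values.yaml fragment for kube-prometheus-stack (sketch).
# Attaches the Thanos sidecar and tells it where the S3 config lives.
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore   # assumed secret holding the S3 bucket config
          key: objstore.yml
```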
The wiring that matters
Every service exports:
- `/metrics` with Prometheus scrape-friendly labels, including `trace_id` as an exemplar.
- Logs in JSON with `trace_id` as a field.
- OpenTelemetry spans on the OTLP/HTTP endpoint.
That one shared trace_id is what lets Grafana jump from a metric spike → a set of logs → a specific trace.
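The metric-to-log-to-trace jump is configured on the Grafana side as a derived field on the Loki datasource. A provisioning sketch — the URL, datasource UID, and regex are assumptions that depend on your log format:

```yaml
# Grafana datasource provisioning (sketch). The derived field turns a
# trace_id in a JSON log line into a clickable link to the Tempo datasource.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.monitoring.svc   # assumed in-cluster service name
    jsonData:
      derivedFields:
        - name: TraceID
          datasourceUid: tempo                # assumed UID of the Tempo datasource
          matcherRegex: '"trace_id":"(\w+)"'  # assumes JSON logs with a trace_id field
          url: "$${__value.raw}"
```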
Dashboards as code
All Grafana dashboards come from a git repo provisioned via the Grafana sidecar. Changes are PR-reviewed. The UI is read-only for everyone except platform engineers.
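With the sidecar approach, each dashboard in the repo ends up as a labeled ConfigMap that the sidecar picks up and loads. A minimal sketch, assuming the sidecar's default `grafana_dashboard` label and a `monitoring` namespace:

```yaml
# One dashboard, delivered as a ConfigMap (sketch).
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-api-overview
  namespace: monitoring           # assumed; wherever Grafana runs
  labels:
    grafana_dashboard: "1"        # default label the Grafana sidecar watches
data:
  api-overview.json: |
    { "title": "API Overview", "panels": [], "schemaVersion": 39 }
```

In practice a CI job renders these ConfigMaps from the dashboard JSON in the repo, so the PR review happens on the source, not on the cluster object.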
The SLO burn-rate alerting pattern
```yaml
- alert: APIFastBurn
  expr: |
    sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="api"}[5m])) > 0.14
  for: 5m
```
That's the fast-burn arm of the multi-window, multi-burn-rate pattern. Assuming a 99% availability SLO (a 1% error budget), an error ratio above 0.14 is a 14× burn rate, which would exhaust the 28-day budget in two days (28 / 14). Page on that.
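The pattern gets its name from pairing that fast arm with a slower, lower-threshold arm that opens a ticket instead of paging. A sketch, using the same arithmetic (assuming the same 99% SLO, so a 2× burn is an error ratio above 0.02):

```yaml
# Slow burn (sketch): 2x the budget, measured over 6h. At this rate the
# 28-day budget lasts 14 days -- worth a ticket, not a page.
- alert: APISlowBurn
  expr: |
    sum(rate(http_requests_total{job="api",code=~"5.."}[6h]))
      /
    sum(rate(http_requests_total{job="api"}[6h])) > 0.02
  for: 30m
  labels:
    severity: ticket    # assumed routing label; match your Alertmanager config
```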