An Observability Stack From Scratch — Prometheus, Grafana, Loki, Tempo
This is the stack I deploy first on every cluster I own. It's boring on purpose — all the components are from the Grafana/Prometheus community charts; the value is in the wiring.
Components
| Signal | Component | Storage |
|---|---|---|
| Metrics | kube-prometheus-stack | Thanos + S3 |
| Logs | Loki (distributed) | S3 |
| Traces | Tempo | S3 |
| Dashboards | Grafana | PostgreSQL for config |
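The metrics row, for example, is wired by pointing the kube-prometheus-stack chart at a Thanos object-storage secret. A sketch of the values fragment — the secret name and key are assumptions, and the exact schema varies by chart version:

```yaml
# values.yaml fragment for kube-prometheus-stack (sketch).
# Attaches the Thanos sidecar and tells it where the S3 config lives.
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore   # assumed secret holding the S3 bucket config
          key: objstore.yml
```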
The wiring that matters
Every service exports:
- `/metrics` with Prometheus scrape-friendly labels, including `trace_id` as an exemplar.
- Logs in JSON with `trace_id` as a field.
- OpenTelemetry spans on the OTLP/HTTP endpoint.
That one shared trace_id is what lets Grafana jump from a metric spike → a set of logs → a specific trace.
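The metric-to-log-to-trace jump is configured on the Grafana side as a derived field on the Loki datasource. A provisioning sketch — the URL, datasource UID, and regex are assumptions that depend on your log format:

```yaml
# Grafana datasource provisioning (sketch). The derived field turns a
# trace_id in a JSON log line into a clickable link to the Tempo datasource.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.monitoring.svc   # assumed in-cluster service name
    jsonData:
      derivedFields:
        - name: TraceID
          datasourceUid: tempo                # assumed UID of the Tempo datasource
          matcherRegex: '"trace_id":"(\w+)"'  # assumes JSON logs with a trace_id field
          url: "$${__value.raw}"
```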
Dashboards as code
All Grafana dashboards come from a git repo provisioned via the Grafana sidecar. Changes are PR-reviewed. The UI is read-only for everyone except platform engineers.
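With the sidecar approach, each dashboard in the repo ends up as a labeled ConfigMap that the sidecar picks up and loads. A minimal sketch, assuming the sidecar's default `grafana_dashboard` label and a `monitoring` namespace:

```yaml
# One dashboard, delivered as a ConfigMap (sketch).
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-api-overview
  namespace: monitoring           # assumed; wherever Grafana runs
  labels:
    grafana_dashboard: "1"        # default label the Grafana sidecar watches
data:
  api-overview.json: |
    { "title": "API Overview", "panels": [], "schemaVersion": 39 }
```

In practice a CI job renders these ConfigMaps from the dashboard JSON in the repo, so the PR review happens on the source, not on the cluster object.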
The SLO burn-rate alerting pattern
```yaml
- alert: APIFastBurn
  expr: |
    sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="api"}[5m])) > 0.14
  for: 5m
```
That's the fast-burn arm of the multi-window, multi-burn-rate pattern. Assuming a 99% availability SLO (a 1% error budget), an error ratio above 0.14 is a 14× burn rate, which would exhaust the 28-day budget in two days (28 / 14). Page on that.
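The pattern gets its name from pairing that fast arm with a slower, lower-threshold arm that opens a ticket instead of paging. A sketch, using the same arithmetic (assuming the same 99% SLO, so a 2× burn is an error ratio above 0.02):

```yaml
# Slow burn (sketch): 2x the budget, measured over 6h. At this rate the
# 28-day budget lasts 14 days -- worth a ticket, not a page.
- alert: APISlowBurn
  expr: |
    sum(rate(http_requests_total{job="api",code=~"5.."}[6h]))
      /
    sum(rate(http_requests_total{job="api"}[6h])) > 0.02
  for: 30m
  labels:
    severity: ticket    # assumed routing label; match your Alertmanager config
```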