Prometheus Recording Rules That Earn Their Keep
Recording rules look free. They aren't. Each one adds series to your retention budget (one per output label set) and one more query to every pass of the rule evaluator. Before adding one, I run three checks.
1. Is the expression slow, or just big?
pprof your Prometheus. If the real cost is a sum without(...) over a million series, a recording rule helps; if the cost is that you scrape 20 targets that each emit 100k series, fix the cardinality first.
2. Will multiple dashboards reuse it?
If the expression appears in one Grafana panel, a recording rule is probably over-engineered. If four dashboards and three alerts reference the same expression, pre-compute it.
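As a rough sketch of what reuse looks like, here is an alerting rule that reads a pre-computed series instead of repeating the full expression. It assumes a recorded series named job:http_request_duration_seconds:p99_5m already exists; the group name, alert name, 0.5s threshold, and severity label are all hypothetical and only there for illustration.

```yaml
groups:
  - name: api-slo.alerts            # hypothetical group name
    rules:
      - alert: ApiLatencyP99High    # hypothetical alert name
        # Reads the recorded series; no histogram_quantile(...) repeated here.
        expr: job:http_request_duration_seconds:p99_5m > 0.5   # threshold is an assumption
        for: 10m
        labels:
          severity: page            # assumed label scheme
        annotations:
          summary: "p99 latency for {{ $labels.job }} is above 500ms"
```

A Grafana panel can query the same series name directly, so the alert and the dashboards stay in agreement about what "p99" means.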
3. Is the evaluation window > 15 min?
Recording rules shine on long windows where the raw scrape series are unwieldy. For a 1-minute rate, the raw data is usually the right tool.
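One way to handle a long window, sketched under assumptions: record a short-window rate first, then average that recorded series over the long window, so the evaluator never has to scan hours of raw samples on every pass. The counter http_requests_total, the group name, and the 6h window are assumptions, and averaging a 5m rate over 6h only approximates the true 6h rate.

```yaml
groups:
  - name: api-long-window.rules     # hypothetical group name
    interval: 1m
    rules:
      # Short-window building block computed from the raw counter.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Long-window rollup built on the recorded series, not on raw samples.
      # Rules in a group are evaluated in order, so this one sees the series
      # produced by the rule above.
      - record: job:http_requests:rate5m_avg6h
        expr: avg_over_time(job:http_requests:rate5m[6h])
```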
The naming convention I use
groups:
  - name: api-slo.rules
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
The pattern is {level}:{metric}:{rollup}. The level tells you the aggregation dimensionality (which labels survive), and the rollup tells you the operation and the window (here, p99 over 5m). Do that consistently and your dashboard queries become paragraph-long, not essay-long.
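To make the level part concrete, here is a small sketch of the same metric recorded at three aggregation levels. The counter http_requests_total and its path label are assumptions, as is the group name; the point is only that the prefix changes as labels are aggregated away.

```yaml
groups:
  - name: api-levels.rules          # hypothetical group name
    rules:
      # Level "instance_path": both instance and path labels survive.
      - record: instance_path:http_requests:rate5m
        expr: rate(http_requests_total[5m])
      # Level "path": instance is aggregated away.
      - record: path:http_requests:rate5m
        expr: sum without (instance) (instance_path:http_requests:rate5m)
      # Level "job": only the job label survives.
      - record: job:http_requests:rate5m
        expr: sum by (job) (path:http_requests:rate5m)
```

Reading a name left to right then answers three questions in order: which labels are left, which metric it came from, and how it was rolled up.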