Prometheus Recording Rules That Earn Their Keep

Recording rules look free. They aren't. Each one adds new series to your retention budget (one per output label combination, not one per rule) and a query that runs on every evaluation cycle. Before adding one, I run three checks.

1. Is the expression slow, or just big?

pprof your Prometheus. If the real cost is a sum without(...) over a million series, a recording rule helps; if the cost is that you scrape 20 targets that each emit 100k series, fix the cardinality first.
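One rough way to spot the second case is to watch per-target sample counts. The sketch below is an assumption-heavy example, not a recommended default: the group name, alert name, severity label, and the 15m hold are invented for illustration, and the 100k threshold is borrowed from the paragraph above. The metric scrape_samples_scraped is the one Prometheus itself records for every target on every scrape.

```yaml
groups:
- name: cardinality.alerts              # hypothetical group name
  rules:
  - alert: TargetExposesTooManySeries   # hypothetical alert name
    # scrape_samples_scraped is written by Prometheus for each target on each scrape
    expr: scrape_samples_scraped > 100000   # assumed threshold; tune to your setup
    for: 15m                                # assumed hold time
    labels:
      severity: warning                     # assumed label scheme
    annotations:
      summary: "{{ $labels.job }}/{{ $labels.instance }} exposes more than 100k samples per scrape"
```

If something like this fires, a recording rule layered on top only hides the cost. The series still get scraped and stored.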

2. Will multiple dashboards reuse it?

If the expression appears in one Grafana panel, a recording rule is probably over-engineered. If four dashboards and three alerts reference the same expression, pre-compute it.
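As a sketch of what reuse looks like, an alert can reference a recorded series by name instead of repeating the full expression. This assumes the p99 rule shown in the naming section of this post already exists; the alert name, the 0.5 second threshold, and the severity label are made up for illustration.

```yaml
groups:
- name: api-slo.alerts          # hypothetical group name
  rules:
  - alert: ApiP99LatencyHigh    # hypothetical alert name
    # Reads the pre-computed series; no histogram_quantile is evaluated here
    expr: job:http_request_duration_seconds:p99_5m > 0.5   # assumed threshold, in seconds
    for: 10m                    # assumed hold time
    labels:
      severity: page            # assumed label scheme
    annotations:
      summary: "p99 latency above 500ms for {{ $labels.job }}"
```

Every dashboard panel and alert that uses the recorded name shares one evaluation of the expensive expression per interval.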

3. Is the evaluation window > 15 min?

Recording rules shine on long windows where the raw scrape series are unwieldy. For a 1-minute rate, the raw data is usually the right tool.
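For comparison, here is roughly what a long-window rule could look like, following the same naming pattern used later in this post. The 1h window, the group name, and the 1m interval are assumptions chosen for the example, not values from a real setup.

```yaml
groups:
- name: api-slo-long.rules   # hypothetical group name
  interval: 1m               # assumed; a 1h window rarely needs faster evaluation
  rules:
  # p99 latency per job over a 1-hour window. Each evaluation reads an hour
  # of bucket samples, which is the cost the recording rule pays once
  # instead of every dashboard refresh paying it.
  - record: job:http_request_duration_seconds:p99_1h
    expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[1h])))
```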

The naming convention I use

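groups:
- name: api-slo.rules
  interval: 30s  # evaluate this group every 30 seconds
  rules:
  # p99 request latency per job, computed from 5-minute bucket rates
  - record: job:http_request_duration_seconds:p99_5m
    expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))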

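The pattern is {level}:{metric}:{rollup}. The level lists the labels that survive the aggregation (here, job), the metric is the source metric name, and the rollup names the operation and the window (here, p99 over 5m). Do that consistently and your dashboard queries become paragraph-long, not essay-long.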
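To make the level part concrete, here is a sketch of the same metric recorded at two aggregation levels. The route label is hypothetical; substitute whatever label your histogram actually carries.

```yaml
groups:
- name: api-slo.rules
  interval: 30s
  rules:
  # Level "job": only the job label survives the aggregation
  - record: job:http_request_duration_seconds:p99_5m
    expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
  # Level "job_route": job and the hypothetical route label both survive
  - record: job_route:http_request_duration_seconds:p99_5m
    expr: histogram_quantile(0.99, sum by (job, route, le) (rate(http_request_duration_seconds_bucket[5m])))
```

A dashboard panel can then select the recorded series by name, for example job:http_request_duration_seconds:p99_5m with a job matcher, and the name alone tells you which labels are available to filter on.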