Change default evaluation delay of ruler

Open pstibrany opened this issue 2 years ago • 3 comments

The ruler uses an "evaluation delay" to compute the evaluation timestamp of the queries it executes when evaluating a rule group. With an evaluation delay of, say, 5 minutes, when the ruler executes a rule group it queries data at "now - 5 minutes" and writes back the resulting samples with timestamp "now - 5 minutes". A larger evaluation delay decreases the risk that a rule executes on incomplete data, as there may be data that hasn't yet been pushed to Mimir. On the other hand, it delays the rule output and alert generation.
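The timestamp arithmetic described above can be sketched in a few lines. This is only an illustration, not Mimir's actual code: the function name `evaluation_timestamp` and the example wall-clock time are made up for the sketch.

```python
from datetime import datetime, timedelta, timezone


def evaluation_timestamp(now: datetime, evaluation_delay: timedelta) -> datetime:
    """Timestamp used both to query the data and to stamp the resulting sample.

    Hypothetical helper for illustration; not a Mimir API.
    """
    return now - evaluation_delay


# Assumed wall-clock time at which the rule group runs.
now = datetime(2022, 12, 19, 10, 12, 0, tzinfo=timezone.utc)

# With a 5 minute delay, the query and the written sample both use 10:07:00.
ts = evaluation_timestamp(now, timedelta(minutes=5))
print(ts.isoformat())  # 2022-12-19T10:07:00+00:00
```

With a delay of zero the function returns `now` unchanged, which is the current default behaviour discussed below.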

The default rule evaluation delay in Mimir is 0 (the ruler.evaluation-delay-duration option). Under normal operation Mimir always receives samples "from the past" (normal operation meaning: at time "T" Prometheus or the Agent scrapes the targets, assigns all scraped values timestamp "T", and then forwards those samples to Mimir). Given that, this default value doesn't seem very good -- it means that rules always execute on incomplete data.
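To make the "incomplete data" point concrete, here is a small model of it, under stated assumptions: every sample scraped at time T becomes queryable at T plus a fixed ingestion latency, and all times are plain seconds. The function name `missing_samples` and the numbers (15 second scrape interval, 10 second latency) are hypothetical.

```python
def missing_samples(sample_timestamps, ingest_latency, now, evaluation_delay):
    """Return sample timestamps that fall at or before the evaluation
    timestamp but have not yet arrived when the rule runs at `now`.

    Assumes a fixed ingestion latency for every sample (a simplification).
    """
    eval_ts = now - evaluation_delay
    return [
        t for t in sample_timestamps
        if t <= eval_ts and t + ingest_latency > now
    ]


# Assumed setup: scrapes every 15s, each sample arrives 10s after its timestamp.
scrapes = [0, 15, 30, 45, 60]

# Rule runs at t=62 with no delay: the sample stamped 60 is still in flight.
print(missing_samples(scrapes, ingest_latency=10, now=62, evaluation_delay=0))   # [60]

# Same run with a 60s delay: everything at or before t=2 has arrived.
print(missing_samples(scrapes, ingest_latency=10, now=62, evaluation_delay=60))  # []
```

In this model a rule sees complete data whenever the evaluation delay is at least as large as the ingestion latency, which is why the right value depends on the environment.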

In Grafana Cloud we run with an evaluation delay of 1 minute.

To provide a safe default value, I propose changing the default evaluation delay of the ruler to 1 minute.

We should also make this value more visible in the Jsonnet and Helm configuration, so that Mimir users (operators) are aware of it and can choose their own value based on their environment, i.e. the delay of incoming samples.

pstibrany avatar Dec 19 '22 10:12 pstibrany

I would also recommend setting the default of ruler_evaluation_delay_duration to 1m, which fixed my issue.

wilfriedroset avatar Dec 19 '22 10:12 wilfriedroset

I would also be happy to set a safe default. We use 1m at Grafana Labs.

pracucci avatar Dec 19 '22 11:12 pracucci

Personal view: I'm unhappy with 1 minute. When you include scrape interval and evaluation interval it means any dashboard panels built from recording rules are around 1.5 minutes behind real-time, and I don't like to wait that long to find out what is happening.

The 1 minute offset between panels built from raw data and those showing recording rules is also annoying.
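One way to arrive at a figure of roughly 1.5 minutes, sketched under assumptions that are not stated in the thread: a 30 second scrape interval, a 30 second evaluation interval, and an average wait of half an interval for each. The function name `recording_rule_lag` is hypothetical.

```python
def recording_rule_lag(evaluation_delay, scrape_interval, evaluation_interval):
    """Estimate how far behind real time a recording-rule panel is, in seconds.

    Returns (average, worst_case). The average assumes the newest scrape and
    the newest evaluation are each, on average, half an interval old.
    """
    average = evaluation_delay + scrape_interval / 2 + evaluation_interval / 2
    worst_case = evaluation_delay + scrape_interval + evaluation_interval
    return average, worst_case


# Assumed intervals of 30s each, with the proposed 1 minute delay.
avg, worst = recording_rule_lag(
    evaluation_delay=60, scrape_interval=30, evaluation_interval=30
)
print(avg, worst)  # 90.0 120
```

With a delay of zero and the same assumed intervals, the average drops to 30 seconds, so in this model the delay accounts for most of the lag.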

bboreham avatar Dec 19 '22 11:12 bboreham