Recording rule groups don't have consistent view of data
What did you do?
We run a number of recording rules that are used to calculate SLOs within rule groups. Each rule group performs these calculations:
- Count the number of good events (`good_events`), e.g. `count(my_metric{result="ok"})`
- Count the number of total events (`total_events`), e.g. `count(my_metric)`
- Calculate the ratio (`ratio`), e.g. `good_events / total_events`
The rules are defined in this order within a single group.
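For reference, a minimal sketch of such a rule group, built from the expressions quoted above (the group name is an assumption for illustration; the 30s interval matches the scrape/evaluation interval mentioned below):

```yaml
groups:
  - name: slo-rules        # hypothetical group name
    interval: 30s          # same as the scrape interval, per the report below
    rules:
      # 1. Count of good events
      - record: good_events
        expr: count(my_metric{result="ok"})
      # 2. Count of all events
      - record: total_events
        expr: count(my_metric)
      # 3. Ratio of good to total; expected to equal 1 when healthy
      - record: ratio
        expr: good_events / total_events
```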
What did you expect to see?
As the service was healthy and queries to `my_metric` verified no errors had occurred, we expected that `good_events` would equal `total_events` and the `ratio` would equal 1.
What did you see instead? Under which circumstances?
We frequently saw on one of our two Prometheus HA servers that the ratio was not 100%. Looking at the raw values of the `good_events` and `total_events` rules, I saw that `total_events` often had a value one greater than `good_events`. After digging in a bit more, I think I can see what's happening: the scrape interval for the underlying metric and the rule evaluation schedule are almost identical. This means that when a rule group evaluates, part of the rule group can see "more" samples if a scrape completes midway through the evaluation.
I can see that in the rules manager we pass a timestamp that should prevent later samples from being considered, but this doesn't seem to stop the issue from occurring.
The graph above shows the result of running:

```
timestamp(<problematic_metric_name>) - (prometheus_rule_group_last_evaluation_timestamp_seconds{rule_group="<problematic_rule_group>"})
```

When the value was 15s, we saw no issue. When the value was 30s (the value of both our scrape interval and the rule group evaluation frequency), the issue occurred. To me this demonstrates that the rule group evaluation and the scrape are happening simultaneously.
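For context, the coinciding schedules described above correspond to a configuration like the following (a hedged sketch; only the 30s values come from the report, the rest is assumed):

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 30s       # the underlying metric is scraped every 30s
  evaluation_interval: 30s   # rule groups also evaluate every 30s, so a
                             # scrape can land midway through an evaluation
```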
System information
Linux 5.11.0-1022-aws x86_64
Prometheus version
```
prometheus, version 2.29.1 (branch: HEAD, revision: dcb07e8eac34b5ea37cd229545000b857f1c1637)
build user: root@364730518a4e
build date: 20210811-14:48:27
go version: go1.16.7
platform: linux/amd64
```
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
No response
I believe the current implementation aims for consistent results within a single query, but not within a rule group. All targets scheduled to scrape at the same time get the same timestamp, so that alone isn't enough. There is an isolation mechanism that excludes all changes made to the TSDB after a query starts.
Extending that mechanism to rule groups seems like a plausible enhancement.
Hello, it would be a huge burden to keep the same querier across different rules, especially since later rules have to be able to query samples produced by the previous rules, which is something a single querier would not be able to do.
One solution is possibly https://github.com/prometheus/prometheus/issues/11807
Thanks for your input, guys. So @roidelapluie, it seems that using `offset 5s` (or some similar value) on recording rules that reference scraped metrics would be a potential fix.
Yes, you could try that.
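For illustration, a hedged sketch of that workaround applied to the rules from the report (group name and interval are assumptions; only the `offset 5s` modifier comes from the suggestion above):

```yaml
groups:
  - name: slo-rules
    interval: 30s
    rules:
      # offset 5s makes the rules read slightly stale data, so a scrape
      # finishing midway through the evaluation cannot be seen by only
      # part of the group
      - record: good_events
        expr: count(my_metric{result="ok"} offset 5s)
      - record: total_events
        expr: count(my_metric offset 5s)
      # ratio references the recorded series above, not a scraped metric,
      # so no offset is needed here
      - record: ratio
        expr: good_events / total_events
```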