Recording rule groups don't have consistent view of data
What did you do?
We run a number of recording rules that are used to calculate SLOs within rule groups. Each rule group performs these calculations:
- Count the number of good events (`good_events`), e.g. `count(my_metric{result="ok"})`
- Count the number of total events (`total_events`), e.g. `count(my_metric)`
- Calculate the ratio (`ratio`), e.g. `good_events / total_events`
The rules are defined in this order within a single group.
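For reference, a minimal sketch of such a rule group, built from the expressions quoted above (the group name is an assumption for illustration; the 30s interval matches the scrape/evaluation interval mentioned below):

```yaml
groups:
  - name: slo-rules        # hypothetical group name
    interval: 30s          # same as the scrape interval, per the report below
    rules:
      # 1. Count of good events
      - record: good_events
        expr: count(my_metric{result="ok"})
      # 2. Count of all events
      - record: total_events
        expr: count(my_metric)
      # 3. Ratio of good to total; expected to equal 1 when healthy
      - record: ratio
        expr: good_events / total_events
```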
What did you expect to see?
As the service was healthy and queries to `my_metric` verified no errors had occurred, we expected that `good_events` would equal `total_events` and the `ratio` would equal 1.
What did you see instead? Under which circumstances?
We frequently saw on one of our two Prometheus HA servers that the ratio was not 100%. Looking at the raw values of the `good_events` and `total_events` rules, I saw that `total_events` often had a value one greater than `good_events`. After digging in a bit more, I think I can see what's happening: the scrape interval for the underlying metric and the rule evaluation schedule are almost identical. This means that when a rule group evaluates, part of the rule group can see "more" samples if a scrape completes midway through the evaluation.
I can see that in the rules manager we pass a timestamp that should prevent later samples from being considered, but this doesn't seem to stop the issue from occurring.
The graph above shows the result of running:

```
timestamp(<problematic_metric_name>) - (prometheus_rule_group_last_evaluation_timestamp_seconds{rule_group="<problematic_rule_group>"})
```

When the value was 15s, we saw no issue. When the value was 30s (the value of both our scrape interval and the rule group evaluation frequency), the issue occurred. To me this demonstrates that the rule group evaluation and the scrape are happening simultaneously.
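For context, the coinciding schedules described above correspond to a configuration like the following (a hedged sketch; only the 30s values come from the report, the rest is assumed):

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 30s       # the underlying metric is scraped every 30s
  evaluation_interval: 30s   # rule groups also evaluate every 30s, so a
                             # scrape can land midway through an evaluation
```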
System information
Linux 5.11.0-1022-aws x86_64
Prometheus version
```
prometheus, version 2.29.1 (branch: HEAD, revision: dcb07e8eac34b5ea37cd229545000b857f1c1637)
build user: root@364730518a4e
build date: 20210811-14:48:27
go version: go1.16.7
platform: linux/amd64
```
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
No response
I believe the current implementation aims for consistent results within a single query, but not within a rule group. All targets scheduled to scrape at the same time get the same timestamp, so that alone isn't enough. There is an isolation mechanism that excludes all changes made to the TSDB after a query starts.
Extending that mechanism to rule groups seems like a plausible enhancement.
Hello, it would be a huge burden to keep the same querier across different rules, especially since later rules have to be able to query samples produced by the previous rules, which is something a single querier would not be able to do.
One solution is possibly https://github.com/prometheus/prometheus/issues/11807
Thanks for your input, guys. So @roidelapluie, it seems that using `offset 5s` (or some similar value) on recording rules that reference scraped metrics would be a potential fix.
Yes, you could try that.
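For illustration, a hedged sketch of that workaround applied to the rules from the report (group name and interval are assumptions; only the `offset 5s` modifier comes from the suggestion above):

```yaml
groups:
  - name: slo-rules
    interval: 30s
    rules:
      # offset 5s makes the rules read slightly stale data, so a scrape
      # finishing midway through the evaluation cannot be seen by only
      # part of the group
      - record: good_events
        expr: count(my_metric{result="ok"} offset 5s)
      - record: total_events
        expr: count(my_metric offset 5s)
      # ratio references the recorded series above, not a scraped metric,
      # so no offset is needed here
      - record: ratio
        expr: good_events / total_events
```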