prometheus icon indicating copy to clipboard operation
prometheus copied to clipboard

Feature: Allow configuration of a rule evaluation delay

Open gotjosh opened this issue 1 year ago • 7 comments

Now that Prometheus is allowed to be a remote write target, we think this is highly useful.

This basically enables Prometheus to do two things:

  1. Allow setting a server-wide rule evaluation delay.
  2. Allow the overriding of such a setting at a rule group level.

The main motivation of this work is to allow remote write receivers (in this particular case Prometheus itself) to delay alert rule evaluation by a certain period to take into account any sort of remote write delays.

In the early days of Mimir, our customers observed plenty of gaps in recording rules and flapping alerts due to remote write delays. Network anomalies are a factor of when and not if and this change help mitigate most of that flapping. In Mimir, we run it with a default of 1 minute but for Prometheus I'm proposing a default of 0 to avoid any sort of breaking change and leave it mostly as a tool for operators to decide how do they want to balance out time of evaluation for recording rules and alerts vs tolerance for remote write delays.

For context, this has been running in Mimir for years.

** Note for Reviewers **

I've tried adhere as much as possible to the original author's code where it made sense -- however, there was a huge refactor on the rules package since this code was introduce so a lot of merge I had to manually port it to the right place so please take a look carefully.

gotjosh avatar May 07 '24 16:05 gotjosh

Should we update the documentation here and here, to mention the new field in the rules config file?

I have updated the docs in here as this is where the have the reference to the file format.

gotjosh avatar May 09 '24 10:05 gotjosh

@juliusv @roidelapluie @bwplotka @beorn7 @ArthurSens any thoughts?

gotjosh avatar May 09 '24 10:05 gotjosh

rule evaluation delay to me sounds like the rule query won't be executed immediately but rather after a while. But this change, I think, is to use a constant offset when querying metrics during rule evaluation. How about rule evaluation offset ?

prymitive avatar May 10 '24 12:05 prymitive

The closest analogy would be the offset modifier we have for selectors and subqueries in PromQL. Maybe the feature should be called "rule query offset" or something like that?

I don't have a strong opinion on name and I'm happy to go whichever direction you think it's right - Just want to say that a rename would be a big source of pain for us as we'll have to deprecate flags and change the config file format but that's not a Prometheus problem.

Just want to make sure the whatever name we end up picking you feel strongly about it - and it's not one of those "it would be nice if".

gotjosh avatar May 16 '24 09:05 gotjosh

I am on Julius' side, while Reading the PR I was thinking it was slipping evals for a certain time, not evaluating in the past.

roidelapluie avatar May 16 '24 10:05 roidelapluie

I don't have a strong opinion on name and I'm happy to go whichever direction you think it's right - Just want to say that a rename would be a big source of pain for us as we'll have to deprecate flags and change the config file format but that's not a Prometheus problem.

Yeah, I understand an am sorry for the pain, but I also believe it's worth to make a user-facing option as clear as possible. Even users who don't need it will have to read it in the docs and try to understand what it does and whether they need it.

I think even "offset" alone isn't fully clear, as that could just mean an offset in when rule evaluations are triggered, without changing the actual timestamp of queries into the past. So that's why I'd suggest rule_query_offset / ruleQueryOffset. This can just be queryOffset within the rules package and query_offset for the per-group option, as the rules context is clear there.

I can see a world where you might want to control the flag at a server level

Not sure I follow 100% - whether it's a flag or a global config option, both would control the option at a server level. A config option would just be more consistent in terms of how our existing configuration overrides work: I don't think we have a flag that can be overridden by any config option anywhere, but we have global config options that can be overridden in more specific places (e.g. rule evaluation intervals). That's just why it'd feel more natural to me to have the rule query offset global default setting in the config file as well (vs. as a flag). The other implication of putting something into the config vs. flag is that the setting has to be reloadable during runtime, which I hope is doable for the global rule query offset.

perhaps what I should introduce is a guard to make sure the config option cannot be lower than the server flag which I think it's what makes sense in my head.

I could imagine both configuring lower and higher offsets for specific groups, depending on whether most of your data comes from a regular source or from a delayed one (then you'd choose the default offset to be according to that and either configure lower or higher offsets for specific exceptional groups that use data from a different source).

juliusv avatar May 16 '24 10:05 juliusv

Yeah, I understand an am sorry for the pain, but I also believe it's worth to make a user-facing option as clear as possible. Even users who don't need it will have to read it in the docs and try to understand what it does and whether they need it.

Sounds good to me!

I think even "offset" alone isn't fully clear, as that could just mean an offset in when rule evaluations are triggered, without changing the actual timestamp of queries into the past. So that's why I'd suggest rule_query_offset / ruleQueryOffset. This can just be queryOffset within the rules package and query_offset for the per-group option, as the rules context is clear there.

This makes sense to me, of all the options, I do like rule query offset / query offset the most. I'll make this change.

Not sure I follow 100% - whether it's a flag or a global config option, both would control the option at a server level. A config option would just be more consistent in terms of how our existing configuration overrides work: I don't think we have a flag that can be overridden by any config option anywhere, but we have global config options that can be overridden in more specific places (e.g. rule evaluation intervals). That's just why it'd feel more natural to me to have the rule query offset global default setting in the config file as well (vs. as a flag). The other implication of putting something into the config vs. flag is that the setting has to be reloadable during runtime, which I hope is doable for the global rule query offset.

🤦 Sorry I misunderstood your argument entirely - yeah, I think this change makes perfect sense. The pattern of global config can be overrideable via a more specific configuration option of a component is very common in Prometheus world.

I could imagine both configuring lower and higher offsets for specific groups, depending on whether most of your data comes from a regular source or from a delayed one (then you'd choose the default offset to be according to that and either configure lower or higher offsets for specific exceptional groups that use data from a different source).

Sounds good - I'll ship it without the guard and we can evaluate it as i evolves.

gotjosh avatar May 16 '24 11:05 gotjosh

@juliusv @roidelapluie I think I have addressed all your concerns now. Let me know if you have any other thoughts.

You can look at the last 4 commits if you'd like to know what changes since you last (the force push was a result of some tests that changed in main for manager_test.go)

gotjosh avatar May 20 '24 15:05 gotjosh

@juliusv @roidelapluie @machine424 I have more work I'd like to do around the rules engine so if possible, I'd love to move forward. I'll take your silence as consensus and attempt to merge it tomorrow unless I hear otherwise.

gotjosh avatar May 28 '24 15:05 gotjosh

Btw. :+1: besides my last comment/question about the pointer.

juliusv avatar May 29 '24 10:05 juliusv

I'm going to proceed with the merge. If there's any additional feedback, please let me know so I can follow up immediately.

gotjosh avatar May 30 '24 10:05 gotjosh