prometheus-engine icon indicating copy to clipboard operation
prometheus-engine copied to clipboard

RulesGroup doesn't follow Prometheus spec

Open optix2000 opened this issue 1 year ago • 6 comments

According to the docs:

RuleGroup declares rules in the Prometheus format: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/

However in Prometheus rule_group's, interval is optional. For the RuleGroup, interval is required, which is different from what's expected.

ClusterRules.monitoring.googleapis.com "foo" is invalid: [spec.groups[0].interval: Required value]

optix2000 avatar Apr 11 '24 20:04 optix2000

This means we cannot copy rules 1:1 into the ClusterRules CRD.

optix2000 avatar Apr 11 '24 20:04 optix2000

Hi, thanks for reporting!

This is intentional as there is no global interval setting for rule evaluations (aka global.evaluation_interval), because that option only exists in Prometheus. In GMP it's a separate component that evaluates rules.

This means we cannot copy rules 1:1 into the ClusterRules CRD.

That's true, this means you have to manually/via sed/script add interval explicitly.

Just to understand the consequences of this decision -- this is one-off annoyance though, right? Or is it causing more problems for you?

bwplotka avatar Apr 12 '24 17:04 bwplotka

This is mostly a one-off annoyance, but causes confusion as we cannot point people directly to the Prometheus docs.

Given all the components are managed by GCP, it feels like it should just use the value of global.evaluation_interval configured in alertmanager, or the default of 1m.

optix2000 avatar Apr 12 '24 18:04 optix2000

Got it, thanks.

I assume you can point people directly to the Prometheus docs, but you have to mention the nuance to specify that one field explicitly, right?

The point is that evaluation interval is also useful to specify explicit. It matters, depending on the rule. Do you think a global ruleEvaluator flag (e.g. in OperatorConfig) would be good enough here?

global.evaluation_interval configured in alertmanager, or the default of 1m.

There is no such setting in AM, probably you mean Prometheus - which is not relevant in our stack as we don't use Prometheus for rule evals?

Default to 1m is an option, yes, maybe not too bad.

bwplotka avatar Apr 17 '24 15:04 bwplotka

I assume you can point people directly to the Prometheus docs, but you have to mention the nuance to specify that one field explicitly, right?

That's exactly what we do, but not everyone reads the nuance, especially if they have used prometheus/alertmanager before.

There is no such setting in AM, probably you mean Prometheus - which is not relevant in our stack as we don't use Prometheus for rule evals?

Ah got it.

Default to 1m is an option, yes, maybe not too bad.

This would be my preference, as it's the Prom default

optix2000 avatar Apr 23 '24 22:04 optix2000

Thanks. Moving this to prioritization queue (OSS help wanted too).

Acceptance Criteria

  • Add omitempty with 1m default https://github.com/GoogleCloudPlatform/prometheus-engine/blob/2c7b2fd2c9e27648efc6d1024dfcbaca73709bfb/pkg/operator/apis/monitoring/v1/rules_types.go#L118

bwplotka avatar Apr 25 '24 11:04 bwplotka

Fixed in #1375. Interval is now 1m by default.

bernot-dev avatar Mar 06 '25 20:03 bernot-dev