Minimize missed rule group evaluations

Open rajagopalanand opened this issue 1 year ago • 1 comments

What this PR does:

Currently, once a Ruler instance loads a rule group, it evaluates it continuously. If the instance evaluating the rule group becomes unavailable, there is a high chance of missed evaluations until the instance becomes available again or another instance loads and evaluates the rule group. Ruler instances can become unavailable for a variety of reasons, including bad underlying nodes and OOM kills. The problem is exacerbated when a Ruler instance appears healthy in the cluster ring but is actually in an unhealthy state.

This PR addresses the problem by introducing a check to ensure that the primary Ruler is alive and in a running state. Here’s how it works:

  1. Liveness Check: Non-primary Rulers will perform a liveness check on the primary Ruler for each rule group when syncing rule groups from external storage.
  2. Fallback Mechanism: If the primary Ruler is unresponsive or not in a running state, the non-primary Ruler will assume ownership of the rule group and take over its evaluation.
  3. Relinquish Ownership: If the primary Ruler is alive and running and a non-primary Ruler currently owns the rule group, the non-primary Ruler relinquishes ownership by unloading the rule group from the Prometheus rule manager.

With this change, the maximum duration of missed evaluations will be limited to the sync interval of the rule groups, reducing the impact of primary Ruler unavailability.

Checklist

  • [x] Tests updated
  • [x] Documentation added
  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

rajagopalanand avatar Jul 30 '24 18:07 rajagopalanand

We are in the process of releasing 1.18. Please rebase the PR and move the changelog entry to master.

danielblando avatar Aug 16 '24 18:08 danielblando

Approved... other than changing the ruler timeout config to be under the ruler_client section, it LGTM

alanprot avatar Aug 30 '24 22:08 alanprot

> Approved... other than changing the ruler timeout config to be under the ruler_client section, it LGTM

Pushed up a new commit. Please take a look

rajagopalanand avatar Aug 31 '24 17:08 rajagopalanand

Note that https://github.com/cortexproject/cortex/pull/5862 needs to be merged before this PR. Without enough approvals on the proposal itself, I don't think we can merge the actual implementation.

yeya24 avatar Sep 03 '24 06:09 yeya24