prometheus icon indicating copy to clipboard operation
prometheus copied to clipboard

Rules: Persist alerts 'keep_firing_for' state using file storage

Open mustafain117 opened this issue 1 year ago • 11 comments

This PR resolves issue #13957

This is a second attempt using a file based storage to add persistence for alerts 'keep_firing_for' state which is currently lost whenever Prometheus restarts. Feedback on previous attempt was to use native storage similar to ALERTS_FOR_STATE.

Justification for not using TSDB:

  • In case of alerts using keep_firing_for, when Prometheus restarts the alerts are no longer active since the expression was resolved prior to restart which means we lose the alert labels, annotations and value. To restore the alert we need more than just the KeepFiringSince timestamp.
    • For each alert, need to store 2 label sets (alert labels & alert annotations), and 3 floating point values (ActiveAt and KeepFiringSince time, Value of alert expression)
  • Since labels and annotations are dependent on user config, it can lead to high label cardinality samples.
  • Inefficient to store multiple samples per alert when we only need to maintain latest alert state. Using an object/key-value store is more appropriate for this use case. There is no need to retain previous samples/data.
  • Additional cost of samples to Prometheus users

Implementation details:

  • Feature is behind a flag (alert-state-persistence), storage path can be set using rules.alert.state-storage-path
  • Added a hook to store alerts, invoked at the end of each group evaluation. State is restored from storage during the first group evaluation.
  • Store uses an in memory map to optimize reads, map is populated when store is initialized. I.e we only read the file once initially.
  • When storing alerts, in memory map is updated before flushing state to file. Alerts are stored after each group evaluation given for each rule where keep_firing_for is set.

Performance with ~1000 rule groups: each spike represents a server restart: image

mustafain117 avatar Jul 16 '24 06:07 mustafain117

I'm not entirely convinced by your argument this cannot be done via TSDB, but I had some comments and questions on this implementation.

Updated the current implementation based on comments.

This can be done via TSDB but it is not ideal (specially for downstream consumers like Cortex). Please see the updated justification to not use TSDB in the description above.

mustafain117 avatar Jul 30 '24 18:07 mustafain117

Thank you so much for the review. I have added some more details on the TSDB solution I explored in this google doc.

Hope it clarifies my earlier justification, please let me know if you have you any concerns/questions.

mustafain117 avatar Aug 13 '24 16:08 mustafain117

The CI complains about the missing license text in ./rules/store.go . (That's just a technicality and easy to fix. I hope @bboreham will get back to the actual code level review soon.)

beorn7 avatar Sep 04 '24 11:09 beorn7

I asked if my colleagues who work on alerts would take a look.

bboreham avatar Sep 04 '24 12:09 bboreham

Thank you for the feedback, I will resolve as soon as possible.

High-level, I'm pondering that would have alert state persisted in three places - ALERTS series, ALERTS_FOR_STATE series, and a JSON file. Could the last two be combined?

Yes, ALERTS_FOR_STATE can be combined into the JSON file if we store all alerts in SetAlerts(It currently only stores alerts for rules that use keep_firing_for).

While alert state would be persisted in three places, each has a different use case.

mustafain117 avatar Sep 10 '24 15:09 mustafain117

Hello from the bug-scrub!

I think there are still some comments outstanding from the last review, plus some merge conflicts. Still seems like a valid enhancement.

bboreham avatar Dec 10 '24 12:12 bboreham

Hello! I’m working on addressing the outstanding comments and resolving the merge conflicts. Thanks for your attention to this!

mustafain117 avatar Dec 10 '24 18:12 mustafain117

PR is ready for review again cc: @bboreham

mustafain117 avatar Apr 25 '25 18:04 mustafain117

Heya @bboreham hope you're well! Trying to figure out what's blocking this? Is there anything on our end we need to do?

mhausenblas avatar Jun 20 '25 09:06 mhausenblas

Hi there, sorry I missed those pings.

bboreham avatar Jul 08 '25 11:07 bboreham

Hello from the bug scrub!

@bboreham please take a look.

krajorama avatar Dec 02 '25 11:12 krajorama