Rules: Persist alerts 'keep_firing_for' state using file storage
This PR resolves issue #13957
This is a second attempt using a file based storage to add persistence for alerts 'keep_firing_for' state which is currently lost whenever Prometheus restarts. Feedback on previous attempt was to use native storage similar to ALERTS_FOR_STATE.
Justification for not using TSDB:
- In case of alerts using keep_firing_for, when Prometheus restarts the alerts are no longer active since the expression was resolved prior to restart which means we lose the alert labels, annotations and value. To restore the alert we need more than just the
KeepFiringSincetimestamp.- For each alert, need to store 2 label sets (alert labels & alert annotations), and 3 floating point values (ActiveAt and KeepFiringSince time, Value of alert expression)
- Since labels and annotations are dependent on user config, it can lead to high label cardinality samples.
- Inefficient to store multiple samples per alert when we only need to maintain latest alert state. Using an object/key-value store is more appropriate for this use case. There is no need to retain previous samples/data.
- Additional cost of samples to Prometheus users
Implementation details:
- Feature is behind a flag (
alert-state-persistence), storage path can be set usingrules.alert.state-storage-path - Added a hook to store alerts, invoked at the end of each group evaluation. State is restored from storage during the first group evaluation.
- Store uses an in memory map to optimize reads, map is populated when store is initialized. I.e we only read the file once initially.
- When storing alerts, in memory map is updated before flushing state to file. Alerts are stored after each group evaluation given for each rule where
keep_firing_foris set.
Performance with ~1000 rule groups:
each spike represents a server restart:
I'm not entirely convinced by your argument this cannot be done via TSDB, but I had some comments and questions on this implementation.
Updated the current implementation based on comments.
This can be done via TSDB but it is not ideal (specially for downstream consumers like Cortex). Please see the updated justification to not use TSDB in the description above.
Thank you so much for the review. I have added some more details on the TSDB solution I explored in this google doc.
Hope it clarifies my earlier justification, please let me know if you have you any concerns/questions.
The CI complains about the missing license text in ./rules/store.go . (That's just a technicality and easy to fix. I hope @bboreham will get back to the actual code level review soon.)
I asked if my colleagues who work on alerts would take a look.
Thank you for the feedback, I will resolve as soon as possible.
High-level, I'm pondering that would have alert state persisted in three places - ALERTS series, ALERTS_FOR_STATE series, and a JSON file. Could the last two be combined?
Yes, ALERTS_FOR_STATE can be combined into the JSON file if we store all alerts in SetAlerts(It currently only stores alerts for rules that use keep_firing_for).
While alert state would be persisted in three places, each has a different use case.
Hello from the bug-scrub!
I think there are still some comments outstanding from the last review, plus some merge conflicts. Still seems like a valid enhancement.
Hello! I’m working on addressing the outstanding comments and resolving the merge conflicts. Thanks for your attention to this!
PR is ready for review again cc: @bboreham
Heya @bboreham hope you're well! Trying to figure out what's blocking this? Is there anything on our end we need to do?
Hi there, sorry I missed those pings.
Hello from the bug scrub!
@bboreham please take a look.