alertmanager
Split retention in 2 parts or multiple nflogs
We have a default of 120h retention.
While this default seems fine for silences, it seems far too high for the nflog. Ideally, nflog entries should be kept only for ~110% of x = max(group_wait, group_interval, repeat_interval). With a large number of alerts and a low x, Alertmanager uses a lot of memory unnecessarily, because the state is broadcast perpetually.
Here is a heap profile of such a case: https://share.polarsignals.com/73d955e/
I see multiple ways forward:
- Multiple nflogs, with exponential durations (e.g. 1m, 1h, 12h, 10h), based on the duration the notification should stay available in the log.
- Allow setting an nflog-specific data retention policy.
Maybe we could also write a new nflog with the duplicates removed, as part of some garbage collection process.