
Split retention in 2 parts or multiple nflogs

Open roidelapluie opened this issue 3 years ago • 1 comments

We have a default of 120h retention.

While this default seems fine for the silences, it seems far too high for the nflog. Ideally, an nflog entry should be kept only for ~110% of x = max(group_wait, group_interval, repeat_interval). With a large number of alerts and a low x, alertmanager unnecessarily uses a lot of memory, because the state is broadcast perpetually.
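To make the arithmetic concrete, here is a minimal sketch of that rule. The function name `nflog_retention`, the `margin` parameter, and the interval values are illustrative assumptions; this is not code from alertmanager.

```python
from datetime import timedelta


def nflog_retention(group_wait, group_interval, repeat_interval, margin=1.1):
    """Hypothetical helper: retention = margin * max of the three intervals."""
    x = max(group_wait, group_interval, repeat_interval)
    return x * margin


# Example values only (assumed for illustration).
retention = nflog_retention(
    group_wait=timedelta(seconds=30),
    group_interval=timedelta(minutes=5),
    repeat_interval=timedelta(hours=4),
)
print(retention)  # 4:24:00
```

With these example intervals the nflog would need roughly 4.4h of retention, compared with the 120h default.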

Here is a heap profile of such a case: https://share.polarsignals.com/73d955e/

I see multiple ways forward:

  • Multiple nflogs, with exponential durations (e.g. 1m, 1h, 12h, 10h), based on the duration the notification should stay available in the log.
  • Allow setting an nflog-specific data retention policy.
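As a rough sketch of how the first option could route entries, each notification could go to the shortest-lived log that still covers the time it needs to stay available. The bucket durations and the `pick_bucket` name below are assumptions made for illustration, not an existing alertmanager API or the exact durations proposed above.

```python
from datetime import timedelta

# Hypothetical bucket retentions, shortest to longest (assumed values).
BUCKETS = [
    timedelta(minutes=1),
    timedelta(hours=1),
    timedelta(hours=12),
    timedelta(hours=120),
]


def pick_bucket(required, buckets=BUCKETS):
    """Return the smallest bucket retention that is >= required.

    Falls back to the largest bucket when nothing is long enough.
    """
    for bucket in sorted(buckets):
        if bucket >= required:
            return bucket
    return max(buckets)


print(pick_bucket(timedelta(minutes=30)))  # 1:00:00
```

The second option is simpler: a single nflog whose retention is configured separately from the silences.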

roidelapluie, Jun 23 '22 08:06

Maybe we could also write a new nflog with the duplicates removed, as part of some garbage collection process.

roidelapluie, Jun 23 '22 09:06