Have a way to mute alert until it's resolved to receive a resolved notification once it's fixed

Open freak12techno opened this issue 1 year ago • 12 comments

Let's say one of the servers I'm monitoring has an outage and is inaccessible, but I don't know how long it's going to take to fix, so I mute it for a really long time.

With this approach, I won't receive any resolved notifications, so to check if the alert is fixed I need to go to my alerts list to see if it's still firing, and given that I've muted it for a long time I also need to remove the mute to know if it's firing again.

What would be nice to have:

  • I have a server outage
  • an alert is triggered
  • I create a new mute and somehow specify that I need it to be active until the alert is resolved
  • a server fixes itself
  • a resolved alert notification is dispatched
  • a mute is removed
  • if a server starts misbehaving again and a new alert is triggered, I'm receiving an alert notification again

Pretty sure this would have a lot of cases that'll make it difficult, like if a mute has a lot of active alerts, but still would be really awesome to have.

Do you guys think it's manageable?

freak12techno avatar Apr 30 '24 15:04 freak12techno

Hi! :wave: It sounds to me like you want a silence to expire when it is no longer silencing any active alerts.

I think there are a couple of problems that we would need to solve to add such a feature. For example:

  1. Alertmanager does not persist alerts to disk, so if an Alertmanager is restarted all of its alerts will be lost; and because of this all of its silences will also be expired. This is undesirable because the alert might not have actually resolved. Prometheus will resend all alerts to Alertmanager after the resend delay and you may receive duplicate notifications.
  2. If Alertmanager is run in HA (high availability) and one Alertmanager becomes partitioned, then its alerts will resolve because Prometheus will be unable to communicate with that Alertmanager. The partitioned Alertmanager will expire its silences as it has no more active alerts. When the partition recovers, the Alertmanager that expired its silences will gossip these expirations to the other Alertmanagers, expiring them on those Alertmanagers too.
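To make both problems concrete, here is a minimal sketch of the "expire when nothing is silenced any more" rule. The `Alert` and `Silence` types and the `UntilResolved` flag are hypothetical and heavily simplified (exact-match labels only); they are not Alertmanager's real data structures.

```go
package main

import "fmt"

// Alert and Silence are hypothetical, simplified types for illustration;
// they are not Alertmanager's real data structures.
type Alert struct {
	Labels map[string]string
	Firing bool
}

type Silence struct {
	Matchers      map[string]string // exact-match label matchers only
	UntilResolved bool              // hypothetical flag proposed in this issue
}

// matches reports whether every matcher equals the alert's label value.
func matches(s Silence, a Alert) bool {
	for name, value := range s.Matchers {
		if a.Labels[name] != value {
			return false
		}
	}
	return true
}

// shouldExpire reports whether an "until resolved" silence no longer
// covers any firing alert and could therefore be removed.
func shouldExpire(s Silence, alerts []Alert) bool {
	if !s.UntilResolved {
		return false
	}
	for _, a := range alerts {
		if a.Firing && matches(s, a) {
			return false
		}
	}
	return true
}

func main() {
	s := Silence{Matchers: map[string]string{"instance": "server-1"}, UntilResolved: true}
	firing := []Alert{{Labels: map[string]string{"instance": "server-1"}, Firing: true}}

	fmt.Println(shouldExpire(s, firing)) // false: the alert is still firing
	// After a restart or a partition the node knows about no alerts at all,
	// so the same rule wrongly concludes that the silence can go.
	fmt.Println(shouldExpire(s, nil)) // true
}
```

The second call is the failure mode described in both points: an empty alert set is indistinguishable from "everything resolved" unless the alerts themselves are durable.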

grobinson-grafana avatar May 04 '24 14:05 grobinson-grafana

@grobinson-grafana seems so.

For issues that you outlined:

  1. Can this be solved by saving alerts to disk every time Alertmanager is receiving one and loading it from disk if it's present, or do you think there are other caveats with this approach?
  2. I don't know a lot about Alertmanager HA internals, but let's say there is a cluster of 3 nodes and one is partitioned and loses the mute because all of its alerts are resolved. Once it comes back, won't the other 2 nodes disagree, and won't the consensus be that this mute isn't in fact removed? (And if 2/3 nodes are partitioned, pretty sure expiring mutes aren't gonna be the biggest problem here lol)

freak12techno avatar May 04 '24 14:05 freak12techno

  1. Yes that's right! The problem is that Alertmanager is stateless, so some kind of embedded database will need to be evaluated and then all the code will need to be written to use it.
  2. Alertmanager doesn't use consensus for gossiping silences; it's a case of last write wins. Since the expiration was the most recent event, the other Alertmanagers will believe it to be the correct one.
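As a rough illustration of last write wins, the sketch below merges two copies of the same silence by keeping whichever was updated most recently. `SilenceState`, `mergeLWW`, and `exampleMerge` are hypothetical names for this example, and the real gossip and merge code in Alertmanager is more involved.

```go
package main

import (
	"fmt"
	"time"
)

// SilenceState is a hypothetical, simplified view of one silence as
// seen by one cluster member.
type SilenceState struct {
	ID        string
	Expired   bool
	UpdatedAt time.Time
}

// mergeLWW keeps whichever copy was written last. There is no vote:
// a single node with a newer timestamp overrides everyone else.
func mergeLWW(local, remote SilenceState) SilenceState {
	if remote.UpdatedAt.After(local.UpdatedAt) {
		return remote
	}
	return local
}

// exampleMerge replays the partition scenario: a healthy node still has
// the silence active, the partitioned node expired it ten minutes later.
func exampleMerge() SilenceState {
	created := time.Date(2024, 5, 4, 12, 0, 0, 0, time.UTC)
	healthy := SilenceState{ID: "s1", Expired: false, UpdatedAt: created}
	partitioned := SilenceState{ID: "s1", Expired: true, UpdatedAt: created.Add(10 * time.Minute)}
	return mergeLWW(healthy, partitioned)
}

func main() {
	merged := exampleMerge()
	fmt.Println(merged.Expired) // true: the later expiration wins
}
```

Two healthy nodes agreeing with each other makes no difference here, because the merge only compares timestamps.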

grobinson-grafana avatar May 04 '24 15:05 grobinson-grafana

@grobinson-grafana

  1. From what I'm seeing, it shouldn't be difficult:
  • when creating/editing/deleting an alert, just dump whatever alerts there are to disk
  • when starting, load the alerts from that state if it's present
  • afaik it's not a proper database, but basically an alert snapshot, so there's no need to sync between the file and Alertmanager in cases other than the two above
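The snapshot idea in the bullets above could look roughly like this. It is only a sketch: the `Alert` type, the JSON encoding, the file name, and the `saveSnapshot`/`loadSnapshot`/`roundTrip` helpers are all assumptions made for illustration, not anything Alertmanager does today.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// Alert is a hypothetical, simplified alert for illustration only.
type Alert struct {
	Labels map[string]string `json:"labels"`
	Firing bool              `json:"firing"`
}

// saveSnapshot dumps all alerts to path. It writes a temporary file and
// renames it so a crash mid-write does not leave a truncated snapshot.
func saveSnapshot(path string, alerts []Alert) error {
	data, err := json.Marshal(alerts)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// loadSnapshot restores alerts at startup. A missing file is not an
// error: it simply means there is no previous state to load.
func loadSnapshot(path string) ([]Alert, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	var alerts []Alert
	if err := json.Unmarshal(data, &alerts); err != nil {
		return nil, err
	}
	return alerts, nil
}

// roundTrip saves alerts into a temporary directory and loads them back,
// standing in for "shut down, then start again".
func roundTrip(alerts []Alert) ([]Alert, error) {
	dir, err := os.MkdirTemp("", "alerts-snapshot")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "alerts.json")
	if err := saveSnapshot(path, alerts); err != nil {
		return nil, err
	}
	return loadSnapshot(path)
}

func main() {
	restored, err := roundTrip([]Alert{
		{Labels: map[string]string{"alertname": "InstanceDown"}, Firing: true},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(restored), restored[0].Firing)
}
```

Note that every call to `saveSnapshot` rewrites the full set of alerts, which matters once the number of alerts and updates gets large.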

Do you think that introduces new troubles?

  2. Did the team ever consider using a consensus model? I wonder whether it has any payoffs other than being a requirement for the feature I'm proposing, and whether it adds more problems.

freak12techno avatar May 04 '24 16:05 freak12techno

  1. Do you think you have time to work on this? I think the best place to start is to evaluate some of the embedded k/v stores such as bbolt to see which would be the most appropriate.
  2. Yes, but the current Alertmanager design is that alerts should continue to work even if all but one Alertmanager is down. If we add consensus then we need N/2+1 Alertmanagers to be up at all times.
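The availability trade-off in point 2 is plain arithmetic, sketched below. The function names are made up for this example; the N/2+1 figure is the majority quorum mentioned above.

```go
package main

import "fmt"

// quorum is the number of Alertmanagers that must be up if a majority
// consensus is required: N/2+1 (integer division).
func quorum(n int) int {
	return n/2 + 1
}

// toleratedFailuresWithConsensus is how many members may be down while
// a majority is still reachable.
func toleratedFailuresWithConsensus(n int) int {
	return n - quorum(n)
}

// toleratedFailuresToday reflects the current design goal: alerts keep
// working as long as at least one Alertmanager is up.
func toleratedFailuresToday(n int) int {
	return n - 1
}

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("n=%d quorum=%d consensus tolerates %d down, today tolerates %d down\n",
			n, quorum(n), toleratedFailuresWithConsensus(n), toleratedFailuresToday(n))
	}
}
```

For a 3-node cluster that means one node may fail under consensus, compared with two under the current design.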

grobinson-grafana avatar May 09 '24 12:05 grobinson-grafana

@grobinson-grafana for 1) I can try implementing it by myself, but I'm not sure if I can manage 2) or if it's even feasible.

freak12techno avatar May 09 '24 16:05 freak12techno

Hi! :wave: Do you have time to evaluate some embedded k/v stores? That would be a fantastic contribution as we have discussed durable storage for Alertmanager in the past but haven't decided what to use.

For example, I know that Grafana Loki uses bbolt, but it would be nice to see a comparison of some other embedded databases. You could even include sqlite3. Alertmanager has avoided being dependent on other processes as it needs to operate even when these are unavailable, so that means no MySQL, PostgreSQL, memcache, redis, etc.

Second, it is not uncommon for users to have Alertmanager installations with 10,000s of alerts, so it would be nice to see some performance comparisons of different databases. I expect the workload to be write-heavy as reads will only happen at startup time.

grobinson-grafana avatar May 10 '24 10:05 grobinson-grafana

@grobinson-grafana so I looked a bit into how it's done for silences. Apparently it's all serialised into some binary format and stored on disk as a single file. Do you think it makes sense to do it the same way for alerts here as well, or would it be better to do it via a proper db? Basically we only need to read from it once to load all the alerts when starting Alertmanager and to write to it once an alert is created/updated.

(One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.)

freak12techno avatar May 10 '24 11:05 freak12techno

> One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.

Yes! That's the issue! :) It works for silences because silences are not created very often and you don't tend to have very many of them. But alerts are very different, and Alertmanager can be receiving 1000s of alerts per minute (i.e. the EndsAt timestamp needs to be updated to stop firing alerts from resolving).
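A back-of-the-envelope sketch of why the single-file approach scales badly for alerts: if the whole file is rewritten on every insert, the total bytes written grow quadratically with the number of alerts. The 500 bytes per serialized alert used below is purely an assumed figure for illustration, not a measurement.

```go
package main

import "fmt"

// snapshotBytes is the total written when n alerts are inserted one by
// one and the whole file is rewritten after each insert:
// size*(1 + 2 + ... + n) = size*n*(n+1)/2.
func snapshotBytes(n, size int64) int64 {
	return size * n * (n + 1) / 2
}

// perKeyBytes is the total written when each insert only writes the
// new alert (ignoring any storage-engine overhead).
func perKeyBytes(n, size int64) int64 {
	return size * n
}

func main() {
	const alerts = 10000 // the "10k inserts" from the comment above
	const size = 500     // assumed bytes per serialized alert
	fmt.Println(snapshotBytes(alerts, size)) // 25002500000 (about 25 GB)
	fmt.Println(perKeyBytes(alerts, size))   // 5000000 (about 5 MB)
}
```

This counts only the initial inserts; the periodic EndsAt updates mentioned above would add further full rewrites on top.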

grobinson-grafana avatar May 10 '24 11:05 grobinson-grafana

@grobinson-grafana okay, from my point of view, sqlite3 here doesn't make a lot of sense as it adds another layer of complexity by having to deal with db schema, so I think this won't be the best approach here.

Among the other kv databases, besides the bbolt that you've suggested, one cool option I found is https://github.com/dgraph-io/badger - it has quite a big community (it has more github stars than bbolt), is used by a lot of projects, and seems to be maintained. I haven't used either of these in my projects, so I can mostly look at the library's popularity and whether it's maintained - both seem fine on that front.

What do you think?

freak12techno avatar May 11 '24 23:05 freak12techno

Just as a quick note: there's kthxbye, which automatically extends silences (prefixed by "ACK!" in the default configuration) before they expire, as long as the alerts they match are still firing. While you still won't receive a resolved notification this way, you can set the silence duration to a shorter span and don't need to take care of a long-running silence.

schustersv avatar Jun 07 '24 06:06 schustersv

Can we please at least have Silence START/END notifications? Those won't require states, would they? (unless we want to survive "end of silence during restart" situation). Will save lots of medical expenses for many people. Possibly lives. (related: #2690 )

andrew-phi avatar Sep 06 '24 10:09 andrew-phi