Alert PrometheusNotConnectedToAlertmanagers in alerts.libsonnet does not work
File documentation/prometheus-mixin/alerts.libsonnet contains alert PrometheusNotConnectedToAlertmanagers, which uses prometheus_notifications_alertmanagers_discovered.
That alert is not working for me. More information is here:
"How to detect lost connection to Alertmanager" (Saturday, 20 November 2021): https://groups.google.com/g/prometheus-users/c/vo5PRmu-AA8
From that discussion: Alternative 1) Scrape the Alertmanager and alert based on up{job="alertmanager"}. Alternative 2) Use an alert based on prometheus_notifications_errors_total.
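For illustration, Alternative 1 could look roughly like the rule below. This is only a sketch: the group name, the alert name AlertmanagerDown, the 5m duration, and the severity label are assumptions, and it presumes Prometheus scrapes the Alertmanager under a job called "alertmanager".

```yaml
groups:
  - name: alertmanager-reachability      # hypothetical group name
    rules:
      - alert: AlertmanagerDown          # hypothetical alert name
        # up is 0 when the most recent scrape of the target failed
        expr: up{job="alertmanager"} == 0
        for: 5m                          # assumed duration, tune as needed
        labels:
          severity: critical             # assumed label value
        annotations:
          summary: "Prometheus cannot scrape Alertmanager {{ $labels.instance }}."
```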
There is already a PrometheusErrorSendingAlertsToSomeAlertmanagers alert, which is based on prometheus_notifications_errors_total and prometheus_notifications_sent_total. It should fire when Prometheus cannot send alerts to Alertmanagers.
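A rule of that kind can be sketched as an error ratio over the two counters. The snippet below is not copied from the mixin: the group name, the 5m rate window, the 1% threshold, and the 15m duration are assumptions chosen for illustration.

```yaml
groups:
  - name: prometheus-notifications       # hypothetical group name
    rules:
      - alert: PrometheusErrorSendingAlertsToSomeAlertmanagers
        # Percentage of notifications that failed, per Alertmanager
        expr: |
          (
            rate(prometheus_notifications_errors_total[5m])
          /
            rate(prometheus_notifications_sent_total[5m])
          ) * 100 > 1
        for: 15m                         # assumed duration
        labels:
          severity: warning              # assumed label value
```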
Having a TargetDown alert based on the up metric is a good idea as a catch-all in case anything else fails, but I don't think it should be shipped with the Prometheus mixin.
As for PrometheusNotConnectedToAlertmanagers, it is best used in conjunction with PrometheusErrorSendingAlertsToSomeAlertmanagers: they cover different things and together give you the whole picture. The former is about the discovery of Alertmanagers, whereas the latter is about actually sending alerts.
Regarding PrometheusNotConnectedToAlertmanagers, what do you mean by discovery?
When I stop the prometheus-alertmanager service before starting the prometheus service, the Alertmanager is still "discovered" by Prometheus. If the Alertmanager was running, and I stop it, the Alertmanager remains "discovered" by Prometheus. What does that metric actually measure?
According to the metric description, it measures "The number of alertmanagers discovered and active." - https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L191-L194. However, on closer inspection it seems to look only at the service discovery aspect, and it always discovers Alertmanagers that are statically defined. Some could consider this a bug, but if we assume the metric is only about service discovery, then a statically defined endpoint counts as a discovered endpoint regardless of the state of the application behind it. This in turn can lead to the results you described.
Regardless, I think that either the metric description should be changed (the metric does not reflect whether a discovered Alertmanager is active) or the logic behind the metric value should be changed so that it does reflect whether the discovered Alertmanager is active.
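To make "statically defined" concrete, here is a minimal sketch of such a setup in prometheus.yml. The target address localhost:9093 is an assumption. With a configuration like this, the behavior described above would mean the target is counted in prometheus_notifications_alertmanagers_discovered whether or not anything is listening on that address.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # assumed address; counted as discovered as long as it is configured
```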
Hello from the bug scrub!
@beorn7 (which is me :duck:) dropped the ball on this, but by now, he isn't maintaining the mixins anymore. @metalmatze we thought you should have a look and make a call.
Reading through this, I don't have a clear path forward either.
I guess since PrometheusNotConnectedToAlertmanagers can be misleading and we already have PrometheusErrorSendingAlertsToSomeAlertmanagers in the mixins, I would propose removing PrometheusNotConnectedToAlertmanagers in favor of the other one.
What do others think?
From a quick glance, it looks like we should rename PrometheusNotConnectedToAlertmanagers to something like PrometheusNoAlertmanagersDiscovered. If any further descriptions are unclear, they should be updated.
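As a rough sketch of what the renamed rule might look like: the expression, window, duration, and annotation text below are assumptions for illustration, not the mixin's actual definition.

```yaml
groups:
  - name: prometheus-notifications       # hypothetical group name
    rules:
      - alert: PrometheusNoAlertmanagersDiscovered   # proposed name from this thread
        # Fires only when service discovery yields no Alertmanager at all;
        # it says nothing about whether a discovered Alertmanager is reachable.
        expr: max_over_time(prometheus_notifications_alertmanagers_discovered[5m]) < 1
        for: 10m                         # assumed duration
        annotations:
          summary: "Prometheus has not discovered any Alertmanagers."
```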