
Handle alertmanager alerts with identical hostname and servicename

varac opened this issue 3 years ago · 8 comments

I have one firing alert (6bd1c7fe217b28c2) which is logged as active but not shown in the UI. My active filters are "Acknowledged hosts and services" and "Hosts and services down for maintenance". Even when I disable all filters, it is not shown. Maybe it's because its description ("The API server is burning too much error budget.") is the same as that of an already silenced alert, while only a label differs ("short": "2h" instead of "short": "6h").

DEBUG: 2021-04-18 08:57:06.817782 DOMAIN1 detection config (map_to_status_information): 'message,summary,description'
DEBUG: 2021-04-18 08:57:06.817841 DOMAIN1 detection config (map_to_hostname): 'pod_name,namespace,instance'
DEBUG: 2021-04-18 08:57:06.817860 DOMAIN1 detection config (map_to_servicename): 'alertname'
DEBUG: 2021-04-18 08:57:06.817886 DOMAIN1 FetchURL: https://alertmanager.DOMAIN1/api/v2/alerts CGI Data: None
DEBUG: 2021-04-18 08:57:06.830410 DOMAIN1 received status code '200' with this content in result.result:
-----------------------------------------------------------------------------------------------------------------------------
[{"annotations":{"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"},"endsAt":"2021-04-18T07:08:01.069Z","fingerprint":"2e6357288f9e7b4d","receivers":[{"name":"null"}],"startsAt":"2021-04-13T06:29:01.069Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-04-18T06:56:01.072Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=vector%281%29\u0026g0.tab=1","labels":{"alertname":"Watchdog","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"none"}},{"annotations":{"description":"The API server is burning too much error budget.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn","summary":"The API server is burning too much error budget."},"endsAt":"2021-04-18T07:06:24.381Z","fingerprint":"6bd1c7fe217b28c2","receivers":[{"name":"email"}],"startsAt":"2021-04-18T06:18:24.381Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-04-18T06:54:24.383Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28apiserver_request%3Aburnrate1d%29+%3E+%283+%2A+0.01%29+and+sum%28apiserver_request%3Aburnrate2h%29+%3E+%283+%2A+0.01%29\u0026g0.tab=1","labels":{"alertname":"KubeAPIErrorBudgetBurn","long":"1d","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning","short":"2h"}},{"annotations":{"description":"The API server is burning too much error budget.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn","summary":"The API server is burning too much error budget."},"endsAt":"2021-04-18T07:06:24.381Z","fingerprint":"7b0814d213f92350","receivers":[{"name":"email"}],"startsAt":"2021-04-13T16:18:24.381Z","status":{"inhibitedBy":[],"silencedBy":["0928514d-9555-4fba-80de-e31c421dc1e1"],"state":"suppressed"},"updatedAt":"2021-04-18T06:54:24.383Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28apiserver_request%3Aburnrate3d%29+%3E+%281+%2A+0.01%29+and+sum%28apiserver_request%3Aburnrate6h%29+%3E+%281+%2A+0.01%29\u0026g0.tab=1","labels":{"alertname":"KubeAPIErrorBudgetBurn","long":"3d","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning","short":"6h"}},{"annotations":{"description":"Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit","summary":"Cluster has overcommitted memory resource 
requests."},"endsAt":"2021-04-18T07:07:36.809Z","fingerprint":"d98d85c33827c631","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["63fbf3e0-baa5-4a73-b935-381739089357"],"state":"suppressed"},"updatedAt":"2021-04-18T06:55:36.812Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_memory_bytes%3Asum%29+%2F+sum%28kube_node_status_allocatable_memory_bytes%29+%3E+%28count%28kube_node_status_allocatable_memory_bytes%29+-+1%29+%2F+count%28kube_node_status_allocatable_memory_bytes%29\u0026g0.tab=1","labels":{"alertname":"KubeMemoryOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}},{"annotations":{"description":"Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit","summary":"Cluster has overcommitted CPU resource requests."},"endsAt":"2021-04-18T07:07:36.809Z","fingerprint":"f54b182e59fd8d9e","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["0ff8c148-49af-49a7-bfeb-f820608822f0"],"state":"suppressed"},"updatedAt":"2021-04-18T06:55:36.811Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_cpu_cores%3Asum%29+%2F+sum%28kube_node_status_allocatable_cpu_cores%29+%3E+%28count%28kube_node_status_allocatable_cpu_cores%29+-+1%29+%2F+count%28kube_node_status_allocatable_cpu_cores%29\u0026g0.tab=1","labels":{"alertname":"KubeCPUOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}}]
-----------------------------------------------------------------------------------------------------------------------------
DEBUG: 2021-04-18 08:57:06.830663 DOMAIN1 processing alert with fingerprint '2e6357288f9e7b4d':
DEBUG: 2021-04-18 08:57:06.830688 DOMAIN1 [2e6357288f9e7b4d]: detected severity from labels 'NONE' -> skipping alert
DEBUG: 2021-04-18 08:57:06.830700 DOMAIN1 processing alert with fingerprint '6bd1c7fe217b28c2':
DEBUG: 2021-04-18 08:57:06.830713 DOMAIN1 [6bd1c7fe217b28c2]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.830731 DOMAIN1 [6bd1c7fe217b28c2]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.830745 DOMAIN1 [6bd1c7fe217b28c2]: detected servicename from labels: 'KubeAPIErrorBudgetBurn'
DEBUG: 2021-04-18 08:57:06.831293 DOMAIN1 [6bd1c7fe217b28c2]: detected status: 'active'
DEBUG: 2021-04-18 08:57:06.831729 DOMAIN1 processing alert with fingerprint '7b0814d213f92350':
DEBUG: 2021-04-18 08:57:06.831753 DOMAIN1 [7b0814d213f92350]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.831769 DOMAIN1 [7b0814d213f92350]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.831783 DOMAIN1 [7b0814d213f92350]: detected servicename from labels: 'KubeAPIErrorBudgetBurn'
DEBUG: 2021-04-18 08:57:06.832176 DOMAIN1 [7b0814d213f92350]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-18 08:57:06.832558 DOMAIN1 processing alert with fingerprint 'd98d85c33827c631':
DEBUG: 2021-04-18 08:57:06.832583 DOMAIN1 [d98d85c33827c631]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.832599 DOMAIN1 [d98d85c33827c631]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.832613 DOMAIN1 [d98d85c33827c631]: detected servicename from labels: 'KubeMemoryOvercommit'
DEBUG: 2021-04-18 08:57:06.832982 DOMAIN1 [d98d85c33827c631]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-18 08:57:06.833343 DOMAIN1 processing alert with fingerprint 'f54b182e59fd8d9e':
DEBUG: 2021-04-18 08:57:06.833366 DOMAIN1 [f54b182e59fd8d9e]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.833382 DOMAIN1 [f54b182e59fd8d9e]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.833395 DOMAIN1 [f54b182e59fd8d9e]: detected servicename from labels: 'KubeCPUOvercommit'
DEBUG: 2021-04-18 08:57:06.833799 DOMAIN1 [f54b182e59fd8d9e]: detected status: 'suppressed' -> interpreting as silenced
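
For illustration, here is a minimal sketch (plain Python, not Nagstamon code; the candidate lists mirror the map_to_hostname / map_to_servicename settings from the debug output above, and the lookup order is an assumption) of why both KubeAPIErrorBudgetBurn alerts end up with the same hostname/servicename pair:

# Illustration only: mimics the hostname/servicename detection seen in the log above.
MAP_TO_HOSTNAME = ["pod_name", "namespace", "instance"]
MAP_TO_SERVICENAME = ["alertname"]

def detect(labels, candidates, fallback="unknown"):
    # Take the first candidate label that is present, otherwise fall back.
    for key in candidates:
        if key in labels:
            return labels[key]
    return fallback

alerts = [
    {"fingerprint": "6bd1c7fe217b28c2",
     "labels": {"alertname": "KubeAPIErrorBudgetBurn", "short": "2h", "long": "1d"}},
    {"fingerprint": "7b0814d213f92350",
     "labels": {"alertname": "KubeAPIErrorBudgetBurn", "short": "6h", "long": "3d"}},
]

for alert in alerts:
    host = detect(alert["labels"], MAP_TO_HOSTNAME)
    service = detect(alert["labels"], MAP_TO_SERVICENAME)
    print(alert["fingerprint"], "->", (host, service))
# Both alerts map to ('unknown', 'KubeAPIErrorBudgetBurn'), so they collide.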

Here's an example screenshot taken later with 2 alerts with identical hostname and servicename, where only one is shown in Nagstamon:

[screenshot]

varac · May 06 '21 06:05

Btw, those alerts are configured automatically by installing kube-prometheus via the kube-prometheus-stack Helm chart, so I have no control over how they are configured. There are a few alerts like this with identical hostnames and servicenames which only differ in other labels, such as long="1d" vs. long="3d".

varac · May 06 '21 06:05

Copy & pasting the discussion from https://github.com/HenriWahl/Nagstamon/issues/709#issuecomment-821963677:

@stearz:

The combination of hostname and servicename can exist only once per monitor in Nagstamon.
The second (silenced) alert overwrites the first (firing) one.

I am not sure how to solve this.

You could try to use some other field for the hostname by changing the hostname detection map, but there's no guarantee that the collision won't occur again.

@varac / @HenriWahl
Do you have any ideas?

@varac:

I'm not familiar with how Nagstamon indexes alerts, but what if Nagstamon used the Alertmanager fingerprint as the primary index, so it could handle alerts that share the same alertname/description?
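
As a toy illustration of the two indexing schemes (a sketch only, not how Nagstamon actually stores services):

# Sketch: compare keying alerts by (host, name) vs. by Alertmanager fingerprint.
alerts = [
    {"fingerprint": "6bd1c7fe217b28c2", "host": "unknown",
     "name": "KubeAPIErrorBudgetBurn", "status": "active"},
    {"fingerprint": "7b0814d213f92350", "host": "unknown",
     "name": "KubeAPIErrorBudgetBurn", "status": "suppressed"},
]

by_host_and_name = {(a["host"], a["name"]): a for a in alerts}
by_fingerprint = {a["fingerprint"]: a for a in alerts}

print(len(by_host_and_name))  # 1 -- the suppressed alert has overwritten the firing one
print(len(by_fingerprint))    # 2 -- both alerts are kept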

varac · May 06 '21 06:05

This is basically the last blocker keeping me from switching to Nagstamon with the Alertmanager integration, because at this point I can't rely on Nagstamon to show me all alerts.

varac · May 06 '21 06:05

I will take a look at this.

stearz · May 24 '21 18:05

OK, after a little triage I can tell that this cannot be achieved without some general refactoring in Nagstamon.

@HenriWahl Do you have any idea if and how we can support multiple alerts with the same hostname and servicename without breaking things? I am not sure whether switching to a unique-ID-based index would hurt the other monitor integrations; as I understand it, Prometheus and Alertmanager are currently the only monitors that produce multiple alerts with the same hostname and servicename combination.

As an alternative I would propose that we introduce another service_name attribute which is only used for visualization. This would give me the opportunity to use a different servicename for Alertmanager internally (e.g. service.name + '_' + service.fingerprint).
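
A rough sketch of that idea (class and attribute names are made up for illustration, not an existing Nagstamon API):

# Sketch: keep the user-facing name separate from the internal, unique key.
class AlertmanagerServiceSketch:
    def __init__(self, name, fingerprint):
        # What the UI would display.
        self.display_name = name
        # What Nagstamon would use internally, unique per alert, so two alerts
        # with the same alertname no longer overwrite each other.
        self.name = name + "_" + fingerprint

svc = AlertmanagerServiceSketch("KubeAPIErrorBudgetBurn", "6bd1c7fe217b28c2")
print(svc.display_name)  # KubeAPIErrorBudgetBurn
print(svc.name)          # KubeAPIErrorBudgetBurn_6bd1c7fe217b28c2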

stearz · May 24 '21 19:05

@stearz Yes, the current scheme comes from a pure Nagios environment, once upon a time... and all the other supported server types have their roots in more or less the same host/service idea.

Yes, your idea sounds good.

HenriWahl · May 25 '21 06:05

Seems to be the same problem here: #738

matgn · Jun 08 '21 14:06

I use this patch as a crude workaround:

--- a/Nagstamon/Servers/Alertmanager/alertmanagerserver.py
+++ b/Nagstamon/Servers/Alertmanager/alertmanagerserver.py
@@ -223,7 +223,7 @@ class AlertmanagerServer(GenericServer):
 
                 service = AlertmanagerService()
                 service.host = alert_data['host']
-                service.name = alert_data['name']
+                service.name = alert_data['name'] + " " + alert_data['fingerprint']
                 service.server = alert_data['server']
                 service.status = alert_data['status']
                 service.labels = alert_data['labels']

It makes the service column look a bit weird, but Nagstamon is otherwise fully functional and I can see all alerts.
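
If the full fingerprint makes the column too noisy, a shortened suffix should still be unique in practice; a hypothetical helper (not Nagstamon code) along these lines:

# Hypothetical helper: append only a short fingerprint prefix to keep the
# service name unique while keeping the column readable.
def unique_service_name(name, fingerprint, prefix_len=8):
    return name + " " + fingerprint[:prefix_len]

print(unique_service_name("KubeAPIErrorBudgetBurn", "6bd1c7fe217b28c2"))
# -> KubeAPIErrorBudgetBurn 6bd1c7fe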

cure · Dec 27 '22 14:12