Handle alertmanager alerts with identical hostname and servicename
I have 1 firing alert (6bd1c7fe217b28c2) which is logged as active but not shown in the UI.
My active filters are "Acknowledged hosts and services" and "Hosts and services down for maintenance".
Even when I disable all filters, it's not shown. Maybe it's because its description ("The API server is burning too much error budget.") is the same as that of an already silenced alert, but it has a different label ("short": "2h" instead of "short": "6h").
DEBUG: 2021-04-18 08:57:06.817782 DOMAIN1 detection config (map_to_status_information): 'message,summary,description'
DEBUG: 2021-04-18 08:57:06.817841 DOMAIN1 detection config (map_to_hostname): 'pod_name,namespace,instance'
DEBUG: 2021-04-18 08:57:06.817860 DOMAIN1 detection config (map_to_servicename): 'alertname'
DEBUG: 2021-04-18 08:57:06.817886 DOMAIN1 FetchURL: https://alertmanager.DOMAIN1/api/v2/alerts CGI Data: None
DEBUG: 2021-04-18 08:57:06.830410 DOMAIN1 received status code '200' with this content in result.result:
-----------------------------------------------------------------------------------------------------------------------------
[{"annotations":{"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"},"endsAt":"2021-04-18T07:08:01.069Z","fingerprint":"2e6357288f9e7b4d","receivers":[{"name":"null"}],"startsAt":"2021-04-13T06:29:01.069Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-04-18T06:56:01.072Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=vector%281%29\u0026g0.tab=1","labels":{"alertname":"Watchdog","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"none"}},{"annotations":{"description":"The API server is burning too much error budget.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn","summary":"The API server is burning too much error budget."},"endsAt":"2021-04-18T07:06:24.381Z","fingerprint":"6bd1c7fe217b28c2","receivers":[{"name":"email"}],"startsAt":"2021-04-18T06:18:24.381Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-04-18T06:54:24.383Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28apiserver_request%3Aburnrate1d%29+%3E+%283+%2A+0.01%29+and+sum%28apiserver_request%3Aburnrate2h%29+%3E+%283+%2A+0.01%29\u0026g0.tab=1","labels":{"alertname":"KubeAPIErrorBudgetBurn","long":"1d","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning","short":"2h"}},{"annotations":{"description":"The API server is burning too much error budget.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn","summary":"The API server is burning too much error budget."},"endsAt":"2021-04-18T07:06:24.381Z","fingerprint":"7b0814d213f92350","receivers":[{"name":"email"}],"startsAt":"2021-04-13T16:18:24.381Z","status":{"inhibitedBy":[],"silencedBy":["0928514d-9555-4fba-80de-e31c421dc1e1"],"state":"suppressed"},"updatedAt":"2021-04-18T06:54:24.383Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28apiserver_request%3Aburnrate3d%29+%3E+%281+%2A+0.01%29+and+sum%28apiserver_request%3Aburnrate6h%29+%3E+%281+%2A+0.01%29\u0026g0.tab=1","labels":{"alertname":"KubeAPIErrorBudgetBurn","long":"3d","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning","short":"6h"}},{"annotations":{"description":"Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit","summary":"Cluster has overcommitted memory resource 
requests."},"endsAt":"2021-04-18T07:07:36.809Z","fingerprint":"d98d85c33827c631","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["63fbf3e0-baa5-4a73-b935-381739089357"],"state":"suppressed"},"updatedAt":"2021-04-18T06:55:36.812Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_memory_bytes%3Asum%29+%2F+sum%28kube_node_status_allocatable_memory_bytes%29+%3E+%28count%28kube_node_status_allocatable_memory_bytes%29+-+1%29+%2F+count%28kube_node_status_allocatable_memory_bytes%29\u0026g0.tab=1","labels":{"alertname":"KubeMemoryOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}},{"annotations":{"description":"Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit","summary":"Cluster has overcommitted CPU resource requests."},"endsAt":"2021-04-18T07:07:36.809Z","fingerprint":"f54b182e59fd8d9e","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["0ff8c148-49af-49a7-bfeb-f820608822f0"],"state":"suppressed"},"updatedAt":"2021-04-18T06:55:36.811Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_cpu_cores%3Asum%29+%2F+sum%28kube_node_status_allocatable_cpu_cores%29+%3E+%28count%28kube_node_status_allocatable_cpu_cores%29+-+1%29+%2F+count%28kube_node_status_allocatable_cpu_cores%29\u0026g0.tab=1","labels":{"alertname":"KubeCPUOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}}]
-----------------------------------------------------------------------------------------------------------------------------
DEBUG: 2021-04-18 08:57:06.830663 DOMAIN1 processing alert with fingerprint '2e6357288f9e7b4d':
DEBUG: 2021-04-18 08:57:06.830688 DOMAIN1 [2e6357288f9e7b4d]: detected severity from labels 'NONE' -> skipping alert
DEBUG: 2021-04-18 08:57:06.830700 DOMAIN1 processing alert with fingerprint '6bd1c7fe217b28c2':
DEBUG: 2021-04-18 08:57:06.830713 DOMAIN1 [6bd1c7fe217b28c2]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.830731 DOMAIN1 [6bd1c7fe217b28c2]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.830745 DOMAIN1 [6bd1c7fe217b28c2]: detected servicename from labels: 'KubeAPIErrorBudgetBurn'
DEBUG: 2021-04-18 08:57:06.831293 DOMAIN1 [6bd1c7fe217b28c2]: detected status: 'active'
DEBUG: 2021-04-18 08:57:06.831729 DOMAIN1 processing alert with fingerprint '7b0814d213f92350':
DEBUG: 2021-04-18 08:57:06.831753 DOMAIN1 [7b0814d213f92350]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.831769 DOMAIN1 [7b0814d213f92350]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.831783 DOMAIN1 [7b0814d213f92350]: detected servicename from labels: 'KubeAPIErrorBudgetBurn'
DEBUG: 2021-04-18 08:57:06.832176 DOMAIN1 [7b0814d213f92350]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-18 08:57:06.832558 DOMAIN1 processing alert with fingerprint 'd98d85c33827c631':
DEBUG: 2021-04-18 08:57:06.832583 DOMAIN1 [d98d85c33827c631]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.832599 DOMAIN1 [d98d85c33827c631]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.832613 DOMAIN1 [d98d85c33827c631]: detected servicename from labels: 'KubeMemoryOvercommit'
DEBUG: 2021-04-18 08:57:06.832982 DOMAIN1 [d98d85c33827c631]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-18 08:57:06.833343 DOMAIN1 processing alert with fingerprint 'f54b182e59fd8d9e':
DEBUG: 2021-04-18 08:57:06.833366 DOMAIN1 [f54b182e59fd8d9e]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.833382 DOMAIN1 [f54b182e59fd8d9e]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.833395 DOMAIN1 [f54b182e59fd8d9e]: detected servicename from labels: 'KubeCPUOvercommit'
DEBUG: 2021-04-18 08:57:06.833799 DOMAIN1 [f54b182e59fd8d9e]: detected status: 'suppressed' -> interpreting as silenced
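Reading this together with the detection config above, this is effectively what happens (a minimal sketch, not Nagstamon's actual code; the alerts are reduced to the relevant fields):

MAP_TO_HOSTNAME = ['pod_name', 'namespace', 'instance']  # from map_to_hostname
MAP_TO_SERVICENAME = ['alertname']                       # from map_to_servicename

def first_matching_label(labels, candidates, fallback='unknown'):
    """Return the value of the first configured label that is present."""
    for key in candidates:
        if labels.get(key):
            return labels[key]
    return fallback

alerts = [
    {'fingerprint': '6bd1c7fe217b28c2', 'state': 'active',
     'labels': {'alertname': 'KubeAPIErrorBudgetBurn', 'long': '1d', 'short': '2h'}},
    {'fingerprint': '7b0814d213f92350', 'state': 'suppressed',
     'labels': {'alertname': 'KubeAPIErrorBudgetBurn', 'long': '3d', 'short': '6h'}},
]

services = {}
for alert in alerts:
    host = first_matching_label(alert['labels'], MAP_TO_HOSTNAME)     # 'unknown' for both
    name = first_matching_label(alert['labels'], MAP_TO_SERVICENAME)  # 'KubeAPIErrorBudgetBurn' for both
    services[(host, name)] = alert  # the second (silenced) alert overwrites the first (firing) one

print(len(services))                                             # 1
print(services[('unknown', 'KubeAPIErrorBudgetBurn')]['state'])  # suppressed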
Here's an example screenshot taken later with 2 alerts with identical hostname and servicename, where only one is shown in Nagstamon:
Btw, those alerts are created automatically by installing kube-prometheus via the kube-prometheus-stack helm chart, so I have no control over how they are configured. There are a few alerts like this with identical hostnames and servicenames which only differ by other labels such as long="1d" vs. long="3d".
Copying & pasting the discussion from https://github.com/HenriWahl/Nagstamon/issues/709#issuecomment-821963677:
@stearz:
The combination of hostname and servicename can exist only once per monitor in Nagstamon.
The second (silenced) alert overwrites the first (firing) one.
I am not sure how to solve this.
You could try to use some other field for the hostname by changing the hostname detection map, but there is no guarantee that this will not occur again.
@varac / @HenriWahl
Do you have any ideas?
@varac:
I'm not familiar with how Nagstamon indexes alerts, but what if Nagstamon used the alertmanager fingerprint as the primary index, so it could handle alerts that share the same alertname/description?
This is basically the last blocker keeping me from switching to Nagstamon with the alertmanager integration, because right now I can't rely on Nagstamon to show me all alerts.
I will take a look at this.
OK, after a little triage I can tell that this cannot be achieved without some general refactoring in Nagstamon.
@HenriWahl Do you have any idea if and how we can support multiple alerts with the same hostname and servicename without breaking things? I am not sure whether switching to a unique_id-based index would hurt the other monitor integrations; as far as I understand, Prometheus and Alertmanager are currently the only monitors that produce multiple alerts with the same host and servicename combination.
As an alternative I would propose that we introduce another service_name attribute which is only used for visualization. This would give me the opportunity to use a different servicename for Alertmanager internally (e.g. service.name + '_' + service.fingerprint).
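Roughly sketched, that split could look like this (display_name is a hypothetical attribute used here only for illustration; nothing like it exists in Nagstamon yet):

class AlertmanagerService:
    def __init__(self, name, fingerprint):
        self.fingerprint = fingerprint
        # unique internal key, so two alerts with the same alertname
        # no longer overwrite each other
        self.name = name + '_' + fingerprint
        # display-only name, what the UI would actually show
        self.display_name = name

firing = AlertmanagerService('KubeAPIErrorBudgetBurn', '6bd1c7fe217b28c2')
silenced = AlertmanagerService('KubeAPIErrorBudgetBurn', '7b0814d213f92350')

assert firing.name != silenced.name                   # distinct internal keys
assert firing.display_name == silenced.display_name   # identical label in the UI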
@stearz yes, the current scheme comes from a pure Nagios environment, once upon a time... and all the other supported server types have their roots somewhat in the same host-service idea.
Yes, your idea sounds good.
Seems to be the same problem here: #738
I use this patch as a crude workaround:
--- a/Nagstamon/Servers/Alertmanager/alertmanagerserver.py
+++ b/Nagstamon/Servers/Alertmanager/alertmanagerserver.py
@@ -223,7 +223,10 @@ class AlertmanagerServer(GenericServer):
             service = AlertmanagerService()
             service.host = alert_data['host']
-            service.name = alert_data['name']
+            service.name = alert_data['name'] + " " + alert_data['fingerprint']
             service.server = alert_data['server']
             service.status = alert_data['status']
             service.labels = alert_data['labels']
It makes the service column look a bit weird, but Nagstamon is otherwise fully functional and I can see all alerts.
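A slightly gentler variant of the same workaround (an untested sketch, not part of any merged patch) would append only a short prefix of the fingerprint, keeping the service column readable while still making the host/servicename combination unique per alert:

service.name = alert_data['name'] + " " + alert_data['fingerprint'][:8]  # e.g. 'KubeAPIErrorBudgetBurn 6bd1c7fe'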