AlertManager not sending all alerts to Webhook endpoint.
Hi,
I'm using a webhook receiver for AlertManager to store alerts for pagination etc. For the most part, the webhook seems to be working just fine, but for some alerts, the webhook doesn't seem to receive a POST call at all from AlertManager.
Is there any way to troubleshoot this? For example, a way to trace alertmanager's outgoing HTTP calls to the webhook receiver?
The webhook endpoint is a Rails application server which also logs all incoming traffic, and after investigating, the missing alerts never show up in the logs (a POST request is never received).
- What I expect: All alerts go through to the webhook endpoint
- What I see: Only some alerts make it through to the webhook endpoint (Rails application that logs incoming requests)
I've attached a partial configuration, omitting redundant receivers etc. They're almost all the same.
Thanks,
Environment

- System information:

  Linux 4.14.186-146.268.amzn2.x86_64 x86_64

- Alertmanager version:

  ```
  alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user: root@dee35927357f
  build date: 20200617-08:54:02
  go version: go1.14.4
  ```

- Prometheus version:

  ```
  prometheus, version 2.22.0 (branch: HEAD, revision: 0a7fdd3b76960808c3a91d92267c3d815c1bc354)
  build user: root@6321101b2c50
  build date: 20201015-12:29:59
  go version: go1.15.3
  platform: linux/amd64
  ```

- Alertmanager configuration file:
```
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: [email protected]
  smtp_hello: localhost
  smtp_smarthost: smtp.office365.com:587
  smtp_auth_username: [email protected]
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  routes:
  - receiver: Test Presence Offline Notification Name
    match_re:
      alertname: ^(Test Presence Offline Alert Name)$
      group_id: 460599d4-3c4a-4311-a7d6-bdce6058672a
      tenant_name: ^(vle)$
    continue: true
    repeat_interval: 10y
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://127.0.0.1/v1/webhook
    max_alerts: 0
- name: Test Presence Offline Notification Name
  email_configs:
  - send_resolved: false
    to: [email protected]
    from: [email protected]
    hello: localhost
    smarthost: smtp.office365.com:587
    auth_username: [email protected]
    auth_password: <secret>
    headers:
      From: [email protected]
      Smtp_from: [email protected]
      Subject: 'Alert: {{ range .Alerts }}{{ .Labels.device_name }}{{ end }} | {{ range .Alerts }}{{ .Annotations.description }}{{ end }} | {{ range .Alerts }}{{ .Labels.uid }}{{ end }}'
      To: [email protected]
      X-SES-CONFIGURATION-SET: ses-kibana
    html: '{{ template "email.default.html" . }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}Rule: {{ range .Alerts }}{{ .Labels.alertname }}{{ end }}Group: {{ range .Alerts }}{{ .Labels.group_name }}{{ end }}Device Name: {{ range .Alerts }}{{ .Labels.device_name }}{{ end }}Serial Number: {{ range .Alerts }}{{ .Labels.uid }}{{ end }}'
    require_tls: true
templates:
- /etc/alertmanager/templates/default.tmpl
```
Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?
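For anyone reproducing this, a minimal sketch of running Alertmanager with debug logging enabled; the binary invocation and config path are assumptions, adjust them to your own deployment:

```sh
# Assumed config path; --log.level=debug is the flag referenced above.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --log.level=debug
```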
@simonpasquier We can see the alerts in Prometheus as well as in Alertmanager, so the alerts fire properly. On our webhook application side, we've logged everything, and we notice that not every alert that fires in Alertmanager makes its way to our webhook endpoint. We can see the POST requests from Alertmanager to our webhook for some of the alerts, but others are completely missing.
Honestly, the only reason we're using the webhook in the first place is that Alertmanager doesn't support pagination when querying for alerts/groups. So we're using the webhook to receive all alerts/resolutions and storing them ourselves so we can paginate them manually. Our applications and metrics can generate tens of thousands of alerts, which causes requests to Alertmanager to sometimes time out when the payloads are too large.
Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?
@simonpasquier I've run Alertmanager with debug logs and can confirm that alerts are received by Alertmanager but not sent to the webhook; the email integration does get sent, though.
Shouldn't all alerts route to the default route (which is set as the webhook)?
```
level=debug ts=2020-10-30T18:03:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:03:05.592Z caller=notify.go:685 component=dispatcher receiver="IP Show Logo Alert Notif Name" integration=email[0] msg="Notify success" attempts=1
level=debug ts=2020-10-30T18:04:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:06:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:08:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:08:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
```
I encounter a similar issue with the same Alertmanager version (0.21), but our Prometheus is on v2.19. In our case, some POST requests are missing even though there are active alerts.
Another issue is that some POST requests seem to have information missing.
For example, if there is an active alert group containing 5 nodes, we receive 2 POST requests. The first one is incomplete because it is missing some nodes:
```
{
  "alerts": [
    {
      ...
      "instance": "node1.demo.com:9100",
      ...
      "instance": "node2.demo.com:9100",
      ...
      "instance": "node5.demo.com:9100",
      ...
      "status": "firing"
    }
  ],
  ...
}
```
And the second POST request is the complete one, with all 5 nodes:
```
{
  "alerts": [
    {
      ...
      "instance": "node1.demo.com:9100",
      ...
      "instance": "node2.demo.com:9100",
      ...
      "instance": "node3.demo.com:9100",
      ...
      "instance": "node4.demo.com:9100",
      ...
      "instance": "node5.demo.com:9100",
      ...
      "status": "firing"
    }
  ],
  ...
}
```
@andrewipmtl
Shouldn't all alerts route to the default route (which is set as the webhook)?
No, alerts that match the Test Presence Offline Notification Name receiver won't go through the top-level route.
@mvineza this seems to be a different problem.
For example, if there is an active alert containing 5 nodes grouped together.
We will receive 2 POST request. The first one is incomplete because it has missing nodes.
You have 5 alerts then and it may be that they are not sent at the same time by Prometheus.
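To illustrate the timing aspect, here is a hypothetical timeline; the group_wait/group_interval values are illustrative, since mvineza's config isn't shown in this thread:

```yaml
# Hypothetical timeline for one aggregation group,
# assuming group_wait: 30s and group_interval: 5m:
#
#   t=0s      alerts for node1, node2, node5 arrive -> a new group is created
#   t=30s     group_wait expires -> first POST contains only those 3 alerts
#   t=45s     alerts for node3, node4 arrive and join the same group
#   t=5m30s   group_interval expires -> second POST contains all 5 alerts
```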
@andrewipmtl
Shouldn't all alerts route to the default route (which is set as the webhook)?
No, alerts that match the Test Presence Offline Notification Name receiver won't go through the top-level route.
Even though it has the continue flag set to 'true'? Is there any way to make all alerts hit the webhook no matter what?
We have a system where we want to store the alerts so that we can paginate them (webhook) but also only send notifications out for specific ones. Even if we configure an email notification for one of the alerts, we still want it to hit the webhook.
Has this problem been solved? I have encountered the same problem: when alerts are grouped, the webhook loses some of them. My configuration is as follows:
Image: quay.io/prometheus/alertmanager:v0.21.0
```
route:
  receiver: webhook
  group_by:
  - alertname
  routes:
  - receiver: webhook
    continue: true
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 4h
receivers:
- name: webhook
  webhook_configs:
  - send_resolved: true
    url: http://os-alertmgt-svc.prometheus-monitoring.svc:3000/api/v1/alert/webhook
templates:
- /etc/alertmanager/config/email.tmpl
```
In the Alertmanager page I can see the following alerts, but after they pass through the webhook I can hardly ever see the complete set:
- alertname="aa": 4 alerts
- alertname="we": 114 alerts
- alertname="wewqd": 171 alerts
I've configured the webhook as another route on top of being the default route, and I'm still seeing some alerts not being sent through to the webhook.
@andrewipmtl can you share the new config?
Hello! I'm running Prometheus 2.22.0 and Alertmanager v0.16.2 for OpenShift platform monitoring and am also observing some messages not being sent to the webhook endpoint. I use only one default route for all messages in Alertmanager. Alertmanager runs in debug mode so I can easily follow all events. At the webhook endpoint I log all events coming from Alertmanager. Here are my findings:
- Alertmanager always resolves messages (when tailing Alertmanager's logs) but does not always send them to the webhook. It looks like around 10-15% of events are NOT POSTed to the webhook.
- The alerts related to pod availability (like TargetDown, KubePodCrashLoop, etc.) seem to be the ones most exposed to the issue. (I mostly use the default alert set from the Prometheus Operator on OpenShift.) Not sure if this observation is correct, since those types of alerts are also the most frequent...
- There are some alerts which are always properly resolved (e.g. my bash script for alert generation, which I ran hundreds of times, never resulted in an unresolved message).
- Not sure if this has something to do with alert grouping. Since my alert volume is low, I recently disabled grouping completely on the Alertmanager side (I set group_by: ['...'], see the sketch below) to see whether it is related to the issue.
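For reference, a minimal sketch of what disabling grouping with the special '...' value looks like; the receiver name is illustrative, not taken from the poster's config:

```yaml
route:
  receiver: webhook   # illustrative receiver name
  # The special value '...' groups by all labels, so every distinct
  # alert label set gets its own group (grouping is effectively disabled).
  group_by: ['...']
```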
@andrewipmtl can you share the new config?
```
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: [email protected]
  smtp_hello: localhost
  smtp_smarthost: smtp.office365.com:587
  smtp_auth_username: [email protected]
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  routes:
  - receiver: device-alerts.hook
    match_re:
      alertname: .*
    continue: true
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://127.0.0.1/v1/webhook
    max_alerts: 0
templates:
- /etc/alertmanager/templates/default.tmpl
```
@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?
```
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
```
@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?
I've tried it without subroutes as well, and the webhook still doesn't receive all the alerts; some still go missing.
Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.
Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.
The exact same thing happens as when I tested it in an earlier debug session: https://github.com/prometheus/alertmanager/issues/2404#issuecomment-719715603
Alerts show up, but aren't sent to the webhook endpoint.
same with me
Facing same.
Forgive me if I misunderstood your initial question, but I think y'all didn't get the point.
The default receiver for a route node (including the top-level node) is only used if your alert didn't match any matchers declared at that level of the routing tree. Your alerts enter the routing tree from the top and traverse it down until they match some matcher, and then that node's receiver receives the alert.
If you set "continue: true", the alert will continue matching the sibling routes, meaning that it will try to match another matcher at the same level.
Therefore, if you want your webhook to receive all the alerts, it must be declared properly, in combination with "continue: true", at every level that your alerts match (see the sketch below).
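A minimal sketch of that pattern, loosely based on the config posted earlier in the thread (receiver names are reused purely for illustration; this is not the original poster's exact config). The webhook is declared as a matcher-less catch-all subroute listed first, with continue: true so alerts keep traversing the sibling notification routes:

```yaml
route:
  receiver: device-alerts.hook          # default/fallback receiver
  routes:
  # Catch-all: a route without matchers matches every alert, sends it to
  # the webhook, and (because of continue: true) lets it keep matching
  # the sibling routes below.
  - receiver: device-alerts.hook
    continue: true
  # Notification-specific routes only see alerts because the catch-all
  # above sets continue: true.
  - receiver: Test Presence Offline Notification Name
    match_re:
      alertname: ^(Test Presence Offline Alert Name)$
```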
Use amtool to test your routes, as described in the prometheus/alertmanager documentation.
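For example, something along these lines; the config path and label values here are assumptions to adapt to your own setup:

```sh
# Print the routing tree defined in the config file.
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml

# Show which receiver(s) a given label set would be routed to.
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname="Test Presence Offline Alert Name" tenant_name=vle
```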
@rmartinsjr I'm not sure what you mean by sibling routes; all the routes that -should- alert are at the same level, including the one for the webhook, and all routes have continue: true defined, yet I'm still seeing this behavior.
It's also intermittent: some alerts go through and many do not. There's no pattern either; it does not always seem to be the same alerts that make it through to the webhook.
@andrewipmtl, reviewing all posted configurations, I believe you're using the simpler one that simonpasquier posted... With that supposition, are you sure it isn't the group_by that is grouping multiple alerts into one?
@rmartinsjr, yes I'm sure. The example I posted is a simplified version for demonstration. The actual version has a lot more alerts set up, all with continue: true defined as a parameter as well. We have dozens of alerts configured in the same manner. All the alerts have different naming criteria as well as firing criteria.
Have never seen anything like that... Have you tried the routing tree visual tool? https://www.prometheus.io/webtools/alerting/routing-tree-editor/
@rmartinsjr I have never used it before -- but after using it for the first time just now, I get a "tree" map generated where it looks like every single alert branches from a single node which is the device-alerts.hook. So unless I'm wrong -- every single alert should be hitting the webhook.
In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.
In my case the cause was that Alertmanager was matching only alerts that contained the right namespace label. There is an issue about that in https://github.com/prometheus-operator/prometheus-operator/issues/3737
In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.
In my case the cause was that Alertmanager was matching only alerts that contained the right namespace label. There is an issue about that in prometheus-operator/prometheus-operator#3737
@luislhl , by namespace what exactly do you mean? I have no namespaces defined in my config file, is that the issue? I wasn't aware of any namespace matching if none were provided.
@luislhl , by namespace what exactly do you mean? I have no namespaces defined in my config file, is that the issue? I wasn't aware of any namespace matching if none were provided.
Hey, @andrewipmtl
By namespace I mean a Kubernetes namespace, my bad I didn't make it clearer.
I have deployed Alertmanager in a Kubernetes cluster by using the Prometheus Operator.
The final Alertmanager config I get has this matcher to select only alerts containing a namespace label with the value kube-prometheus:
```
global:
  resolve_timeout: 5m
route:
  receiver: "null"
  group_by:
  - job
  routes:
  - receiver: kube-prometheus-slack-alerts-slack-alerts-warning
    group_by:
    - alertname
    matchers:
    - namespace="kube-prometheus"
  [...]
```
I had some alerts from other namespaces that were ignored because of this matcher. The issue I linked in my previous comment has more info about this behavior.
We have a similar issue: some alerts are not posted to the webhook.
And I have a feeling that this is because the alert is resolved within the group_wait interval.
For example, group_wait is set to 30s and the alert lasts just 20s.
Is that possible?
P.S. Alertmanager v0.21.0, send_resolved not specified (supposed to be true by default).
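One way to test that theory (an assumption, not a confirmed fix) would be to temporarily shorten group_wait below the lifetime of the shortest alerts and see whether the missing notifications start arriving; a minimal sketch with an illustrative receiver name:

```yaml
route:
  receiver: webhook      # illustrative receiver name
  group_wait: 5s         # shorter than the ~20s alert lifetime described above
  group_interval: 1m
  repeat_interval: 4h
```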
Same problem: Alertmanager and Prometheus show the alert, but the data is not sent to the webhook:

```
ts=2023-01-08T12:31:28.676Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=InstanceDown[c136526][active]
ts=2023-01-08T12:31:38.677Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}/{}:{job=\"node\"}" msg=flushing alerts=[InstanceDown[c136526][active]]
```
```
global:

receivers:
- name: "n8n"
  webhook_configs:
  - url: https://sample.tld/webhook/alertmanager
    send_resolved: true
    http_config:
      basic_auth:
        username: alertmanager
        password: securePassword
      tls_config:
        insecure_skip_verify: true

route:
  receiver: n8n
  group_by: ['job']
  group_wait: 10s
  group_interval: 4m
  repeat_interval: 2h
  routes:
  - receiver: n8n
    continue: true
```