AlertManager not sending all alerts to Webhook endpoint.
Hi,
I'm using a webhook receiver for AlertManager to store alerts for pagination etc. For the most part, the webhook seems to be working just fine, but for some alerts, the webhook doesn't seem to receive a POST call at all from AlertManager.
Is there any way to troubleshoot this? For example, a way to trace alertmanager's outgoing HTTP calls to the webhook receiver?
The webhook endpoint is a Rails application server which also logs all incoming traffic, and after investigating, the missing alerts never show up in the logs (a POST request is never received).
- What I expect: All alerts go through to the webhook endpoint
- What I see: Only some alerts make it through to the webhook endpoint (Rails application that logs incoming requests)
I've attached a partial configuration, omitting redundant receivers etc. They're almost all the same.
Thanks,
Environment

- System information:

  Linux 4.14.186-146.268.amzn2.x86_64 x86_64

- Alertmanager version:

  ```
  alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user: root@dee35927357f
  build date: 20200617-08:54:02
  go version: go1.14.4
  ```

- Prometheus version:

  ```
  prometheus, version 2.22.0 (branch: HEAD, revision: 0a7fdd3b76960808c3a91d92267c3d815c1bc354)
  build user: root@6321101b2c50
  build date: 20201015-12:29:59
  go version: go1.15.3
  platform: linux/amd64
  ```

- Alertmanager configuration file:
```
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: [email protected]
  smtp_hello: localhost
  smtp_smarthost: smtp.office365.com:587
  smtp_auth_username: [email protected]
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  routes:
  - receiver: Test Presence Offline Notification Name
    match_re:
      alertname: ^(Test Presence Offline Alert Name)$
      group_id: 460599d4-3c4a-4311-a7d6-bdce6058672a
      tenant_name: ^(vle)$
    continue: true
    repeat_interval: 10y
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://127.0.0.1/v1/webhook
    max_alerts: 0
- name: Test Presence Offline Notification Name
  email_configs:
  - send_resolved: false
    to: [email protected]
    from: [email protected]
    hello: localhost
    smarthost: smtp.office365.com:587
    auth_username: [email protected]
    auth_password: <secret>
    headers:
      From: [email protected]
      Smtp_from: [email protected]
      Subject: 'Alert: {{ range .Alerts }}{{ .Labels.device_name }}{{ end }} | {{ range .Alerts }}{{ .Annotations.description }}{{ end }} | {{ range .Alerts }}{{ .Labels.uid }}{{ end }}'
      To: [email protected]
      X-SES-CONFIGURATION-SET: ses-kibana
    html: '{{ template "email.default.html" . }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}Rule: {{ range .Alerts }}{{ .Labels.alertname }}{{ end }}Group: {{ range .Alerts }}{{ .Labels.group_name }}{{ end }}Device Name: {{ range .Alerts }}{{ .Labels.device_name }}{{ end }}Serial Number: {{ range .Alerts }}{{ .Labels.uid }}{{ end }}'
    require_tls: true
templates:
- /etc/alertmanager/templates/default.tmpl
```
Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?
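For anyone reproducing this, a minimal sketch of running Alertmanager with debug logging enabled; the binary invocation and config path are assumptions, adjust them to your own deployment:

```sh
# Assumed config path; --log.level=debug is the flag referenced above.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --log.level=debug
```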
@simonpasquier We can see the alerts in Prometheus as well as in Alertmanager, so the alerts fire properly. On our webhook application side, we've logged everything, and we notice that not every alert that fires in Alertmanager makes its way to our webhook endpoint. We can see the POST requests from Alertmanager to our webhook for some of the alerts, but others are completely missing.
Honestly, the only reason we're using the webhook in the first place is that Alertmanager doesn't support pagination when querying for alerts/groups. So we're using the webhook to receive all alerts/resolutions and storing them ourselves so we can paginate them manually. Our applications and metrics can generate tens of thousands of alerts, which causes requests to Alertmanager to sometimes time out when the payloads are too large.
Your best bet is to turn on debug logs (--log.level=debug). How do you know for sure that notifications are missing?
@simonpasquier I've run Alertmanager with debug logs and can confirm that alerts are received by Alertmanager but not sent to the webhook; the email integration does get sent, though.
Shouldn't all alerts route to the default route (which is set as the webhook)?
```
level=debug ts=2020-10-30T18:03:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:03:05.592Z caller=notify.go:685 component=dispatcher receiver="IP Show Logo Alert Notif Name" integration=email[0] msg="Notify success" attempts=1
level=debug ts=2020-10-30T18:04:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:06:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
level=debug ts=2020-10-30T18:08:04.334Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{alertname=~\"^(?:^(Jesse Logo Showing Alert Name)$)$\",group_id=~\"^(?:2223-4343-34333)$\",tenant_name=~\"^(?:^(test)$)$\"}:{alertname=\"Jesse Logo Showing Alert Name\", group_id=\"2223-4343-34333\", rule_stack=\"dev\", rule_tenant=\"test\", stack_name=\"dev\", tenant_id=\"1\", tenant_name=\"test\", uid=\"TEST-UNIT-001\"}" msg=flushing alerts="[Jesse Logo Showing Alert Name[d34154b][active]]"
level=debug ts=2020-10-30T18:08:34.332Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert="Jesse Logo Showing Alert Name[d34154b][active]"
```
I encounter a similar issue with the same Alertmanager version (0.21), but our Prometheus is on v2.19. In our case, some POST requests are missing even though there are active alerts.
Another issue is that some POST requests seem to have information missing.
For example, if there is an active alert group containing 5 nodes, we receive 2 POST requests. The first one is incomplete because it is missing some nodes:
```
{
  "alerts": [
    {
      ...
      "instance": "node1.demo.com:9100",
      ...
      "instance": "node2.demo.com:9100",
      ...
      "instance": "node5.demo.com:9100",
      ...
      "status": "firing"
    }
  ],
  ...
}
```
And the second POST request is the complete one, with all 5 nodes:
```
{
  "alerts": [
    {
      ...
      "instance": "node1.demo.com:9100",
      ...
      "instance": "node2.demo.com:9100",
      ...
      "instance": "node3.demo.com:9100",
      ...
      "instance": "node4.demo.com:9100",
      ...
      "instance": "node5.demo.com:9100",
      ...
      "status": "firing"
    }
  ],
  ...
}
```
@andrewipmtl
Shouldn't all alerts route to the default route (which is set as the webhook)?
No, alerts that match the Test Presence Offline Notification Name receiver won't go through the top-level route.
@mvineza this seems to be a different problem.
For example, if there is an active alert containing 5 nodes grouped together.
We will receive 2 POST request. The first one is incomplete because it has missing nodes.
You have 5 alerts then and it may be that they are not sent at the same time by Prometheus.
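To illustrate the timing aspect, here is a hypothetical timeline; the group_wait/group_interval values are illustrative, since mvineza's config isn't shown in this thread:

```yaml
# Hypothetical timeline for one aggregation group,
# assuming group_wait: 30s and group_interval: 5m:
#
#   t=0s      alerts for node1, node2, node5 arrive -> a new group is created
#   t=30s     group_wait expires -> first POST contains only those 3 alerts
#   t=45s     alerts for node3, node4 arrive and join the same group
#   t=5m30s   group_interval expires -> second POST contains all 5 alerts
```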
@andrewipmtl
Shouldn't all alerts route to the default route (which is set as the webhook)?
No, alerts that match the Test Presence Offline Notification Name receiver won't go through the top-level route.
Even though it has the continue flag set to 'true'? Is there any way to make all alerts hit the webhook no matter what?
We have a system where we want to store the alerts so that we can paginate them (webhook) but also only send notifications out for specific ones. Even if we configure an email notification for one of the alerts, we still want it to hit the webhook.
Has this problem been solved? I have encountered the same problem: when alerts are grouped, the webhook loses some of them. My configuration is as follows:
Image: quay.io/prometheus/alertmanager:v0.21.0
```
route:
  receiver: webhook
  group_by:
  - alertname
  routes:
  - receiver: webhook
    continue: true
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 4h
receivers:
- name: webhook
  webhook_configs:
  - send_resolved: true
    url: http://os-alertmgt-svc.prometheus-monitoring.svc:3000/api/v1/alert/webhook
templates:
- /etc/alertmanager/config/email.tmpl
```
In the Alertmanager page I can see the following alerts, but after they pass through the webhook I can hardly ever see the complete set:
- alertname="aa": 4 alerts
- alertname="we": 114 alerts
- alertname="wewqd": 171 alerts
I've configured the webhook as another route on top of being the default route, and I'm still seeing some alerts not being sent through to the webhook.
@andrewipmtl can you share the new config?
Hello! I'm running Prometheus 2.22.0 and Alertmanager v0.16.2 for OpenShift platform monitoring and am also observing some messages not being sent to the webhook endpoint. I use only one default route for all messages in Alertmanager. Alertmanager runs in debug mode so I can easily follow all events. At the webhook endpoint I log all events coming from Alertmanager. Here are my findings:
- Alertmanager always resolves messages (when tailing Alertmanager's logs) but does not always send them to the webhook. It looks like around 10-15% of events are NOT POSTed to the webhook.
- The alerts related to pod availability (like TargetDown, KubePodCrashLoop, etc.) seem to be the ones most exposed to the issue. (I mostly use the default alert set from the Prometheus Operator on OpenShift.) Not sure if this observation is correct, since those types of alerts are also the most frequent...
- There are some alerts which are always properly resolved (e.g. my bash script for alert generation, which I ran hundreds of times, never resulted in an unresolved message).
- Not sure if this has something to do with alert grouping. Since my alert volume is low, I recently disabled grouping completely on the Alertmanager side (I set group_by: ['...'], see the sketch below) to see whether it is related to the issue.
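For reference, a minimal sketch of what disabling grouping with the special '...' value looks like; the receiver name is illustrative, not taken from the poster's config:

```yaml
route:
  receiver: webhook   # illustrative receiver name
  # The special value '...' groups by all labels, so every distinct
  # alert label set gets its own group (grouping is effectively disabled).
  group_by: ['...']
```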
@andrewipmtl can you share the new config?
```
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: [email protected]
  smtp_hello: localhost
  smtp_smarthost: smtp.office365.com:587
  smtp_auth_username: [email protected]
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  routes:
  - receiver: device-alerts.hook
    match_re:
      alertname: .*
    continue: true
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: device-alerts.hook
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://127.0.0.1/v1/webhook
    max_alerts: 0
templates:
- /etc/alertmanager/templates/default.tmpl
```
@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?
```
route:
  receiver: device-alerts.hook
  group_by:
  - alertname
  - uid
  - group_id
  - stack_name
  - tenant_id
  - tenant_name
  - rule_stack
  - rule_tenant
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
```
@andrewipmtl hmm not sure why you configured a subroute. AFAICT this would work the same?
I've tried it without subroutes as well, and the webhook still doesn't receive all the alerts; some still go missing.
Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.
Ok not sure why this happens but the only thing I can recommend is to run with --log.level=debug and investigate what happens when no notification is sent while you expect some.
The exact same thing happens as when I tested it in an earlier debug session: https://github.com/prometheus/alertmanager/issues/2404#issuecomment-719715603
Alerts show up, but aren't sent to the webhook endpoint.
same with me
Facing same.
Forgive me if I misunderstood your initial question, but I think y'all didn't get the point.
The default receiver for a route node (including the top-level node) is only used if your alert didn't match any matchers declared at that level of the routing tree. Your alerts enter the routing tree from the top and traverse it down until they match some matcher, and then that node's receiver receives the alert.
If you set "continue: true", the alert will continue matching the sibling routes, meaning that it will try to match another matcher at the same level.
Therefore, if you want your webhook to receive all the alerts, it must be declared properly, in combination with "continue: true", at every level that your alerts match (see the sketch below).
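A minimal sketch of that pattern, loosely based on the config posted earlier in the thread (receiver names are reused purely for illustration; this is not the original poster's exact config). The webhook is declared as a matcher-less catch-all subroute listed first, with continue: true so alerts keep traversing the sibling notification routes:

```yaml
route:
  receiver: device-alerts.hook          # default/fallback receiver
  routes:
  # Catch-all: a route without matchers matches every alert, sends it to
  # the webhook, and (because of continue: true) lets it keep matching
  # the sibling routes below.
  - receiver: device-alerts.hook
    continue: true
  # Notification-specific routes only see alerts because the catch-all
  # above sets continue: true.
  - receiver: Test Presence Offline Notification Name
    match_re:
      alertname: ^(Test Presence Offline Alert Name)$
```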
Use amtool to test your routes, as described in the prometheus/alertmanager documentation.
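For example, something along these lines; the config path and label values here are assumptions to adapt to your own setup:

```sh
# Print the routing tree defined in the config file.
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml

# Show which receiver(s) a given label set would be routed to.
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname="Test Presence Offline Alert Name" tenant_name=vle
```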
@rmartinsjr I'm not sure what you mean by sibling routes; all the routes that -should- alert are at the same level, including the one for the webhook, and all routes have continue: true defined, yet I'm still seeing this behavior.
It's also intermittent: some alerts go through and many do not. There's no pattern either; it does not always seem to be the same alerts that make it through to the webhook.
@andrewipmtl, reviewing all posted configurations, I believe you're using the simpler one that simonpasquier posted... With that supposition, are you sure it isn't the group_by that is grouping multiple alerts into one?
@rmartinsjr, yes I'm sure. The example I posted is a simplified version for demonstration. The actual version has a lot more alerts set up, all with continue: true defined as a parameter as well. We have dozens of alerts configured in the same manner. All the alerts have different naming criteria as well as firing criteria.
Have never seen anything like that... Have you tried the routing tree visual tool? https://www.prometheus.io/webtools/alerting/routing-tree-editor/
@rmartinsjr I have never used it before -- but after using it for the first time just now, I get a "tree" map generated where it looks like every single alert branches from a single node which is the device-alerts.hook. So unless I'm wrong -- every single alert should be hitting the webhook.
In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.
In my case the cause was that Alertmanager was matching only alerts that contained the right namespace label. There is an issue about that in https://github.com/prometheus-operator/prometheus-operator/issues/3737
In case this helps anyone, I was running AlertManager through prometheus-operator, and I experienced the exact same problem.
In my case the cause was that Alertmanager was matching only alerts that contained the right namespace label. There is an issue about that in prometheus-operator/prometheus-operator#3737
@luislhl , by namespace what exactly do you mean? I have no namespaces defined in my config file, is that the issue? I wasn't aware of any namespace matching if none were provided.
@luislhl , by namespace what exactly do you mean? I have no namespaces defined in my config file, is that the issue? I wasn't aware of any namespace matching if none were provided.
Hey, @andrewipmtl
By namespace I mean a Kubernetes namespace, my bad I didn't make it clearer.
I have deployed Alertmanager in a Kubernetes cluster by using the Prometheus Operator.
The final Alertmanager config I get has this matcher to select only alerts containing a namespace label with the value kube-prometheus:
```
global:
  resolve_timeout: 5m
route:
  receiver: "null"
  group_by:
  - job
  routes:
  - receiver: kube-prometheus-slack-alerts-slack-alerts-warning
    group_by:
    - alertname
    matchers:
    - namespace="kube-prometheus"
  [...]
```
I had some alerts from other namespaces that were ignored because of this matcher. The issue I linked in my previous comment has more info about this behavior.
We have a similar issue: some alerts are not posted to the webhook.
And I have a feeling that this is because the alert is resolved within the group_wait interval.
For example, group_wait is set to 30s and the alert lasts just 20s.
Is that possible?
P.S. Alertmanager v0.21.0, send_resolved not specified (supposed to be true by default).
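One way to test that theory (an assumption, not a confirmed fix) would be to temporarily shorten group_wait below the lifetime of the shortest alerts and see whether the missing notifications start arriving; a minimal sketch with an illustrative receiver name:

```yaml
route:
  receiver: webhook      # illustrative receiver name
  group_wait: 5s         # shorter than the ~20s alert lifetime described above
  group_interval: 1m
  repeat_interval: 4h
```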
Same problem: Alertmanager and Prometheus show the alert, but the data is not sent to the webhook:

```
ts=2023-01-08T12:31:28.676Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=InstanceDown[c136526][active]
ts=2023-01-08T12:31:38.677Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}/{}:{job=\"node\"}" msg=flushing alerts=[InstanceDown[c136526][active]]
```
```
global:

receivers:
- name: "n8n"
  webhook_configs:
  - url: https://sample.tld/webhook/alertmanager
    send_resolved: true
    http_config:
      basic_auth:
        username: alertmanager
        password: securePassword
      tls_config:
        insecure_skip_verify: true

route:
  receiver: n8n
  group_by: ['job']
  group_wait: 10s
  group_interval: 4m
  repeat_interval: 2h
  routes:
  - receiver: n8n
    continue: true
```