
Duplicate notifications after upgrading to Alertmanager 0.15

Open tanji opened this issue 7 years ago • 62 comments

After upgrading from Alertmanager 0.13 to 0.15.2 in a cluster of two members, we've started receiving double notifications in Slack. It used to work flawlessly with 0.13. Weirdly, we're receiving the two notifications at practically the same time; they don't seem to be more than a couple of seconds apart.

  • System information:

Linux pmm-server 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux

Both instances use NTP.

  • Alertmanager version:

alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3bae451940b8470abb680cfdd59bfa7cfa) build user: root@3101e5b68a55 build date: 20180814-10:53:39 go version: go1.10.3

Cluster status reports up:

Status
  Uptime: 2018-09-09T19:03:01.726517546Z
Cluster Status
  Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
  Status: ready
  Peers:
    Name: 01CPZV0HDRQY5M5TW6FDS31MKS
    Address: :9094
    Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
    Address: :9094

  • Prometheus version:

Irrelevant

  • Alertmanager configuration file:
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  slack_api_url: <secret>
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: ops-slack
  group_by:
  - alertname
  - group
  routes:
  - receiver: ops-pager
    match:
      alertname: MySQLDown
  - receiver: ops-pager
    match:
      offline: critical
    continue: true
    routes:
    - receiver: ops-slack
      match:
        offline: critical
  - receiver: ops-pager
    match:
      pager: "yes"
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: ops-slack
  slack_configs:
  - send_resolved: true
    http_config: {}
    api_url: <secret>
    channel: alerts
    username: prometheus
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
      | len }}{{ end }}] {{ .CommonAnnotations.summary }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    pretext: '{{ template "slack.default.pretext" . }}'
    text: |-
      {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }} - *{{ .Labels.severity | toUpper }}* on {{ .Labels.instance }}
        *Description:* {{ .Annotations.description }}
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
    footer: '{{ template "slack.default.footer" . }}'
    fallback: '{{ template "slack.default.fallback" . }}'
    icon_emoji: '{{ template "slack.default.iconemoji" . }}'
    icon_url: http://cdn.rancher.com/wp-content/uploads/2015/05/27094511/prometheus-logo-square.png
  • Logs: no errors to speak of
level=info ts=2018-09-06T11:10:16.242620478Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-06T11:10:16.242654842Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-06T11:10:16.313588161Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-06T11:10:16.313610447Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-06T11:10:16.315607053Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-06T11:10:18.313944578Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000297466s
level=info ts=2018-09-06T11:10:22.314297199Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=2 now=1 elapsed=6.000647448s
level=info ts=2018-09-06T11:10:30.315059802Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=14.001414171s
level=info ts=2018-09-09T18:55:23.930653016Z caller=main.go:426 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2018-09-09T18:55:25.067197275Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-09T18:55:25.067233709Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-09T18:55:25.128486689Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-09T18:55:25.128488742Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-09T18:55:25.131985874Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-09T18:55:27.128662897Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.000096829s
level=info ts=2018-09-09T18:55:31.128969079Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=3 now=2 elapsed=6.000402722s
level=info ts=2018-09-09T18:55:33.129130021Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=3 before=2 now=1 elapsed=8.000564176s
level=info ts=2018-09-09T18:55:37.129427658Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=5 before=1 now=2 elapsed=12.000855483s
level=info ts=2018-09-09T18:55:45.130073007Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=20.001506309s

tanji avatar Sep 14 '18 12:09 tanji

Are these double notifications happening consistently?

There's no consensus between Alertmanagers -- if they receive the initial alert from a Prometheus server at different times, the alert groups created in the different Alertmanagers can be out of sync by, e.g., a single evaluation interval. If your evaluation interval is 15s and --cluster.peer-timeout is 15s (the default), they could end up sending their notifications at exactly the same time.
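
For reference, a minimal sketch of the two knobs in play; the values shown are just the defaults mentioned above, not a recommendation:

# prometheus.yml -- how often alerting rules are evaluated, which bounds how far
# apart the two Alertmanagers can first see a given alert
global:
  evaluation_interval: 15s

# alertmanager flag -- roughly, how long a non-leading peer waits to see the
# leading peer's notification log entry before sending the notification itself
--cluster.peer-timeout=15s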

stuartnelson3 avatar Sep 14 '18 13:09 stuartnelson3

Yes, they're quite consistent. What do you mean by evaluation interval? Is it tunable? Do you recommend increasing peer-timeout?

tanji avatar Sep 18 '18 06:09 tanji

Your logs are indicating some weird behavior:

level=info ts=2018-09-09T18:55:27.128662897Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.000096829s
level=info ts=2018-09-09T18:55:31.128969079Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=3 now=2 elapsed=6.000402722s
level=info ts=2018-09-09T18:55:33.129130021Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=3 before=2 now=1 elapsed=8.000564176s
level=info ts=2018-09-09T18:55:37.129427658Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=5 before=1 now=2 elapsed=12.000855483s

now indicates the number of members in the cluster. In the first two seconds your instance connects to two other instances (now=3, three members total); at 6 seconds there are only two members in the cluster, at 8 seconds there is just the single node, and then it returns to 2 nodes. Something about your setup looks off: you state there are only 2 nodes, but the logs show 3 at one point, and the connection between them appears somewhat unstable.

How/where are these alertmanagers deployed?

stuartnelson3 avatar Sep 18 '18 08:09 stuartnelson3

That only happens when starting Alertmanager; no messages appear after this. Those Alertmanagers are deployed in the cloud and are pretty close to each other. Of note, we deploy with a Docker image. Here are the configs and startup parameters:

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - 9093:9093
      - 9094:9094
    volumes:
      - alertmanager_data:/alertmanager
      - /etc/alertmanager:/etc/alertmanager
    restart: always
    command:
      - --config.file=/etc/alertmanager/config.yml
      - --storage.path=/alertmanager
      - --web.external-url=http://pmm-server-ec2:9093
      - --cluster.peer=pmm-server:9094
      - --cluster.advertise-address=(external IP of the EC2 VM)

Second AM

    image: prom/alertmanager:latest
    ports:
      - 127.0.0.1:9093:9093
      - 9094:9094
    volumes:
      - alertmanager_data:/alertmanager
      - /etc/alertmanager:/etc/alertmanager
    restart: always
    command:
      - --config.file=/etc/alertmanager/config.yml
      - --storage.path=/alertmanager
      - --web.external-url=https://alertmanager
      - --cluster.peer=pmm-server-ec2:9094
      - --cluster.advertise-address=(external IP of the server):9094

Nothing rocket science here: this setup (with the old mesh protocol) worked without duplicates until I upgraded.

Re: the 3 nodes, could it be that it's also trying to connect to itself?

tanji avatar Sep 18 '18 09:09 tanji

The connection logs are only written during start-up; they aren't logged if the connection flaps. In the initial cluster connection code, a resolved IP address that equals the instance's own IP address is removed from the initial list of instances to connect to (so I'm still curious about this third member).

Can you check the following Alertmanager metrics, which your Prometheus should be scraping? (A sample scrape job is sketched after the list.)

  • alertmanager_peer_position - each node should have a single, stable value
  • alertmanager_cluster_members - this shouldn't be flapping between different values
  • alertmanager_cluster_failed_peers - ideally this should be zero, or only VERY briefly a non-zero number
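
If they aren't scraped yet, a minimal scrape job for the two Alertmanagers could look like this (the target hostnames are taken from the compose snippets above and may need adjusting):

scrape_configs:
  - job_name: alertmanager
    static_configs:
      - targets:
          - pmm-server:9093
          - pmm-server-ec2:9093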

stuartnelson3 avatar Sep 18 '18 09:09 stuartnelson3

We don't scrape those; I'll fix that and look at the metrics.

tanji avatar Sep 18 '18 11:09 tanji

There's something wrong indeed. One node is OK and always sees the other one; the other AM sees its peer flapping all the time and the cluster size going from 2 to 1. Is it possible to print more debug output?

tanji avatar Sep 18 '18 11:09 tanji

--log.level=debug will output more logs
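
In the compose files above that would just be an extra entry in the command list, e.g.:

command:
  - --config.file=/etc/alertmanager/config.yml
  - --storage.path=/alertmanager
  - --log.level=debug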

stuartnelson3 avatar Sep 18 '18 11:09 stuartnelson3

OK, I found it: the 2nd node wasn't announcing itself on the correct address; it used the Amazon internal IP instead of the external one :( It should work better now. Of note, I'm getting these warnings:

alertmanager_1  | level=debug ts=2018-09-18T12:01:34.998491475Z caller=cluster.go:287 component=cluster memberlist="2018/09/18 12:01:34 [WARN] memberlist: Was able to connect to 01CQP8SVW787P0JEVVFEEM33SG but other probes failed, network may be misconfigured\n"

Is ICMP necessary as part of the protocol? I can enable it on AWS; it's disabled by default.

tanji avatar Sep 18 '18 12:09 tanji

The ping thingy doesn't seem to play well with Docker networks:

alertmanager_1  | level=debug ts=2018-09-18T12:52:01.276047845Z caller=cluster.go:287 component=cluster memberlist="2018/09/18 12:52:01 [WARN] memberlist: Got ping for unexpected node 01CQP8VVB33XSGRCWM3S7EJGN7 from=172.18.0.1:48699\n"

That node advertises itself on the external IP, so this shouldn't be considered an unexpected ping if the source is the Docker network gateway IP.

tanji avatar Sep 18 '18 12:09 tanji

Is ICMP necessary as part of the protocol?

I believe only UDP and TCP are used.

That node advertises itself on the external IP, so you shouldn't consider this an unexpected ping if the source is the docker network gateway IP

The connection is made using the resolved address from --cluster.peer; if an unrecognized IP address "checks in", the underlying library, memberlist, doesn't like that -- the node has to join the cluster first.

stuartnelson3 avatar Sep 18 '18 15:09 stuartnelson3

Are the two machines in the same VPC? Do they advertise themselves via the external IP, but communicate via the internal IP?

mxinden avatar Sep 19 '18 08:09 mxinden

OK, after changing it we're still having duplicate issues; for some reason they now happen at larger intervals. @mxinden the 1st machine is in AWS, the 2nd is at a bare-metal provider; they communicate over the internet (without problems previously, as I noted).

tanji avatar Sep 19 '18 10:09 tanji

@tanji are the clustering metrics mentioned above still flaky, or have they stabilized? In the latter case, do you have access to the notification payloads, and can you post them here?

mxinden avatar Sep 19 '18 11:09 mxinden

The primary form of gossip between the nodes is done over UDP, which might be getting lost between datacenters.

stuartnelson3 avatar Sep 19 '18 12:09 stuartnelson3

Yes, the metrics have been stable. What do you mean by notification payloads?

tanji avatar Sep 19 '18 13:09 tanji

I have the same issue: after upgrading two Alertmanagers to version 0.15.2, we're receiving duplicate alerts.

Notable config:

group_wait: 30s
--cluster.peer-timeout=1m

Tuning --cluster.peer-timeout to 15s, 30s, or 1m does not help in any way.

Debug log shows this:

caller=cluster.go:287 component=cluster memberlist="2018/09/21 07:21:03 [INFO] memberlist: Marking 01CQXFW2PX58MBA1KVDFHTAACN as failed, suspect timeout reached (0 peer confirmations)\n"
caller=delegate.go:215 component=cluster received=NotifyLeave node=01CQXFW2PX58MBA1KVDFHTAACN addr=xxx.xx.x.xx:9094
caller=cluster.go:439 component=cluster msg="peer left" peer=01CQXFW2PX58MBA1KVDFHTAACN
caller=cluster.go:287 component=cluster memberlist="2018/09/21 07:21:04 [DEBUG] memberlist: Initiating push/pull sync with: xx.x.xx.xx:9094\n"
caller=cluster.go:389 component=cluster msg=reconnect result=success peer= addr=xx.x.xx.xx:9094

I wonder if this could be related to the Alertmanagers running in Docker containers with the flag --cluster.listen-address=0.0.0.0:9094; --cluster.peer= is set to the IP addresses of the machines the containers run on, but Alertmanager shows the Docker-internal IPs. Prior to the upgrade, though, everything was fine.
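
For what it's worth, pinning the advertised address explicitly in a compose file would look roughly like this (the IP values are placeholders for the routable host addresses, not something from this setup):

command:
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.advertise-address=<host external IP>:9094
  - --cluster.peer=<other host IP>:9094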

Some graphs: [screenshot, 2018-09-21 09:34]

[screenshot, 2018-09-21 09:35]

apsega avatar Sep 21 '18 07:09 apsega

Seems like tuning --cluster.probe-timeout up to 10s does not help.

apsega avatar Sep 21 '18 07:09 apsega

What do you mean by notification payloads?

@tanji sorry for not introducing the terms first. We generally refer to an alert as the request sent by Prometheus to Alertmanager, and a notification as the request sent by Alertmanager to e.g. Slack. Do you have access to the payloads of two duplicate notifications sent by Alertmanager to Slack?


@apsega which Alertmanager version were you running before? v0.15.1?

mxinden avatar Sep 21 '18 07:09 mxinden

@mxinden actually very old release, something like v0.8.x

apsega avatar Sep 21 '18 07:09 apsega

@mxinden do you mean the JSON payload? Unfortunately I am not sure how to access it. Is it logged anywhere?

On the text side, the outputs are strictly identical.

tanji avatar Sep 21 '18 07:09 tanji

It seems downgrading to v0.14.0 solves the issue. I tried downgrading to v0.15.1 and v0.15.0 with no luck, so the issue only occurs from v0.15.0 onwards.

apsega avatar Sep 21 '18 08:09 apsega

@apsega your cluster isn't stable, hence the duplicate messages. Once your cluster stops flapping, it should stop sending duplicate messages.

I would guess this has something to do with your setup running in Docker containers.

stuartnelson3 avatar Sep 21 '18 08:09 stuartnelson3

Well, downgrading to v0.14.0 made it stable:

[screenshot, 2018-09-21 12:00]

apsega avatar Sep 21 '18 09:09 apsega

@apsega 0.14 and 0.15 use different libraries for clustering, which probably explains why the behaviors are different. You can try --log.level=debug to get more details, but again, your question would be better answered on the Prometheus users mailing list than here.

simonpasquier avatar Sep 21 '18 09:09 simonpasquier

This is still an issue in 2019. Can you let me know how to access the payloads?

tanji avatar Mar 05 '19 07:03 tanji

Hi, I am facing a similar issue with a webhook receiver, while my other receiver, Slack, seems to work fine. The only difference between the two is group_interval and repeat_interval, which are lower for the webhook receiver.

Deepak1100 avatar Mar 23 '19 06:03 Deepak1100

[screenshots: duplicate-notification graphs and peer positions, 2019-03-28 2:01-2:02 PM]

I work with @Deepak1100; above are the duplicate-notification graphs and peer positions over a period of 7 days.

Hashfyre avatar Mar 28 '19 08:03 Hashfyre

What is the latency when sending requests to the webhook? Is it in the same DC? What are your group_interval and repeat_interval values for the webhook?

The number of duplicated requests decreases across the Alertmanagers, i.e., am0 sends all messages, am1 duplicates some of them, and am2 duplicates fewer still. This suggests to me that the request in am0 might be taking longer than the peer timeout plus the nflog propagation time; a latency check is sketched below.
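
One way to check that, assuming the standard Alertmanager metrics are scraped, is to compare per-integration notification latency against the peer timeout; a sketch of a recording rule (the rule name is made up):

groups:
  - name: alertmanager-notification-latency
    rules:
      # average notification latency per integration over the last 5 minutes
      - record: integration:alertmanager_notification_latency_seconds:avg5m
        expr: >
          rate(alertmanager_notification_latency_seconds_sum[5m])
          /
          rate(alertmanager_notification_latency_seconds_count[5m])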

stuartnelson3 avatar Mar 28 '19 12:03 stuartnelson3

I was having a similar problem when running Alertmanager inside a Docker container; I had to:

  • Modify the Docker container to listen on port 9094/UDP (see the sketch after this list).
  • Add firewall rules allowing 9094/UDP connectivity.
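
A compose-file sketch of the first change; the UDP mapping is the part that is easy to miss (exact firewall syntax depends on the environment):

ports:
  - 9093:9093         # web / API
  - 9094:9094         # cluster gossip, TCP
  - 9094:9094/udp     # cluster gossip, UDP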

carlosflorencio avatar Apr 08 '19 16:04 carlosflorencio