alertmanager

Some alerts going to a webhook crash alertmanager - panic: runtime error: invalid memory address or nil pointer dereference

Open • aned opened this issue 1 year ago • 4 comments

  • What did you see instead? Under which circumstances?
Some alerts sent to a webhook receiver crash Alertmanager with the panic shown in the logs below.

  • Alertmanager version:
alertmanager, version 0.27.0 (branch: HEAD, revision: 0aa3c2aad14cff039931923ab16b26b7481783b5)
  build user:       root@22cd11f671e9
  build date:       20240228-11:51:20
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo
  • Prometheus version:
Version	2.50.1
Revision	8c9b0285360a0b6288d76214a75ce3025bce4050
Branch	HEAD
BuildUser	root@6213bb3ee580
BuildDate	20240226-11:36:26
GoVersion	go1.21.7
  • Logs:
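sudo journalctl -u alertmanager -r | more
(-r lists the newest entries first, so each stack trace below reads bottom-up, starting from its "panic:" line.)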
Apr 03 22:04:10 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:04:10.827Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.00361905s
Apr 03 22:04:02 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:04:02.825Z caller=cluster.go:708 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.001042721s
Apr 03 22:04:00 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:04:00.871Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9999
Apr 03 22:04:00 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:04:00.871Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9999
Apr 03 22:04:00 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:04:00.864Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/export/apps/prometheus/etc/Conf_Sync/sdi-infra-prometheus/configs_and_rules/alertmanager/alertmanager-config.yml
Apr 03 22:04:00 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:04:00.858Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/export/apps/prometheus/etc/Conf_Sync/sdi-infra-prometheus/configs_and_rules/alertmanager/alertmanager-config.yml
Apr 03 22:04:00 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:04:00.824Z caller=cluster.go:683 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
Apr 03 22:03:59 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:03:59.382Z caller=cluster.go:186 level=info component=cluster msg="setting advertise address explicitly" addr=10.187.4.247 port=9094
Apr 03 22:03:59 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:03:59.377Z caller=featurecontrol.go:94 level=warn msg="Experimental receiver name in metrics enabled"
Apr 03 22:03:59 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:03:59.377Z caller=main.go:182 level=info build_context="(go=go1.21.7, platform=linux/amd64, user=root@22cd11f671e9, date=20240228-11:51:20, tags=netgo)"
Apr 03 22:03:59 abc1-wer2323.prod.host.blah alertmanager[3823174]: ts=2024-04-03T22:03:59.377Z caller=main.go:181 level=info msg="Starting Alertmanager" version="(version=0.27.0, branch=HEAD, revision=0aa3c2aad14cff039931923ab16b26b7481783b5)"
Apr 03 22:03:59 abc1-wer2323.prod.host.blah systemd[1]: Started Alertmanager Server.
Apr 03 22:03:59 abc1-wer2323.prod.host.blah systemd[1]: Stopped Alertmanager Server.
Apr 03 22:03:59 abc1-wer2323.prod.host.blah systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 102.
Apr 03 22:03:54 abc1-wer2323.prod.host.blah systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Apr 03 22:03:54 abc1-wer2323.prod.host.blah systemd[1]: alertmanager.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/notify.go:482 +0x9d
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: created by github.com/prometheus/alertmanager/notify.FanoutStage.Exec in goroutine 411
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/notify.go:483 +0x53
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify.FanoutStage.Exec.func1({0x1658320?, 0xc0009f92d8?})
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/notify.go:461 +0xd5
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify.MultiStage.Exec({0xc00086fb80?, 0x4, 0x14?}, {0x1661110?, 0xc001942480?}, {0x1657c00, 0xc000875680}, {0xc0008f2000, 0x81, 0x100})
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/notify.go:760 +0x110
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify.RetryStage.Exec({{{0x1657ea0, 0xc000a4b440}, {0x1657e80, 0xc0007ceff0}, {0x10e0c18, 0x7}, 0x0, {0xc0009e60c0, 0xb}}, {0xc0009e60c0, ...}, ...}, ...)
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/notify.go:836 +0x666
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify.RetryStage.exec({{{0x1657ea0, 0xc000a4b440}, {0x1657e80, 0xc0007ceff0}, {0x10e0c18, 0x7}, 0x0, {0xc0009e60c0, 0xb}}, {0xc0009e60c0, ...}, ...}, ...)
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/notify.go:85
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify.(*Integration).Notify(...)
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/webhook/webhook.go:123 +0x527
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify/webhook.(*Notifier).Notify(0xc000a4b440, {0x1661110, 0xc001c52090}, {0xc0009a2000?, 0x2?, 0xc000ce8001?})
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/util.go:239 +0x116
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify.(*Retrier).Check(0x1661110?, 0xc001c52090?, {0x1657220, 0xc000b9d600})
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]:         /app/notify/webhook/webhook.go:60 +0x20
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: github.com/prometheus/alertmanager/notify/webhook.New.func1(0x10f7d7e?, {0x1657220, 0xc000b9d600})
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: goroutine 1399 [running]:
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe73160]
Apr 03 22:03:54 abc1-wer2323.prod.host.blah alertmanager[3822203]: panic: runtime error: invalid memory address or nil pointer dereference
Apr 03 22:00:41 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:41.097Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002992536s
Apr 03 22:00:33 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:33.094Z caller=cluster.go:708 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.000169786s
Apr 03 22:00:31 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:31.144Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9999
Apr 03 22:00:31 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:31.144Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9999
Apr 03 22:00:31 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:31.139Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/export/apps/prometheus/etc/Conf_Sync/sdi-infra-prometheus/configs_and_rules/alertmanager/alertmanager-config.yml
Apr 03 22:00:31 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:31.133Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/export/apps/prometheus/etc/Conf_Sync/sdi-infra-prometheus/configs_and_rules/alertmanager/alertmanager-config.yml
Apr 03 22:00:31 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:31.094Z caller=cluster.go:683 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
Apr 03 22:00:30 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:30.126Z caller=cluster.go:186 level=info component=cluster msg="setting advertise address explicitly" addr=10.187.4.247 port=9094
Apr 03 22:00:30 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:30.122Z caller=featurecontrol.go:94 level=warn msg="Experimental receiver name in metrics enabled"
Apr 03 22:00:30 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:30.122Z caller=main.go:182 level=info build_context="(go=go1.21.7, platform=linux/amd64, user=root@22cd11f671e9, date=20240228-11:51:20, tags=netgo)"
Apr 03 22:00:30 abc1-wer2323.prod.host.blah alertmanager[3822203]: ts=2024-04-03T22:00:30.122Z caller=main.go:181 level=info msg="Starting Alertmanager" version="(version=0.27.0, branch=HEAD, revision=0aa3c2aad14cff039931923ab16b26b7481783b5)"
Apr 03 22:00:30 abc1-wer2323.prod.host.blah systemd[1]: Started Alertmanager Server.
Apr 03 22:00:30 abc1-wer2323.prod.host.blah systemd[1]: Stopped Alertmanager Server.
Apr 03 22:00:30 abc1-wer2323.prod.host.blah systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 101.
Apr 03 22:00:24 abc1-wer2323.prod.host.blah systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Apr 03 22:00:24 abc1-wer2323.prod.host.blah systemd[1]: alertmanager.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 03 22:00:24 abc1-wer2323.prod.host.blah alertmanager[3820656]:         /app/notify/notify.go:482 +0x9d
Apr 03 22:00:24 abc1-wer2323.prod.host.blah alertmanager[3820656]: created by github.com/prometheus/alertmanager/notify.FanoutStage.Exec in goroutine 346
Apr 03 22:00:24 abc1-wer2323.prod.host.blah alertmanager[3820656]:         /app/notify/notify.go:483 +0x53
Apr 03 22:00:24 abc1-wer2323.prod.host.blah alertmanager[3820656]: github.com/prometheus/alertmanager/notify.FanoutStage.Exec.func1({0x1658320?, 0xc000b0c6f0?})
Apr 03 22:00:24 abc1-wer2323.prod.host.blah alertmanager[3820656]:         /app/notify/notify.go:461 +0xd5

aned • Apr 03 '24 22:04

Any chance you have the configuration around?

zecke • Apr 04 '24 13:04

All webhook-related configuration follows this pattern:

- name: 'name_1'
  webhook_configs:
  - url: 'https://url.com:8028/api/v1/alert'
    http_config:
      tls_config:
        insecure_skip_verify: true
    send_resolved: false

- name: 'name_2'
  webhook_configs:
  - url_file: '/export/apps/alertmanager/path'
    send_resolved: false
    max_alerts: 20

aned • Apr 04 '24 21:04

A fix for this has just been merged, thanks @zecke! 👍

grobinson-grafana • Apr 12 '24 12:04

The fix is here: https://github.com/prometheus/alertmanager/pull/3800. Please close the issue 🙂

grobinson-grafana • Apr 16 '24 14:04