duplicate repeated notifications

Open brancz opened this issue 6 years ago • 9 comments

While developing some tooling to load test Alertmanager HA clusters, I believe I have found a problem with repeated notifications.

The tooling I've built captures the alerts it fires and the resulting notifications as events.

Alert-fired events start the line with ALERTS, followed by the timestamp of the event and the Alertmanager instance it was fired against, and finish with a list of alerts represented by their hashes.

Notification-received events start the line with NOTIFICATION, followed by the timestamp, the Alertmanager that sent the notification, the group key, and a hash of all alerts in the notification, and finish with a list of all alerts that are part of the notification.
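To make the two line formats concrete, here is a minimal parsing sketch in Python. It is not the tooling itself; the `Event` type and `parse_event` function are hypothetical names, and it assumes fields are separated by whitespace and that the group key contains no spaces (true for the output shown in this issue, not necessarily in general).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str          # "ALERTS" or "NOTIFICATION"
    timestamp: str     # timestamp of the event, kept as a string
    instance: str      # Alertmanager URL the event relates to
    group_key: str     # empty for ALERTS events
    group_hash: str    # hash of all alerts in a notification; empty for ALERTS
    alerts: List[str]  # individual alert hashes


def parse_event(line: str) -> Event:
    """Split one captured line into its fields (assumed whitespace-separated)."""
    fields = line.split()
    kind, timestamp, instance = fields[0], fields[1], fields[2]
    if kind == "ALERTS":
        return Event(kind, timestamp, instance, "", "", fields[3:])
    if kind == "NOTIFICATION":
        return Event(kind, timestamp, instance, fields[3], fields[4], fields[5:])
    raise ValueError(f"unknown event kind: {kind!r}")
```

For a NOTIFICATION line, the fourth field is the group key, the fifth is the hash over all alerts, and everything after that is the list of alert hashes.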

The Alertmanager configuration used is:

global:
  resolve_timeout: 60m

route:
  group_by: ['__name__']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 10s
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:8080/notify'
    send_resolved: false

The "alerts" I'm sending are just label sets that I've parsed out of an anonymized Prometheus metrics output, primarily because the dataset already existed and is large. That is also why I chose to group by __name__.

The test that I ran looks like this: a single worker sends a single alert every second and switches to the next alert every 5 seconds. This runs for 1 minute, plus 10 seconds to capture the remaining notifications generated from the last alerts fired.
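As a rough illustration of that schedule (not the actual worker code), the sketch below computes which alert is fired at each one-second tick. The function name `alert_schedule` and its parameters are made up for this example; the 60-second duration and 5-second switch come from the description above.

```python
def alert_schedule(duration_s: int = 60, switch_every_s: int = 5) -> list:
    """Return, for each one-second tick, the index of the alert that is fired.

    The alert index advances every `switch_every_s` seconds, so with the
    defaults the worker cycles through 12 distinct alerts in one minute.
    """
    return [tick // switch_every_s for tick in range(duration_s)]


schedule = alert_schedule()
distinct_alerts = len(set(schedule))
```

With the defaults this gives 12 distinct alerts, each fired 5 times, which matches the 12 different alert hashes visible in the output below.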

The result was the following (in order to be able to reference line numbers I've also created a gist):

ALERTS       2017-09-28T12:10:37.798194207Z http://localhost:9093/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:37.799183394Z http://localhost:9094/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:37.800069114Z http://localhost:9095/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:38.797270816Z http://localhost:9093/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:38.798020242Z http://localhost:9094/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:38.800534441Z http://localhost:9095/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:39.800521846Z http://localhost:9093/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:39.801290913Z http://localhost:9094/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:39.802074350Z http://localhost:9095/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:40.797564108Z http://localhost:9093/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:40.798459262Z http://localhost:9094/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:40.799239927Z http://localhost:9095/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:41.797429986Z http://localhost:9093/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:41.798258191Z http://localhost:9094/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:41.799001280Z http://localhost:9095/api/v1/alerts  1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:42.797232675Z http://localhost:9093/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:42.797934746Z http://localhost:9094/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:42.798619634Z http://localhost:9095/api/v1/alerts  7e7f12e3cdafe518
NOTIFICATION 2017-09-28T12:10:42.800983970Z http://localhost.localdomain:9095 {}:{__name__="oiwwpnxbzvnglxqfmmgydouluripxyalq"} 1aed0cc371e7fbcf 1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:43.797446311Z http://localhost:9093/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:43.798220343Z http://localhost:9094/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:43.798836356Z http://localhost:9095/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:44.798056203Z http://localhost:9093/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:44.799502208Z http://localhost:9094/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:44.800750221Z http://localhost:9095/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:45.798191810Z http://localhost:9093/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:45.799568964Z http://localhost:9094/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:45.800585195Z http://localhost:9095/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:46.797094850Z http://localhost:9093/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:46.797996715Z http://localhost:9094/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:46.798959840Z http://localhost:9095/api/v1/alerts  7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:47.797074854Z http://localhost:9093/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:47.797671370Z http://localhost:9094/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:47.798175166Z http://localhost:9095/api/v1/alerts  bd89880863e2a021
NOTIFICATION 2017-09-28T12:10:47.799082461Z http://localhost.localdomain:9095 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 7e7f12e3cdafe518 7e7f12e3cdafe518
ALERTS       2017-09-28T12:10:48.797726244Z http://localhost:9093/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:48.798648053Z http://localhost:9094/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:48.799661829Z http://localhost:9095/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:49.797847640Z http://localhost:9093/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:49.798974507Z http://localhost:9094/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:49.800244041Z http://localhost:9095/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:50.797584380Z http://localhost:9093/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:50.798515603Z http://localhost:9094/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:50.799385669Z http://localhost:9095/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:51.797519306Z http://localhost:9093/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:51.798570709Z http://localhost:9094/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:51.799592098Z http://localhost:9095/api/v1/alerts  bd89880863e2a021
ALERTS       2017-09-28T12:10:52.798130921Z http://localhost:9093/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:52.799423295Z http://localhost:9094/api/v1/alerts  c7244ae83ae3fea1
NOTIFICATION 2017-09-28T12:10:52.799699291Z http://localhost.localdomain:9095 {}:{__name__="viljjpdisdmychdciattjgryfsxgkinrxuwzkplzqvzydyod"} bd89880863e2a021 bd89880863e2a021
ALERTS       2017-09-28T12:10:52.800667527Z http://localhost:9095/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:53.797671303Z http://localhost:9093/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:53.798821237Z http://localhost:9094/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:53.799933295Z http://localhost:9095/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:54.797521231Z http://localhost:9093/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:54.798485334Z http://localhost:9094/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:54.799259402Z http://localhost:9095/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:55.798953325Z http://localhost:9093/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:55.800176411Z http://localhost:9094/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:55.801245167Z http://localhost:9095/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:56.797100473Z http://localhost:9093/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:56.797971844Z http://localhost:9094/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:56.798780060Z http://localhost:9095/api/v1/alerts  c7244ae83ae3fea1
ALERTS       2017-09-28T12:10:57.798235230Z http://localhost:9093/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:10:57.799724609Z http://localhost:9094/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:10:57.801189828Z http://localhost:9095/api/v1/alerts  52fb295b6100a4e8
NOTIFICATION 2017-09-28T12:10:57.801620247Z http://localhost.localdomain:9095 {}:{__name__="oiwwpnxbzvnglxqfmmgydouluripxyalq"} 1aed0cc371e7fbcf 1aed0cc371e7fbcf
NOTIFICATION 2017-09-28T12:10:57.802131137Z http://localhost.localdomain:9095 {}:{__name__="effdxjecmjwlwywayerjlkbuuzqivrpucvqgqkwoqvnfgxvccl"} c7244ae83ae3fea1 c7244ae83ae3fea1
NOTIFICATION 2017-09-28T12:10:57.803794835Z http://localhost.localdomain:9093 {}:{__name__="oiwwpnxbzvnglxqfmmgydouluripxyalq"} 1aed0cc371e7fbcf 1aed0cc371e7fbcf
ALERTS       2017-09-28T12:10:58.797602698Z http://localhost:9093/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:10:58.798599639Z http://localhost:9094/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:10:58.799491324Z http://localhost:9095/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:10:59.798163002Z http://localhost:9093/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:10:59.799861010Z http://localhost:9094/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:10:59.801677640Z http://localhost:9095/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:11:00.797094927Z http://localhost:9093/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:11:00.797959909Z http://localhost:9094/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:11:00.798928577Z http://localhost:9095/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:11:01.797341069Z http://localhost:9093/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:11:01.798137475Z http://localhost:9094/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:11:01.798849095Z http://localhost:9095/api/v1/alerts  52fb295b6100a4e8
ALERTS       2017-09-28T12:11:02.797696805Z http://localhost:9093/api/v1/alerts  1af4ca728e342bc6
NOTIFICATION 2017-09-28T12:11:02.799762683Z http://localhost.localdomain:9093 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 7e7f12e3cdafe518 7e7f12e3cdafe518
ALERTS       2017-09-28T12:11:02.800370351Z http://localhost:9094/api/v1/alerts  1af4ca728e342bc6
NOTIFICATION 2017-09-28T12:11:02.801080811Z http://localhost.localdomain:9095 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 7e7f12e3cdafe518 7e7f12e3cdafe518
NOTIFICATION 2017-09-28T12:11:02.802088282Z http://localhost.localdomain:9095 {}:{__name__="zzlqunurqsnprexlidrmgppwemgbhzyigbfgqiyedzsueibqu"} 52fb295b6100a4e8 52fb295b6100a4e8
ALERTS       2017-09-28T12:11:02.802431645Z http://localhost:9095/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:03.797531911Z http://localhost:9093/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:03.798497116Z http://localhost:9094/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:03.799472119Z http://localhost:9095/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:04.797435020Z http://localhost:9093/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:04.798329189Z http://localhost:9094/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:04.799134023Z http://localhost:9095/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:05.797141301Z http://localhost:9093/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:05.797893663Z http://localhost:9094/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:05.798607056Z http://localhost:9095/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:06.797640203Z http://localhost:9093/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:06.798508598Z http://localhost:9094/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:06.799346420Z http://localhost:9095/api/v1/alerts  1af4ca728e342bc6
ALERTS       2017-09-28T12:11:07.797450473Z http://localhost:9093/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:07.798437679Z http://localhost:9094/api/v1/alerts  b00fd90c0a5af067
NOTIFICATION 2017-09-28T12:11:07.798913473Z http://localhost.localdomain:9093 {}:{__name__="viljjpdisdmychdciattjgryfsxgkinrxuwzkplzqvzydyod"} bd89880863e2a021 bd89880863e2a021
ALERTS       2017-09-28T12:11:07.799909084Z http://localhost:9095/api/v1/alerts  b00fd90c0a5af067
NOTIFICATION 2017-09-28T12:11:07.800087283Z http://localhost.localdomain:9095 {}:{__name__="viljjpdisdmychdciattjgryfsxgkinrxuwzkplzqvzydyod"} bd89880863e2a021 bd89880863e2a021
NOTIFICATION 2017-09-28T12:11:07.800997761Z http://localhost.localdomain:9095 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 648bd891439bcede 7e7f12e3cdafe518 1af4ca728e342bc6
NOTIFICATION 2017-09-28T12:11:07.801192328Z http://localhost.localdomain:9093 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 648bd891439bcede 7e7f12e3cdafe518 1af4ca728e342bc6
NOTIFICATION 2017-09-28T12:11:07.805756040Z http://localhost.localdomain:9093 {}:{__name__="oiwwpnxbzvnglxqfmmgydouluripxyalq"} 1aed0cc371e7fbcf 1aed0cc371e7fbcf
ALERTS       2017-09-28T12:11:08.798768526Z http://localhost:9093/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:08.800274909Z http://localhost:9094/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:08.801675108Z http://localhost:9095/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:09.797689818Z http://localhost:9093/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:09.798762672Z http://localhost:9094/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:09.799715877Z http://localhost:9095/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:10.797560597Z http://localhost:9093/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:10.798576137Z http://localhost:9094/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:10.800462299Z http://localhost:9095/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:11.797450524Z http://localhost:9093/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:11.798319907Z http://localhost:9094/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:11.799047679Z http://localhost:9095/api/v1/alerts  b00fd90c0a5af067
ALERTS       2017-09-28T12:11:12.797160321Z http://localhost:9093/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:12.797774505Z http://localhost:9094/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:12.798617466Z http://localhost:9095/api/v1/alerts  8e4acded38d67e85
NOTIFICATION 2017-09-28T12:11:12.800176434Z http://localhost.localdomain:9095 {}:{__name__="njxugstzcglxwexppqfurzsxezpqvxjjded"} b00fd90c0a5af067 b00fd90c0a5af067
NOTIFICATION 2017-09-28T12:11:12.800200501Z http://localhost.localdomain:9093 {}:{__name__="effdxjecmjwlwywayerjlkbuuzqivrpucvqgqkwoqvnfgxvccl"} c7244ae83ae3fea1 c7244ae83ae3fea1
ALERTS       2017-09-28T12:11:13.798161225Z http://localhost:9093/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:13.799461912Z http://localhost:9094/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:13.800693146Z http://localhost:9095/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:14.797405230Z http://localhost:9093/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:14.798274577Z http://localhost:9094/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:14.799742374Z http://localhost:9095/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:15.797509504Z http://localhost:9093/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:15.798281623Z http://localhost:9094/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:15.799002244Z http://localhost:9095/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:16.797983761Z http://localhost:9093/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:16.799050168Z http://localhost:9094/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:16.800111599Z http://localhost:9095/api/v1/alerts  8e4acded38d67e85
ALERTS       2017-09-28T12:11:17.797050083Z http://localhost:9093/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:17.797719643Z http://localhost:9094/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:17.798317704Z http://localhost:9095/api/v1/alerts  c1acb3495e10a2
NOTIFICATION 2017-09-28T12:11:17.799264901Z http://localhost.localdomain:9095 {}:{__name__="hnzpguwyghtdrzcqdzwad"} 8e4acded38d67e85 8e4acded38d67e85
NOTIFICATION 2017-09-28T12:11:17.799505922Z http://localhost.localdomain:9093 {}:{__name__="zzlqunurqsnprexlidrmgppwemgbhzyigbfgqiyedzsueibqu"} 52fb295b6100a4e8 52fb295b6100a4e8
NOTIFICATION 2017-09-28T12:11:17.802368745Z http://localhost.localdomain:9093 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 648bd891439bcede 7e7f12e3cdafe518 1af4ca728e342bc6
NOTIFICATION 2017-09-28T12:11:17.806628738Z http://localhost.localdomain:9093 {}:{__name__="oiwwpnxbzvnglxqfmmgydouluripxyalq"} 1aed0cc371e7fbcf 1aed0cc371e7fbcf
ALERTS       2017-09-28T12:11:18.797415416Z http://localhost:9093/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:18.798154502Z http://localhost:9094/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:18.798871241Z http://localhost:9095/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:19.797547252Z http://localhost:9093/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:19.798858978Z http://localhost:9094/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:19.800230178Z http://localhost:9095/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:20.797779028Z http://localhost:9093/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:20.799083238Z http://localhost:9094/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:20.800423535Z http://localhost:9095/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:21.797995782Z http://localhost:9093/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:21.799232447Z http://localhost:9094/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:21.800423226Z http://localhost:9095/api/v1/alerts  c1acb3495e10a2
ALERTS       2017-09-28T12:11:22.797141634Z http://localhost:9093/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:22.797942369Z http://localhost:9094/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:22.798616673Z http://localhost:9095/api/v1/alerts  4639f01bcd88a94a
NOTIFICATION 2017-09-28T12:11:22.798907736Z http://localhost.localdomain:9095 {}:{__name__="kvqnqvendriqjccoxlekdkgacndbsnovmart"} c1acb3495e10a2 c1acb3495e10a2
NOTIFICATION 2017-09-28T12:11:22.799878579Z http://localhost.localdomain:9094 {}:{__name__="viljjpdisdmychdciattjgryfsxgkinrxuwzkplzqvzydyod"} bd89880863e2a021 bd89880863e2a021
NOTIFICATION 2017-09-28T12:11:22.800155402Z http://localhost.localdomain:9095 {}:{__name__="viljjpdisdmychdciattjgryfsxgkinrxuwzkplzqvzydyod"} bd89880863e2a021 bd89880863e2a021
NOTIFICATION 2017-09-28T12:11:22.800226625Z http://localhost.localdomain:9093 {}:{__name__="viljjpdisdmychdciattjgryfsxgkinrxuwzkplzqvzydyod"} bd89880863e2a021 bd89880863e2a021
NOTIFICATION 2017-09-28T12:11:22.800984398Z http://localhost.localdomain:9095 {}:{__name__="njxugstzcglxwexppqfurzsxezpqvxjjded"} b00fd90c0a5af067 b00fd90c0a5af067
NOTIFICATION 2017-09-28T12:11:22.801429645Z http://localhost.localdomain:9093 {}:{__name__="effdxjecmjwlwywayerjlkbuuzqivrpucvqgqkwoqvnfgxvccl"} c7244ae83ae3fea1 c7244ae83ae3fea1
NOTIFICATION 2017-09-28T12:11:22.802082100Z http://localhost.localdomain:9095 {}:{__name__="effdxjecmjwlwywayerjlkbuuzqivrpucvqgqkwoqvnfgxvccl"} c7244ae83ae3fea1 c7244ae83ae3fea1
ALERTS       2017-09-28T12:11:23.797362116Z http://localhost:9093/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:23.798340936Z http://localhost:9094/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:23.799126613Z http://localhost:9095/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:24.797276171Z http://localhost:9093/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:24.798113499Z http://localhost:9094/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:24.799086892Z http://localhost:9095/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:25.797799754Z http://localhost:9093/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:25.799005231Z http://localhost:9094/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:25.800244586Z http://localhost:9095/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:26.797174189Z http://localhost:9093/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:26.797882146Z http://localhost:9094/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:26.798485464Z http://localhost:9095/api/v1/alerts  4639f01bcd88a94a
ALERTS       2017-09-28T12:11:27.798229822Z http://localhost:9093/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:27.800011256Z http://localhost:9094/api/v1/alerts  a1bebc050a7fc94f
NOTIFICATION 2017-09-28T12:11:27.800128921Z http://localhost.localdomain:9095 {}:{__name__="mjwylwbywspwjuygvvlfzgqgkdgozcpfpvwqnwilfusr"} 4639f01bcd88a94a 4639f01bcd88a94a
NOTIFICATION 2017-09-28T12:11:27.800508342Z http://localhost.localdomain:9095 {}:{__name__="hnzpguwyghtdrzcqdzwad"} 8e4acded38d67e85 8e4acded38d67e85
NOTIFICATION 2017-09-28T12:11:27.801592165Z http://localhost.localdomain:9093 {}:{__name__="zzlqunurqsnprexlidrmgppwemgbhzyigbfgqiyedzsueibqu"} 52fb295b6100a4e8 52fb295b6100a4e8
ALERTS       2017-09-28T12:11:27.802008920Z http://localhost:9095/api/v1/alerts  a1bebc050a7fc94f
NOTIFICATION 2017-09-28T12:11:27.803019601Z http://localhost.localdomain:9095 {}:{__name__="zzlqunurqsnprexlidrmgppwemgbhzyigbfgqiyedzsueibqu"} 52fb295b6100a4e8 52fb295b6100a4e8
NOTIFICATION 2017-09-28T12:11:27.803533788Z http://localhost.localdomain:9093 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 648bd891439bcede 7e7f12e3cdafe518 1af4ca728e342bc6
NOTIFICATION 2017-09-28T12:11:27.807676598Z http://localhost.localdomain:9093 {}:{__name__="oiwwpnxbzvnglxqfmmgydouluripxyalq"} 1aed0cc371e7fbcf 1aed0cc371e7fbcf
ALERTS       2017-09-28T12:11:28.797078190Z http://localhost:9093/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:28.797782525Z http://localhost:9094/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:28.798762787Z http://localhost:9095/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:29.797523607Z http://localhost:9093/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:29.798469251Z http://localhost:9094/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:29.799279044Z http://localhost:9095/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:30.797526669Z http://localhost:9093/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:30.798474976Z http://localhost:9094/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:30.799320611Z http://localhost:9095/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:31.797399035Z http://localhost:9093/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:31.798510986Z http://localhost:9094/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:31.799491676Z http://localhost:9095/api/v1/alerts  a1bebc050a7fc94f
ALERTS       2017-09-28T12:11:32.797805946Z http://localhost:9093/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:32.799277870Z http://localhost:9094/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:32.800275768Z http://localhost:9095/api/v1/alerts  f759b670b1c31096
NOTIFICATION 2017-09-28T12:11:32.802986545Z http://localhost.localdomain:9095 {}:{__name__="krzztrbrvnvemygzedveprkgyxplsbbznvrq"} a1bebc050a7fc94f a1bebc050a7fc94f
NOTIFICATION 2017-09-28T12:11:32.802998020Z http://localhost.localdomain:9095 {}:{__name__="effdxjecmjwlwywayerjlkbuuzqivrpucvqgqkwoqvnfgxvccl"} 307dfc988b20ee37 c7244ae83ae3fea1 f759b670b1c31096
ALERTS       2017-09-28T12:11:33.797532933Z http://localhost:9093/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:33.798550985Z http://localhost:9094/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:33.799445253Z http://localhost:9095/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:34.797136900Z http://localhost:9093/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:34.798176352Z http://localhost:9094/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:34.799152471Z http://localhost:9095/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:35.798143531Z http://localhost:9093/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:35.799774171Z http://localhost:9094/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:35.801246510Z http://localhost:9095/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:36.797239527Z http://localhost:9093/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:36.798015926Z http://localhost:9094/api/v1/alerts  f759b670b1c31096
ALERTS       2017-09-28T12:11:36.798760116Z http://localhost:9095/api/v1/alerts  f759b670b1c31096
NOTIFICATION 2017-09-28T12:11:37.798437168Z http://localhost.localdomain:9093 {}:{__name__="kvqnqvendriqjccoxlekdkgacndbsnovmart"} c1acb3495e10a2 c1acb3495e10a2
NOTIFICATION 2017-09-28T12:11:37.799387696Z http://localhost.localdomain:9093 {}:{__name__="njxugstzcglxwexppqfurzsxezpqvxjjded"} b00fd90c0a5af067 b00fd90c0a5af067
NOTIFICATION 2017-09-28T12:11:37.800270884Z http://localhost.localdomain:9095 {}:{__name__="viljjpdisdmychdciattjgryfsxgkinrxuwzkplzqvzydyod"} bd89880863e2a021 bd89880863e2a021
NOTIFICATION 2017-09-28T12:11:37.804813077Z http://localhost.localdomain:9093 {}:{__name__="vxezaawdsdwcvvuvryyabvkvbgdqlcqstgddkefmpdrjp"} 648bd891439bcede 7e7f12e3cdafe518 1af4ca728e342bc6
NOTIFICATION 2017-09-28T12:11:37.808742736Z http://localhost.localdomain:9093 {}:{__name__="oiwwpnxbzvnglxqfmmgydouluripxyalq"} 1aed0cc371e7fbcf 1aed0cc371e7fbcf
NOTIFICATION 2017-09-28T12:11:42.798819444Z http://localhost.localdomain:9093 {}:{__name__="mjwylwbywspwjuygvvlfzgqgkdgozcpfpvwqnwilfusr"} 4639f01bcd88a94a 4639f01bcd88a94a
NOTIFICATION 2017-09-28T12:11:42.799761007Z http://localhost.localdomain:9093 {}:{__name__="hnzpguwyghtdrzcqdzwad"} 8e4acded38d67e85 8e4acded38d67e85
NOTIFICATION 2017-09-28T12:11:42.800567231Z http://localhost.localdomain:9095 {}:{__name__="hnzpguwyghtdrzcqdzwad"} 8e4acded38d67e85 8e4acded38d67e85
NOTIFICATION 2017-09-28T12:11:42.801940804Z http://localhost.localdomain:9094 {}:{__name__="zzlqunurqsnprexlidrmgppwemgbhzyigbfgqiyedzsueibqu"} 52fb295b6100a4e8 52fb295b6100a4e8

We can see that first notifications go out only once, as expected. However, when those notifications get repeated, for example on lines 67 to 69, they are sent simultaneously by two different Alertmanagers. This happens multiple times throughout the test; sometimes even all three Alertmanager instances send the notification.
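One way to spot these cases mechanically is sketched below in Python. This is an illustration, not part of the tooling: `find_duplicates` is a hypothetical helper, and it assumes that two notifications count as "simultaneous" when they share the same group key and alerts hash and fall within the same wall-clock second (duplicates that straddle a second boundary would be missed).

```python
from collections import defaultdict


def find_duplicates(lines):
    """Return notifications that more than one instance sent in the same second.

    Keys are (second, group key, alerts hash); values are the sorted
    instance URLs that sent that notification.
    """
    senders = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "NOTIFICATION":
            continue  # ignore ALERTS events and blank lines
        timestamp, instance, group_key, group_hash = fields[1:5]
        second = timestamp[:19]  # "2017-09-28T12:10:57.80..." -> "2017-09-28T12:10:57"
        senders[(second, group_key, group_hash)].add(instance)
    return {key: sorted(insts) for key, insts in senders.items() if len(insts) > 1}
```

Run against the three NOTIFICATION lines at 12:10:57, this reports the `oiwwpnxbzvnglxqfmmgydouluripxyalq` group as sent by both the :9093 and :9095 instances.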

I have yet to validate my suspicion in the code. However, I believe this is happening because, while for the initial "group wait" each Alertmanager instance waits an additional (position in the mesh × 5 seconds), there is no such behavior for repeated notifications. Instead, the instances race over who manages to send out the notification and gossip it first, so the notification is sometimes de-duplicated and sometimes not.

Let me know what you think and whether anything I said doesn't make any sense :slightly_smiling_face:.

I'll be sharing the tooling I'm building at some point. It's at a very early stage and constantly changing right now, but I'm hoping it can become a useful tool to simulate scenarios for HA clusters.

@stuartnelson3 @fabxc (feel free to tag anyone who might be interested as well)

brancz avatar Sep 28 '17 12:09 brancz

Thanks for working on this!

Without having looked at any code to check, your initial hypothesis for repeated notifications not waiting (position*5sec) does sound like it could be the culprit.

What is your priority currently between further refining the tooling and trying to solve this (and probably other) bugs? Are you focused on finishing the tooling and then addressing the bugs, or looking at the bugs as they're discovered?

stuartnelson3 avatar Sep 28 '17 12:09 stuartnelson3

Regarding the tooling, I think it'll go hand in hand with the tests we are developing and the bugs we find. The tooling can be shared as is; I just feel that at this point it's more confusing than helpful, given the rate of change happening to it right now (I literally wrote the first lines of code a few days ago). I should be able to alternate between developing the tooling and fixing the bugs found with it, since finding the more complex bugs will probably require the tooling to get better.

brancz avatar Sep 28 '17 13:09 brancz

@brancz Think this is related to my last comment on the issue https://github.com/prometheus/alertmanager/issues/964.

Prometheus will call Alertmanager for the same issue a couple of times. Suppose that 1 call was made: the AMs sync it and send only 1 notification. But Prometheus keeps calling the AMs a few more times, and because you configured repeat_interval to a very small value, they start resending those notifications.

Does that make any sense?

josedonizetti avatar Oct 09 '17 21:10 josedonizetti

The same thing happens when the repeat_interval is larger. The problem is that once the first notification was sent out all Alertmanager instances will wait until last notification time + repeat_interval, which will make them race against each other regardless of the value of repeat_interval.
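A toy model of that claim, in Python, is sketched below. It only encodes the hypothesis as stated here (every instance treats the repeat as due at last notification time + repeat_interval, with no per-instance offset); it is not taken from the Alertmanager code, and `next_repeat_times` is a made-up name.

```python
def next_repeat_times(last_notification_s, repeat_interval_s, instances):
    """Time (in seconds) at which each instance considers the repeat due.

    Assumption from the discussion: no instance adds a position-based
    offset, so every instance computes the same due time.
    """
    return {name: last_notification_s + repeat_interval_s for name in instances}


instances = ["am1", "am2", "am3"]
short = next_repeat_times(0, 10, instances)    # repeat_interval: 10s
long = next_repeat_times(0, 3600, instances)   # repeat_interval: 1h
```

In both cases all three instances end up with the same due time, which is why a larger repeat_interval would only delay the race rather than avoid it.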

I should soon be able to dig into this issue further.

brancz avatar Oct 10 '17 08:10 brancz

@brancz A single Alertmanager without high availability also has the same problem. Alertmanager version: 0.8.0, resolve_timeout: 30m

With repeat_interval: 2m the resolved message is sent 6~7 times; with repeat_interval: 7m the resolved message is sent ~3 times.

tangr avatar Oct 13 '17 04:10 tangr

@brancz The "duplicate repeated notifications" should be because Prometheus keeps sending the notification to Alertmanager continuously for 15 minutes. So if the notification send interval is less than 15 minutes, will it send duplicate repeats? And is the send interval group_interval + repeat_interval? https://github.com/prometheus/prometheus/blob/master/rules/alerting.go#L151-L156

It's also hardcoded. It would be better to make it configurable, or to change the 15 minutes to 5 or 10 minutes; 15 minutes is too long in production.

tangr avatar Oct 17 '17 03:10 tangr

I also have the same problem with HA: 2 Prometheus + 2 AM.

I have 3 receivers, jira over webhook, slack and pushover.

Alertmanager config for both instances is:

global:
  resolve_timeout: 5m
  
route:
  receiver: jira
  group_by:
  - severity
  routes:
  - receiver: jira
    match:
      severity: Lowest
    continue: true
  - receiver: jira
    match:
      severity: Low
    continue: true
  - receiver: slack
    match:
      severity: Low
    continue: true
  - receiver: jira
    match:
      severity: High
    continue: true
  - receiver: slack
    match:
      severity: High
    continue: true
  - receiver: jira
    match:
      severity: Highest
    continue: true
  - receiver: slack
    match:
      severity: Highest
    continue: true
  - receiver: push
    match:
      severity: Highest
    continue: true
  group_wait: 1m
  group_interval: 15m
  repeat_interval: 1h
receivers:
- name: slack
  slack_configs:
  - send_resolved: false
    api_url: <secret>
    channel: '#prometheus'
    username: '{{ template "slack.default.username" . }}'
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '{{ .CommonAnnotations.SUMMARY }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    pretext: '{{ template "slack.default.pretext" . }}'
    text: '{{ .CommonAnnotations.DESCRIPTION }}'
    fallback: '{{ template "slack.default.fallback" . }}'
    icon_emoji: '{{ template "slack.default.iconemoji" . }}'
    icon_url: '{{ template "slack.default.iconurl" . }}'
- name: push
  pushover_configs:
  - send_resolved: true
    user_key: <secret>
    token: <secret>
    title: '{{ template "pushover.default.title" . }}'
    message: '{{ template "pushover.default.message" . }}'
    url: '{{ template "pushover.default.url" . }}'
    priority: '{{ if eq .Status "firing" }}2{{ else }}0{{ end }}'
    retry: 1m0s
    expire: 1h0m0s
- name: jira
  webhook_configs:
  - send_resolved: false
    url: http://127.0.0.1:8000
templates:
- /etc/alertmanager/default.tmpl

Pushover and Slack work fine and don't send duplicates, but the problem is with the webhook, because it is a POST request. When AM sends a notification, both AMs send many POST requests over the webhook:

AM1:

127.0.0.1 - - [27/Oct/2017 13:38:42] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [27/Oct/2017 13:38:45] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [27/Oct/2017 13:38:47] "POST / HTTP/1.1" 200 -
....

AM2:

127.0.0.1 - - [27/Oct/2017 13:38:37] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [27/Oct/2017 13:38:39] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [27/Oct/2017 13:38:42] "POST / HTTP/1.1" 200 -
....

EDIT:

I tried disabling HA and configuring only one Prometheus and one Alertmanager, but the problem is the same!
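
To put numbers on the webhook logs above, here is a minimal sketch that parses the AM1 access-log lines and computes the gap between consecutive POSTs. The helper name `post_gaps` is made up for illustration, and the parsing assumes the log format shown above with English month abbreviations.

```python
from datetime import datetime

def post_gaps(log_lines):
    """Return the seconds between consecutive requests in an access log."""
    stamps = []
    for line in log_lines:
        # The timestamp sits between the first '[' and the following ']'.
        raw = line.split("[", 1)[1].split("]", 1)[0]
        stamps.append(datetime.strptime(raw, "%d/%b/%Y %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# The three AM1 lines quoted above.
am1_log = [
    '127.0.0.1 - - [27/Oct/2017 13:38:42] "POST / HTTP/1.1" 200 -',
    '127.0.0.1 - - [27/Oct/2017 13:38:45] "POST / HTTP/1.1" 200 -',
    '127.0.0.1 - - [27/Oct/2017 13:38:47] "POST / HTTP/1.1" 200 -',
]

gaps = post_gaps(am1_log)
print(gaps)
```

The gaps are a few seconds each, far below both the configured group_interval (15m) and repeat_interval (1h).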

petarkozic avatar Oct 27 '17 11:10 petarkozic

I've played a bit with ambench and indeed I see the same issue using the master branch. Digging further into the code (and with the help of additional traces), I can explain what happens.

Assume 3 Alertmanager instances where am1, am2 and am3 have positions 0, 1 and 2 respectively in the cluster. I've set the --cluster.peerTimeout value to 5s, meaning that before actually flushing an alert group, am2 waits an extra 5s and am3 an extra 10s (this is the WaitStage).
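
As a rough sketch of that rule, assuming the extra delay is simply the peer position multiplied by the peer timeout (which is what the numbers above imply; the function name is hypothetical and this is not Alertmanager's actual code):

```python
from datetime import timedelta

def wait_stage_delay(position: int, peer_timeout: timedelta) -> timedelta:
    """Extra time an instance waits before sending, given its cluster position."""
    # Assumption: delay grows linearly with position (0 -> no wait).
    return position * peer_timeout

peer_timeout = timedelta(seconds=5)
delays = {name: wait_stage_delay(pos, peer_timeout)
          for name, pos in [("am1", 0), ("am2", 1), ("am3", 2)]}
print(delays)
```

With a 5s timeout this gives no wait for am1, 5s for am2 and 10s for am3.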

The AM configuration is:

global:
  resolve_timeout: 60m

route:
  group_by: ['__name__']
  group_wait: 5s
  group_interval: 20s
  repeat_interval: 40s
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:8080/notify'
    send_resolved: false

1. t=-5s, an alert fires, creating a new alert group on all Alertmanager instances.
2. t=0s, am1, am2 and am3 flush the notification. am2 and am3 wait for 5s and 10s respectively before actually sending the notification.
3. t=0s, am1 sends the notification and waits for another 20s (group_interval).
4. t=0s++, the alert group gets another alert from ambench. Since the flush call hasn't finished on am2 and am3, both reset the next timer to zero.
5. t=5s, am2 sees that am1 has sent the notification. Since the timer was reset in step 4, it resets the timer for another 20s (group_interval) and flushes the group.
6. t=10s, am3 sees that am1 has sent the notification. Since the timer was reset in step 4, it resets the timer for another 20s (group_interval) and flushes the group.
7. t=20s, am1 wakes up, resets the timer for another 20s and flushes the group.
8. t=25s, am2 wakes up, resets the timer for another 20s and flushes the group (including a 5s wait).
9. t=30s, am3 wakes up, resets the timer for another 20s and flushes the group (including a 10s wait).

At t=40s, am1 flushes the notification and sends it to the receiver because repeat_interval is over. Almost at the same time, am3 exits from its 10s WaitStage and it will race with am1 to send the notification too.

The same timeline but with a drawing:

image

The bold vertical bars are the flush calls and the green boxes represent the WaitStage intervals.
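
The timeline can also be replayed with a small model. This is only a sketch of the steps described above, not Alertmanager's dispatcher: the constants come from the configuration and peer timeout mentioned earlier, and `send_attempts` is a made-up helper that assumes instances with a non-zero wait re-flush as soon as their first flush returns (the timer reset in step 4).

```python
GROUP_INTERVAL = 20   # seconds, from the route configuration
REPEAT_INTERVAL = 40  # seconds, from the route configuration
PEER_TIMEOUT = 5      # seconds, the peer timeout used in this test

def send_attempts(position, horizon=40):
    """Times at which an instance leaves its wait stage and may send."""
    wait = position * PEER_TIMEOUT
    flush_starts = [0]  # every instance starts its first flush at t=0
    # am1 has no wait, so its first flush ended before the extra alert
    # arrived; the others saw their timer reset and flush again right away.
    next_start = wait if wait > 0 else GROUP_INTERVAL
    while next_start <= horizon:
        flush_starts.append(next_start)
        next_start += GROUP_INTERVAL
    return [start + wait for start in flush_starts if start + wait <= horizon]

FIRST_SENT = 0  # am1 sends the first notification at t=0

# For each instance, keep the attempts where repeat_interval has elapsed.
racers = {}
for name, position in [("am1", 0), ("am2", 1), ("am3", 2)]:
    racers[name] = [t for t in send_attempts(position)
                    if t - FIRST_SENT >= REPEAT_INTERVAL]

print(racers)  # {'am1': [40], 'am2': [], 'am3': [40]}
```

Under these assumptions both am1 and am3 reach the send step at t=40s with repeat_interval already elapsed, which is the race described above.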

I've got a couple of ideas to reduce the likelihood of this scenario, but they probably won't cover all cases.

simonpasquier avatar Mar 23 '18 15:03 simonpasquier

I had the same problem, but it only occurs when the configuration file is reloaded while an alert is being sent.

lihongmei918 avatar Jun 01 '22 03:06 lihongmei918