
When an alert resolves, the still-firing alerts are sent out along with it

ktpktr0 opened this issue 7 months ago • 0 comments

What did you do?

I use DingTalk as the alert receiver. When an alert resolves, Alertmanager sends the still-firing alerts out together with the resolved one, even though those alerts have not yet reached the repeat interval.
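
If I understand Alertmanager's grouping correctly, this is tied to the route's group_by: notifications are sent per group, and a group notification carries every alert currently in that group, both firing and resolved (send_resolved is true in my receiver). My reading of the relevant route settings, copied from the config below, as a sketch:

route:
  # assumption: 'status' is not a label my alerting rules set, so in practice
  # all instances of one alertname end up in a single group
  group_by: ['alertname', 'status']
  # when the group changes (e.g. one member resolves), the whole group is
  # re-notified after this interval, re-listing the still-firing alerts
  group_interval: 2m
  # only a group with no changes waits this long before being re-sent
  repeat_interval: 1h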

What did you expect to see?

Resolved alerts sent on their own, separate from the still-firing ones. With version 0.22 this seemed to work properly.
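
One grouping change I may try as a workaround (a sketch only, not yet verified, and assuming each blackbox target carries its own instance label) would put every target in its own group, so a resolution only notifies for that one alert:

route:
  receiver: 'default'
  group_wait: 15s
  group_interval: 2m
  repeat_interval: 1h
  # adding 'instance' keeps each probed target in a separate group
  group_by: ['alertname', 'instance']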

What did you see instead? Under which circumstances?

[FIRING: 2] Site time consumption is too high
Alarm level: Warning
Alarm status: Problem
Alarm host: https://pay.xxx.cn/
Trigger time: November 19, 2024 20:01:58
Alarm details:
_ (): Current site time consumption: 1.542s
> https://pay.xxx.cn
Alarm label:
Alerttype: domain
Hostname: 192.168.0.20
Job: Blackbox_HTTP
Module: http_2xx
Service: Mall Access Address

Alarm level: Warning
Alarm status: Problem
Alarm host: https://www.xxx.net
Trigger time: November 19, 2024 19:45:28
Alarm details:
_ (): Current site time consumption: 1.927 seconds
> https://www..net
Alarm label:
Alerttype: domain
Hostname: 192.168.0.72
Job: Blackbox_HTTP
Module: http_2xx
Service: Restaurant System - Stall Cashier End

Alarm level: Warning
Alarm status: OK
Alarm host: https://www.xxx.com/
Trigger time: November 19, 2024 20:01:13
End time: November 19, 2024 20:02:13
Alarm details:
_ (): Current site time consumption: 808.1ms
> https://www.xxx.com/
Alarm label:
Alerttype: domain
Hostname: 192.168.0.127
Job: Blackbox_HTTP
Module: http_2xx
Service: backend

Environment

  • System information:

Linux 4.18.0-305.3.1.el8.x86_64 x86_64

  • Alertmanager version:

prom/alertmanager:v0.26.0

  • Prometheus version:

prom/prometheus:v2.47.0

  • Alertmanager configuration file:
global:
  resolve_timeout: 2m
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'CDDEHUGOTSNVOXQV'
  smtp_hello: '163.com'
  smtp_require_tls: false

templates:
  - "/etc/alertmanager/template/*.tmpl"

route:
  receiver: 'default'
  group_wait: 15s
  group_interval: 2m
  repeat_interval: 1h
  group_by: ['alertname', 'status']
  routes:
  - receiver: 'linux_business'
    group_wait: 15s
    group_interval: 2m
    repeat_interval: 1h
    matchers: [job = 'zt_business', monster != 'warning']

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    #equal: ['alertname', 'instance']
    equal: ['instance']

receivers:
  - name: 'default'
    #email_configs:
      #- to: '[email protected]'
      #  html: '{{ template "email.html" . }}'
      #  headers:
      #    #subject: '{{ template "__subject" . }}'
      #    subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }} {{ if eq .Status "resolved" }}:{{ .Alerts.Resolved | len }}{{ end }}] {{ .CommonLabels.alertname }}'
      #  send_resolved: true
    webhook_configs:
    - url: 'http://192.168.0.20:8060/dingtalk/webhook1/send'
      send_resolved: true
      max_alerts: 30
  - name: 'linux_business'
    #email_configs:
    #- to: '[email protected]'
      #send_resolved: true
  • Prometheus configuration file:

  • Logs:
ts=2023-12-27T07:47:25.813Z caller=cluster.go:683 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2023-12-27T07:47:25.844Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2023-12-27T07:47:25.845Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2023-12-27T07:47:25.848Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9093
ts=2023-12-27T07:47:25.848Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9093
ts=2023-12-27T07:47:27.814Z caller=cluster.go:708 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000835954s
ts=2023-12-27T07:47:35.816Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002971602s
ts=2023-12-27T07:48:18.910Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="default/webhook[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: http://192.168.0.20:8060/dingtalk/webhook1/send: Unable to talk to DingTalk\n"
ts=2023-12-27T07:50:18.756Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="default/webhook[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: http://192.168.0.20:8060/dingtalk/webhook1/send: Unable to talk to DingTalk\n"
ts=2023-12-27T07:52:18.756Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="default/webhook[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: http://192.168.0.20:8060/dingtalk/webhook1/send: Unable to talk to DingTalk\n"
ts=2023-12-27T07:54:19.825Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="default/webhook[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: http://192.168.0.20:8060/dingtalk/webhook1/send: Unable to talk to DingTalk\n"
ts=2023-12-27T08:10:02.063Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2023-12-27T08:10:02.063Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2023-12-27T08:16:48.717Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2023-12-27T08:16:48.717Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2023-12-27T08:22:30.508Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2023-12-27T08:22:30.508Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2024-01-09T08:43:14.328Z caller=main.go:594 level=info msg="Received SIGTERM, exiting gracefully..."
ts=2024-01-09T08:43:14.735Z caller=main.go:245 level=info msg="Starting Alertmanager" version="(version=0.26.0, branch=HEAD, revision=d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)"
ts=2024-01-09T08:43:14.735Z caller=main.go:246 level=info build_context="(go=go1.20.7, platform=linux/amd64, user=root@df8d7debeef4, date=20230824-11:11:58, tags=netgo)"
ts=2024-01-09T08:43:14.736Z caller=cluster.go:186 level=info component=cluster msg="setting advertise address explicitly" addr=172.18.0.2 port=9094
ts=2024-01-09T08:43:14.736Z caller=cluster.go:683 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2024-01-09T08:43:14.764Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2024-01-09T08:43:14.764Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
ts=2024-01-09T08:43:14.766Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9093
ts=2024-01-09T08:43:14.766Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9093
ts=2024-01-09T08:43:16.737Z caller=cluster.go:708 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000080158s
ts=2024-01-09T08:43:24.739Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.00271836s
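
For reference, the grouping Alertmanager has applied at any moment can be inspected through the v2 API, e.g. curl http://localhost:9093/api/v2/alerts/groups (run against the Alertmanager host; the host name here is an assumption), which lists alerts exactly as they are grouped for notification.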

ktpktr0 · Jan 19 '24 12:01