alertmanager

Alert resolving not working as expected

Open · edenkoveshi opened this issue 2 years ago

What did you do? I am trying to send a "resolved" message for alerts that have been resolved unsuccessfully. It does send them sometimes, but most of the time it doesn't. I have adjusted resolve_timeout, group_wait, group_interval and repeat_interval many times, but nothing seems to fix the problem.

What did you expect to see? I am expecting to get a "resolved" message at most resolve_timeout after the alert has been resolved.

Environment: Running Alertmanager v0.23.0 with Prometheus Operator 0.52.1 (this has also happened on 0.40.0). Alertmanager is connected to a Thanos Ruler (Thanos v0.21.0) and sends alerts to a webhook.

  • Alertmanager configuration file:
global:
  resolve_timeout: 3m
route:
  receiver: default
  group_by:
    - alertname
    - namespace
    - instance
    - pod
    - statefulset
    - deployment
    - job_name
    - persistentvolumeclaim
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 1h
  routes:
    - receiver: my-webhook
      match:
        alert: true
receivers:
  - name: default
  - name: my-webhook
    webhook_configs:
      - send_resolved: true
        url: <my-webhook-url>
        max_alerts: 1
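
To see what the webhook is actually receiving, here is a minimal receiver sketch in Go that logs whether each delivered notification is firing or resolved, along with each alert's endsAt, so the delivery delay can be measured. The port and path are placeholders, not part of this setup; the JSON field names are the ones documented for the Alertmanager webhook payload.

// Minimal webhook receiver sketch for debugging resolved-notification delivery.
// Port and path (":8080", "/alert") are assumptions; point the route's url at it.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Subset of the Alertmanager webhook payload we care about here.
type webhookPayload struct {
	Status string `json:"status"` // "firing" or "resolved" for the whole group
	Alerts []struct {
		Status   string            `json:"status"`
		Labels   map[string]string `json:"labels"`
		StartsAt time.Time         `json:"startsAt"`
		EndsAt   time.Time         `json:"endsAt"`
	} `json:"alerts"`
}

func main() {
	http.HandleFunc("/alert", func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Log arrival time, status and endsAt for each alert in the notification,
		// so the gap between endsAt and delivery of the "resolved" message is visible.
		for _, a := range p.Alerts {
			log.Printf("received %s notification for %s (startsAt=%s endsAt=%s)",
				a.Status, a.Labels["alertname"], a.StartsAt, a.EndsAt)
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Watching this log makes it easy to tell whether the resolved notifications are arriving late or not arriving at all.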

edenkoveshi · Jul 21 '22

Can anyone help with this?

edenkoveshi · Jul 31 '22

@edenkoveshi What do you mean "Resolved unsuccessfully"?

benridley · Sep 29 '22

"I am expecting to get a "resolved" message at most resolve_timeout after the alert has been resolved."

resolve_timeout only comes into play when the sender didn't provide an end date for the alert. That isn't the case with Prometheus and Thanos Ruler, since both set the end date to "eval time + 5m".
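
As a rough worked example with the configuration above (assuming the "eval time + 5m" end date mentioned here): once the alert stops firing, its end date, roughly 5 minutes after the last evaluation that saw it firing, has to pass before Alertmanager considers it resolved, and the resolved notification then goes out on the next flush of the group, up to group_interval (2m here) later. So a delay of several minutes is expected regardless of the resolve_timeout: 3m setting.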

simonpasquier · Sep 29 '22