Alertmanager HA cluster rollout leads to a huge number of refiring alerts
We have been struggling for a long time with alerts refiring into our backend systems whenever our Alertmanager HA cluster restarts (for example, during a normal rollout, or during Kubernetes node upgrades and the evictions they require).
I attached a diagram showing the active, suppressed, and unprocessed alerts during a typical reboot, together with the notification rate during the rollout.
What surprises us is that the cluster seems to take a long time (15+ minutes in some cases) to stabilize again after a single pod restarts. We can see from the notification rate that our backend - in this case a custom ticketing system - is getting absolutely hammered by refiring events. The traffic seems to be a mix of notifications that should have been suppressed (and thus never end up as ticket events) and alerts that had already been notified but get erroneously notified again.
We are at a loss as to how to resolve these issues, and would really appreciate some guidance on what we should tune in our cluster to reduce the impact on our backends.
Technical details below.
Version: v0.26 (with a custom patch that includes https://github.com/prometheus/alertmanager/pull/3419)
Deployment mode: Prometheus Operator, config below
Alertmanager config: varied. Typical values for the alerts that are refiring:
group_wait: 1m
group_interval: 1m15s
repeat_interval: 30d
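For reference, these values sit on a route in the rendered Alertmanager configuration, roughly like the sketch below (the receiver name, matcher, and webhook URL are placeholders; only the timing values are from our setup):

```yaml
route:
  receiver: ticketing              # placeholder receiver name
  routes:
    - receiver: ticketing
      matchers:
        - team = "obs"             # placeholder matcher
      group_wait: 1m               # wait before the first notification for a new group
      group_interval: 1m15s        # wait before notifying about new alerts added to an existing group
      repeat_interval: 30d         # re-send an unchanged, still-firing group at most every 30 days
receivers:
  - name: ticketing                # placeholder webhook receiver for the ticketing backend
    webhook_configs:
      - url: https://example.internal/tickets
```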
The Alertmanager resource mostly uses default settings, but runs in HA mode:
alertmanagerConfigNamespaceSelector: {}
alertmanagerConfigSelector:
  matchLabels:
    alertmanagerConfig: obs
alertmanagerConfiguration:
  name: obs-alerts-base
automountServiceAccountToken: true
clusterPushpullInterval: 5s
containers:
  - name: alertmanager
    readinessProbe:
      initialDelaySeconds: 60
externalUrl: https://alertmgr.osdp.open.ch
image: alertmanager-linux-amd64:0.26.0-pr3419
listenLocal: false
logFormat: logfmt
logLevel: info
paused: false
podMetadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "9094"
    traffic.sidecar.istio.io/excludeOutboundPorts: "9094"
portName: http-web
replicas: 3
resources:
  limits:
    memory: 8Gi
  requests:
    cpu: 2000m
    memory: 4Gi
retention: 2160h
routePrefix: /
securityContext:
  fsGroup: 2000
  runAsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
serviceAccountName: obs-monitoring-alertmanager
storage:
  volumeClaimTemplate:
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: default
version: 0.26.0-pr3419
How are these alerts suppressed? Is it with silences or with inhibition rules? If the latter, this graph might make some sense to me: following a restart of Alertmanager, the inhibiting alerts may be arriving too late, causing alerts that were inhibited before the restart to send a notification.
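For clarity, by inhibition rules I mean entries like the following in the Alertmanager configuration (the labels here are only illustrative):

```yaml
inhibit_rules:
  # Mute warning-level alerts while a critical alert with the same
  # alertname and cluster labels is firing.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal:
      - alertname
      - cluster
```

After a restart, the source (inhibiting) alerts have to be re-sent by Prometheus before they can mute anything again, which is why the timing of their arrival matters here.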
Just wanted to check: have you configured --storage.path to point at a persistent directory, and the --cluster.peer flags so that the replicas are all talking to each other?
(I think you might be using some meta-configuration layer, but I'm not familiar with those.)
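With the Prometheus Operator, those flags are normally rendered into the generated StatefulSet for you; the alertmanager container args usually end up looking roughly like the sketch below (paths and peer names are illustrative and depend on the operator version and the name of your Alertmanager resource):

```yaml
containers:
  - name: alertmanager
    args:
      - --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
      - --storage.path=/alertmanager                 # should sit on the persistent volume
      - --data.retention=2160h
      - --web.external-url=https://alertmgr.osdp.open.ch
      - --cluster.listen-address=[$(POD_IP)]:9094
      - --cluster.peer=alertmanager-obs-0.alertmanager-operated:9094
      - --cluster.peer=alertmanager-obs-1.alertmanager-operated:9094
      - --cluster.peer=alertmanager-obs-2.alertmanager-operated:9094
      - --cluster.pushpull-interval=5s
```

As long as --storage.path points at the volume from storage.volumeClaimTemplate, the notification log and silence snapshots should survive a pod restart.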
The notification log should hold alerts that have already been sent to your backend, and hence prevent them from being re-notified after a restart, as shown in this picture.