Alertmanager HA cluster rollout leads to a huge number of refiring alerts
We have been struggling for a long time with alerts refiring into our backend systems whenever our Alertmanager HA cluster restarts (for example, during a normal rollout, or during Kubernetes node upgrades and the evictions they require).
I attached a diagram showing the active, suppressed, and unprocessed alerts during a typical reboot, together with the notification rate during the rollout.
What surprises us is that the cluster seems to take a long time (15+ minutes in some cases) to stabilize again after a single pod restarts. We can see from the notification rate that our backend - in this case a custom ticketing system - is getting absolutely hammered by refiring events. The traffic seems to be a mix of notifications that should have been suppressed (and thus never end up as ticket events) and alerts that had already been notified but get erroneously notified again.
We are at a loss as to how to resolve these issues, and would really appreciate some guidance on what we should tune in our cluster to reduce the impact on our backends.
Technical details below.
Version: v0.26 (with a custom patch that includes https://github.com/prometheus/alertmanager/pull/3419)
Deployment mode: Prometheus Operator, config below
Alertmanager config: varied. Typical values for the alerts that are refiring:
group_wait: 1m
group_interval: 1m15s
repeat_interval: 30d
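For reference, these values sit on a route in the rendered Alertmanager configuration, roughly like the sketch below (the receiver name, matcher, and webhook URL are placeholders; only the timing values are from our setup):

```yaml
route:
  receiver: ticketing              # placeholder receiver name
  routes:
    - receiver: ticketing
      matchers:
        - team = "obs"             # placeholder matcher
      group_wait: 1m               # wait before the first notification for a new group
      group_interval: 1m15s        # wait before notifying about new alerts added to an existing group
      repeat_interval: 30d         # re-send an unchanged, still-firing group at most every 30 days
receivers:
  - name: ticketing                # placeholder webhook receiver for the ticketing backend
    webhook_configs:
      - url: https://example.internal/tickets
```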
The Alertmanager resource mostly uses default settings, but runs in HA mode:
alertmanagerConfigNamespaceSelector: {}
alertmanagerConfigSelector:
  matchLabels:
    alertmanagerConfig: obs
alertmanagerConfiguration:
  name: obs-alerts-base
automountServiceAccountToken: true
clusterPushpullInterval: 5s
containers:
  - name: alertmanager
    readinessProbe:
      initialDelaySeconds: 60
externalUrl: https://alertmgr.osdp.open.ch
image: alertmanager-linux-amd64:0.26.0-pr3419
listenLocal: false
logFormat: logfmt
logLevel: info
paused: false
podMetadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "9094"
    traffic.sidecar.istio.io/excludeOutboundPorts: "9094"
portName: http-web
replicas: 3
resources:
  limits:
    memory: 8Gi
  requests:
    cpu: 2000m
    memory: 4Gi
retention: 2160h
routePrefix: /
securityContext:
  fsGroup: 2000
  runAsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
serviceAccountName: obs-monitoring-alertmanager
storage:
  volumeClaimTemplate:
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: default
version: 0.26.0-pr3419
How are these alerts suppressed? Is it with silences or with inhibition rules? If the latter, this graph might make some sense to me: following a restart of Alertmanager, the inhibiting alerts may be arriving too late, causing alerts that were inhibited before the restart to send a notification.
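For clarity, by inhibition rules I mean entries like the following in the Alertmanager configuration (the labels here are only illustrative):

```yaml
inhibit_rules:
  # Mute warning-level alerts while a critical alert with the same
  # alertname and cluster labels is firing.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal:
      - alertname
      - cluster
```

After a restart, the source (inhibiting) alerts have to be re-sent by Prometheus before they can mute anything again, which is why the timing of their arrival matters here.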
Just wanted to check: have you configured --storage.path to point at a persistent directory, and the --cluster.peer flags so that the replicas are all talking to each other?
(I think you might be using some meta-configuration layer, but I'm not familiar with those.)
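With the Prometheus Operator, those flags are normally rendered into the generated StatefulSet for you; the alertmanager container args usually end up looking roughly like the sketch below (paths and peer names are illustrative and depend on the operator version and the name of your Alertmanager resource):

```yaml
containers:
  - name: alertmanager
    args:
      - --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
      - --storage.path=/alertmanager                 # should sit on the persistent volume
      - --data.retention=2160h
      - --web.external-url=https://alertmgr.osdp.open.ch
      - --cluster.listen-address=[$(POD_IP)]:9094
      - --cluster.peer=alertmanager-obs-0.alertmanager-operated:9094
      - --cluster.peer=alertmanager-obs-1.alertmanager-operated:9094
      - --cluster.peer=alertmanager-obs-2.alertmanager-operated:9094
      - --cluster.pushpull-interval=5s
```

As long as --storage.path points at the volume from storage.volumeClaimTemplate, the notification log and silence snapshots should survive a pod restart.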
The notification log should hold alerts that have already been sent to your backend, and hence prevent them from being re-notified after a restart, as shown in this picture.