Feature Request: Keep cluster peer name on restart

Open margau opened this issue 4 years ago • 3 comments

What did you do? Peer

What did you expect to see? When the alertmanager process is restarted, it should be recognized by the cluster as the old peer. A possible solution would be a manually specified peer name (like --cluster.peer-name=) or deterministic peer name generation.
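
To illustrate the deterministic option, here is a minimal sketch of deriving a name from values that stay the same across restarts. This is not Alertmanager code: stablePeerName is a hypothetical helper, and hashing the hostname together with the advertise address is only an assumption about which inputs would be stable. Whether memberlist would then treat the rejoining node as the old peer is not verified here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// stablePeerName derives a peer name from values that survive a restart.
// Hypothetical helper for illustration; not part of Alertmanager.
func stablePeerName(hostname, advertiseAddr string) string {
	// The NUL separator keeps ("a", "b:1") and ("ab", ":1") from hashing alike.
	sum := sha256.Sum256([]byte(hostname + "\x00" + advertiseAddr))
	// Keep the first 16 bytes (32 hex characters) as the name.
	return hex.EncodeToString(sum[:16])
}

func main() {
	// Example inputs; a real process might use os.Hostname() and the
	// address it advertises to the cluster (assumption).
	first := stablePeerName("am-0", "10.0.0.1:9094")
	afterRestart := stablePeerName("am-0", "10.0.0.1:9094")
	fmt.Println(first == afterRestart)
}
```

With a scheme like this, the name changes only when the hostname or advertise address changes, so it would not help in environments where those are reassigned on every restart.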

What did you see instead? Under which circumstances? When the process is restarted for whatever reason, the alertmanager joins as a new peer. The old peer is still listed as "failed", which is not correct.

Environment

  • System information: n/a

  • Alertmanager version: version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)

  • Prometheus version: n/a

  • Alertmanager configuration file: n/a

  • Prometheus configuration file: n/a

  • Logs: n/a

margau, Jun 07 '20 12:06

This sounds like it should not happen in the first place. Do you know why it is changing?

Overall this feature seems dangerous; it should probably work out of the box.

roidelapluie, Jun 28 '20 07:06

@roidelapluie the peer name has been auto-generated ever since clustering was refactored to use hashicorp/memberlist. I'm not sure whether there was a technical reason for this... I can only assume that not requiring an explicit peer name was deemed easier from a configuration point of view.

https://github.com/prometheus/alertmanager/blob/1895fde85692bd18dca69cbf96a7b67a3e519b22/cluster/cluster.go#L167-L170

simonpasquier, Jun 29 '20 11:06

I wanted to ask about the state of this issue. After recently restarting our Alertmanagers and looking at the dashboards, we noticed that the alertmanager_cluster_failed_peers metric from several peers was non-zero. This caused confusion until we realised that the restart was the cause and the cluster was in fact okay, although some information about the old peers had been retained. After about 6h the metric went back to zero; this seems to be controlled by --cluster.reconnect-timeout (we didn't override it).

In our case alertmanager_cluster_failed_peers was misleading, so an option to keep old peer names after a restart would be nice to have.

dmitrime, Jul 20 '22 07:07