alertmanager doc: Propose using Memberlist Keyring to protect a cluster

Create a document proposing a way, easy to both implement and operate, to protect the production cluster from accidental and unwanted members.

Provide a reference implementation in addition to the design document.

TODO(zecke): Figure out how to test this feature properly.

Signed-off-by: Holger Hans Peter Freyther [email protected]

Mar 21 '20 16:03 zecke

Thanks for your PR. We're looking at adding TLS generally, so don't want to add other auth systems.

Mar 21 '20 18:03 brian-brazil

Thank you for your reply and sorry for being late to the party. I have seen the design document and wanted to propose a simpler design for a narrower problem. If we focus on integrity and authentication (e.g. something provided by an HMAC) and leave out confidentiality (e.g. ignore known plaintext in gossiped messages), we end up with a solution that is orders of magnitude easier to implement and operate.
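
To illustrate the integrity and authentication property mentioned above, here is a minimal sketch using only the Go standard library. It is not the memberlist keyring API and not the reference implementation from this PR; the function names `sign` and `verify` and the example keys are hypothetical. It only shows that a message tagged with one cluster's shared key is rejected by a node holding a different key, without encrypting the payload.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"errors"
	"fmt"
)

// errBadTag is returned when a message fails authentication.
var errBadTag = errors.New("message authentication failed")

// sign returns payload followed by its HMAC-SHA256 tag.
// The payload itself stays in plaintext (no confidentiality).
func sign(key, payload []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	out := make([]byte, 0, len(payload)+sha256.Size)
	out = append(out, payload...)
	return mac.Sum(out) // Sum appends the tag to out
}

// verify splits msg into payload and tag, recomputes the tag with key,
// and returns the payload only if both tags match.
func verify(key, msg []byte) ([]byte, error) {
	if len(msg) < sha256.Size {
		return nil, errBadTag
	}
	split := len(msg) - sha256.Size
	payload, tag := msg[:split], msg[split:]
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	// hmac.Equal compares in constant time.
	if !hmac.Equal(tag, mac.Sum(nil)) {
		return nil, errBadTag
	}
	return payload, nil
}

func main() {
	// Hypothetical shared keys for two separate clusters.
	prod := []byte("production-cluster-key")
	staging := []byte("staging-cluster-key")

	msg := sign(prod, []byte("gossip payload"))

	_, err := verify(prod, msg)
	fmt.Println("accepted by production peer:", err == nil)

	_, err = verify(staging, msg)
	fmt.Println("accepted by staging peer:", err == nil)
}
```

A shared key like this never expires on its own, which is the operational trade-off against certificates that the points below are about.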

Going all in on TCP + X509 + TLS is nice but has certain consequences for operating an AM:

Certificates will expire "unexpectedly". I listed three major companies that were not able to renew certificates in time. It's bound to happen for many users/orgs. The failure for AM will be less dramatic as it fails open, but it is a failure mode nevertheless.
TLS is difficult to implement. Connections must be broken when certificates expire, are revoked, etc. Time needs to be roughly synchronized (not sure what requirement on time we have today).
TCP for everything brings wanted and unwanted side-effects. The exposure to head of line blocking is one of them.

Mar 22 '20 10:03 zecke

Thanks for putting work into this and writing a design document.

Just for documentation purposes, I am linking the initial issue https://github.com/prometheus/alertmanager/issues/1322, the design doc for Membership over TLS, and the corresponding work-in-progress pull request https://github.com/prometheus/alertmanager/pull/1819 here.

> TCP for everything brings wanted and unwanted side-effects. The exposure to head of line blocking is one of them.

TCP head of line blocking happens per connection. Given the low bandwidth usage of the gossip protocol, I doubt this would be an issue. Please correct me if I am missing something.

Having a simple solution for the problem of distinct clusters merging would be great. On the other hand, I do see the overhead of eventually maintaining two solutions.

Mar 22 '20 21:03 mxinden

TLS-based securing of the cluster has been implemented in https://github.com/prometheus/alertmanager/issues/1322 in the meantime (still marked EXPERIMENTAL, though).

Regarding certificate handling: a lot has changed here. Automatic issuance (think ACME, Vault, cert-manager) is much more common today.

@zecke do you still want to pursue this?

Nov 03 '25 13:11 TheMeier