doc: Propose using Memberlist Keyring to protect a cluster

Open zecke opened this issue 5 years ago • 4 comments

Create a document to propose an easy (implementation and operation) way to protect the production cluster from accidental and unwanted members.

Provide a reference implementation in addition to the design document.

TODO(zecke): Figure out how to test this feature properly.

Signed-off-by: Holger Hans Peter Freyther [email protected]

zecke avatar Mar 21 '20 16:03 zecke

Thanks for your PR. We're looking at adding TLS generally, so don't want to add other auth systems.

brian-brazil avatar Mar 21 '20 18:03 brian-brazil

Thank you for your reply, and sorry for being late to the party. I have seen the design document and wanted to propose a simpler design for a narrower problem. If we focus on integrity and authentication (e.g. something provided by an HMAC) and leave out confidentiality (e.g. ignore known plaintext in gossiped messages), we end up with a solution that is orders of magnitude easier to implement and operate.
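To make the "integrity and authentication without confidentiality" idea concrete, here is a minimal Go sketch using only the standard library. It is an illustration under stated assumptions, not the proposed reference implementation and not the memberlist Keyring API: the helper names `sign` and `verify`, the packet layout (payload followed by tag), and the example keys are all hypothetical. Replay protection is deliberately left out.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"errors"
	"fmt"
)

// tagSize is the length of an HMAC-SHA256 tag in bytes.
const tagSize = sha256.Size

// sign returns msg followed by its HMAC-SHA256 tag.
// The payload stays in plaintext: no confidentiality is provided.
// (Hypothetical helper, not part of Alertmanager or memberlist.)
func sign(key, msg []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(msg)
	out := make([]byte, 0, len(msg)+tagSize)
	out = append(out, msg...)
	return mac.Sum(out) // Sum appends the tag to out
}

// verify splits packet into payload and tag and checks the tag
// in constant time. It returns the payload only if the tag matches.
func verify(key, packet []byte) ([]byte, error) {
	if len(packet) < tagSize {
		return nil, errors.New("packet too short")
	}
	split := len(packet) - tagSize
	msg, tag := packet[:split], packet[split:]
	mac := hmac.New(sha256.New, key)
	mac.Write(msg)
	if !hmac.Equal(tag, mac.Sum(nil)) {
		return nil, errors.New("invalid tag")
	}
	return msg, nil
}

func main() {
	prodKey := []byte("example-prod-key") // assumed shared cluster key
	packet := sign(prodKey, []byte("gossip payload"))

	// A member holding the same key accepts the packet.
	if msg, err := verify(prodKey, packet); err == nil {
		fmt.Printf("accepted: %s\n", msg)
	}
	// A member of another cluster (different key) rejects it.
	if _, err := verify([]byte("example-staging-key"), packet); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

A node that does not hold the shared key cannot produce packets that pass `verify`, which is the property needed to keep accidental or unwanted members from merging into the cluster.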

Going all in on TCP + X509 + TLS is nice but has certain consequences for operating an AM:

  • Certificates will expire "unexpectedly". I listed three major companies not able to renew certificates in time. It's bound to happen for many users/orgs. The failure for AM will be less dramatic as it fails open, but it is a failure mode nevertheless.

  • TLS is difficult to implement. Connections must be dropped when certificates expire, are revoked, etc. Clocks need to be roughly synchronized (not sure what requirement on time we have today).

  • TCP for everything brings wanted and unwanted side-effects. The exposure to head of line blocking is one of them.

zecke avatar Mar 22 '20 10:03 zecke

Thanks for putting work into this and writing a design document.

Just for documentation purposes, I am linking here the initial issue https://github.com/prometheus/alertmanager/issues/1322, the design doc for Membership over TLS, and the corresponding work-in-progress pull request https://github.com/prometheus/alertmanager/pull/1819.

> TCP for everything brings wanted and unwanted side-effects. The exposure to head of line blocking is one of them.

TCP head-of-line blocking happens per connection. Given the low bandwidth usage of the gossip protocol, I doubt this would be an issue. Please correct me if I am missing something.

Having a simple solution for the problem of distinct clusters merging would be great. On the other hand, I do see the maintenance overhead of eventually maintaining two solutions.

mxinden avatar Mar 22 '20 21:03 mxinden

In the meantime, TLS-based securing of the cluster has been implemented in https://github.com/prometheus/alertmanager/issues/1322 (still marked EXPERIMENTAL, though).

Regarding certificate handling: a lot has changed here. Automatic issuance (think ACME, Vault, cert-manager) is much more common today.

@zecke do you still want to pursue this?

TheMeier avatar Nov 03 '25 13:11 TheMeier