mimir icon indicating copy to clipboard operation
mimir copied to clipboard

Docs: Enable cluster label verification in Helm

Open krajorama opened this issue 3 years ago • 8 comments

Describe the bug

Two separate memberlist gossip rings may join into one when IP addresses are reused between the two due to for example a kubernetes migration restarting a lot of PODs quickly.

To Reproduce

N/A incident RCA exists

Expected behavior

Gossip rings do not interact

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Additional Context

https://github.com/grafana/mimir/pull/2349

Requires a three step migration, so major version bump and migration document needed.

krajorama avatar Aug 31 '22 07:08 krajorama

This looks like a docs issue to me. The migration is possible right now, helm doesn't prevent it. We can document how to do the migration and then enable this by default and do a major release.

dimitarvdimitrov avatar Sep 01 '22 13:09 dimitarvdimitrov

This looks like a docs issue to me. The migration is possible right now, helm doesn't prevent it. We can document how to do the migration and then enable this by default and do a major release.

Yes, I agree, I was thinking about how to write a migration path, but it's three steps, not super complicated. And the user can keep backwards compatibility with one setting in mimir.structuredConfig.memberlist.cluster_label_verification_disabled: true which is also the first step to do a migration.

krajorama avatar Sep 01 '22 14:09 krajorama

Please link to this from the (yet to be completed) "Running Helm in production" document

mattmendick avatar Oct 11 '22 15:10 mattmendick

@lamida said he's interested in picking this up, so I'm posting my notes to help pick this up

Breaking change

The reason for doing this migration is because making all changes in one go is a breaking change. The agreement with @krajorama is that we want to eventually have cluster label verification enabled by default in the chart.

We have a deprecation policy of 2 major releases. I think it might make sense to follow the same period for this breaking change. Otherwise, a period of 6 months to a year sounds viable for this. I'm curious to hear what others think.

Migration doc

Because it's a breaking change we need a migration doc. I think it would make sense to have multiple parts for this doc. Would be nice to get feedback from @osg-grafana and other involved if you think this is viable.

  1. intro
    • why we're doing this change and who it applies to. This is basically anyone who is running multiple Mimir, Loki or Tempo (or their enterprise versions) in the same kubernetes cluster. More generally in any cluster where the pods may reuse each other's IPs after churning. Mention the risk of not doing the migration - clusters might merge and start sharding data.
    • An estimation of the effort and time it will take.
    • The release when we will introduce the breaking change.
    • The alternative - what happens if you do all of this in one go without a migration - the risk is that old members of the cluster will not have access to the hash ring. This will be crucial for the new ingesters who will experience a huge resharding. I think this should be brief
  2. More detailed explanation for how this migration solves this problem for folks that are interested to understand what is actually happening
  3. The actual migration
  4. How to verify that the migration has been successful

I think most of the sections are self-explanatory. I have notes on two of them

2. More detailed explanation for how this migration solves this problem for folks that are interested to understand what is actually happening

This is an excerpt from the internal docs for this migration. Might be good to include it in some form or another

The memberlist label filter discards all incoming memberlist messages and streams which the memberlist client receives which have a label that doesn’t match its own configured label. By default the configured label is “” and the label filter is enabled (the option to disable it is disabled).

3. The actual migration

The actual migration has three parts: I've taken them from this blog post and the internal procedure for doing the migration

  1. Explicitly disable the cluster label verification in memberlist - mimir.config.memberlist.cluster_label_verification_disabled: true. Roll out all the whole cluster.
  2. Add the memberlist cluster label to all components. mimir.config.memberlist.cluster_label: "$helm-release-name". Roll out the whole cluster.
  3. Change the verification to true and roll out the cluster again.

Helping with the migration

To help users with the migration we can do step 1. and include it in a helm release. The change is small and non-breaking. I think we should do it. This way the mandatory migration with the major release will be shorter for people that have a relatively recent helm chart version.

dimitarvdimitrov avatar Nov 08 '23 16:11 dimitarvdimitrov

SMEs were @dimitarvdimitrov and @krajorama; SMEs will be @lamida and @francoposa.

osg-grafana avatar Nov 10 '23 14:11 osg-grafana

I just finished the first draft of the documentation in #7961 @krajorama @dimitarvdimitrov .

However, I haven't put anything yet on stating step 1 will be part of our future Mimir helm release.

To help users with the migration we can do step 1. and include it in a helm release.

Do we still want to release a helm chart version to disable cluster label verification to remove one migration step if eventually in the future we will enable cluster label verification again by default in the chart?

lamida avatar Apr 29 '24 08:04 lamida

To help users with the migration we can do step 1. and include it in a helm release.

Do we still want to release a helm chart version to disable cluster label verification to remove one migration step if eventually in the future we will enable cluster label verification again by default in the chart?

I feel like we're past this point. If we have the migration doc already, we can switch on cluster label verification on in the next release.

dimitarvdimitrov avatar Apr 29 '24 15:04 dimitarvdimitrov

About the cluster label itself, I propose cluster_label to include the namespace and the helm release name, because technically you can install the same release name in different namespaces.

krajorama avatar Apr 30 '24 07:04 krajorama