mimir
mimir copied to clipboard
Docs: Enable cluster label verification in Helm
Describe the bug
Two separate memberlist gossip rings may join into one when IP addresses are reused between the two due to for example a kubernetes migration restarting a lot of PODs quickly.
To Reproduce
N/A incident RCA exists
Expected behavior
Gossip rings do not interact
Environment
- Infrastructure: Kubernetes
- Deployment tool: Helm
Additional Context
https://github.com/grafana/mimir/pull/2349
Requires a three step migration, so major version bump and migration document needed.
This looks like a docs issue to me. The migration is possible right now, helm doesn't prevent it. We can document how to do the migration and then enable this by default and do a major release.
This looks like a docs issue to me. The migration is possible right now, helm doesn't prevent it. We can document how to do the migration and then enable this by default and do a major release.
Yes, I agree, I was thinking about how to write a migration path, but it's three steps, not super complicated. And the user can keep backwards compatibility with one setting in mimir.structuredConfig.memberlist.cluster_label_verification_disabled: true which is also the first step to do a migration.
Please link to this from the (yet to be completed) "Running Helm in production" document
@lamida said he's interested in picking this up, so I'm posting my notes to help pick this up
Breaking change
The reason for doing this migration is because making all changes in one go is a breaking change. The agreement with @krajorama is that we want to eventually have cluster label verification enabled by default in the chart.
We have a deprecation policy of 2 major releases. I think it might make sense to follow the same period for this breaking change. Otherwise, a period of 6 months to a year sounds viable for this. I'm curious to hear what others think.
Migration doc
Because it's a breaking change we need a migration doc. I think it would make sense to have multiple parts for this doc. Would be nice to get feedback from @osg-grafana and other involved if you think this is viable.
- intro
- why we're doing this change and who it applies to. This is basically anyone who is running multiple Mimir, Loki or Tempo (or their enterprise versions) in the same kubernetes cluster. More generally in any cluster where the pods may reuse each other's IPs after churning. Mention the risk of not doing the migration - clusters might merge and start sharding data.
- An estimation of the effort and time it will take.
- The release when we will introduce the breaking change.
- The alternative - what happens if you do all of this in one go without a migration - the risk is that old members of the cluster will not have access to the hash ring. This will be crucial for the new ingesters who will experience a huge resharding. I think this should be brief
- More detailed explanation for how this migration solves this problem for folks that are interested to understand what is actually happening
- The actual migration
- How to verify that the migration has been successful
I think most of the sections are self-explanatory. I have notes on two of them
2. More detailed explanation for how this migration solves this problem for folks that are interested to understand what is actually happening
This is an excerpt from the internal docs for this migration. Might be good to include it in some form or another
The memberlist label filter discards all incoming memberlist messages and streams which the memberlist client receives which have a label that doesn’t match its own configured label. By default the configured label is “” and the label filter is enabled (the option to disable it is disabled).
3. The actual migration
The actual migration has three parts: I've taken them from this blog post and the internal procedure for doing the migration
- Explicitly disable the cluster label verification in memberlist -
mimir.config.memberlist.cluster_label_verification_disabled: true. Roll out all the whole cluster. - Add the memberlist cluster label to all components.
mimir.config.memberlist.cluster_label: "$helm-release-name". Roll out the whole cluster. - Change the verification to
trueand roll out the cluster again.
Helping with the migration
To help users with the migration we can do step 1. and include it in a helm release. The change is small and non-breaking. I think we should do it. This way the mandatory migration with the major release will be shorter for people that have a relatively recent helm chart version.
SMEs were @dimitarvdimitrov and @krajorama; SMEs will be @lamida and @francoposa.
I just finished the first draft of the documentation in #7961 @krajorama @dimitarvdimitrov .
However, I haven't put anything yet on stating step 1 will be part of our future Mimir helm release.
To help users with the migration we can do step 1. and include it in a helm release.
Do we still want to release a helm chart version to disable cluster label verification to remove one migration step if eventually in the future we will enable cluster label verification again by default in the chart?
To help users with the migration we can do step 1. and include it in a helm release.
Do we still want to release a helm chart version to disable cluster label verification to remove one migration step if eventually in the future we will enable cluster label verification again by default in the chart?
I feel like we're past this point. If we have the migration doc already, we can switch on cluster label verification on in the next release.
About the cluster label itself, I propose cluster_label to include the namespace and the helm release name, because technically you can install the same release name in different namespaces.