Helm: multi-zone ingesters, store gateways and alertmanager
Is your feature request related to a problem? Please describe.
- It takes a long time to upgrade many ingesters one-by-one with the Mimir Helm chart.
- Mimir Helm chart does not support zone aware replication out of the box.
Describe the solution you'd like
Implement zone aware replication in Helm chart and make it the default (at least for ingesters, TBD for others). Implement upgrade path from previous stable version of the chart. Targeting major version change. Document the migration.
new! It is possible that users do not have the resources to run double the number of ingesters, but they cannot accept downtime either. We probably need to support single-zone for a while, but we should deprecate the single-zone setting.
Solution proposal in helm-charts repo: https://github.com/grafana/helm-charts/pull/1205
Describe alternatives you've considered
N/A
Additional context
Prototype exists.
alertmanager
FYI, Mimir jsonnet doesn't support alertmanager deployment in multi-zone. If the primary purpose of multi-zone is to speed up rollouts, then it's a non-issue for alertmanagers (you typically run a low number of replicas, even in very large clusters).
Whoever picks this up, we mentioned that this may be a big enough issue that it might need two people. Please post in #wg-helm if you're looking for someone else to help out.
Documenting the migration strategy for existing Helm users:
Preparation
Step 1
- [ ] Set the zone for all ingesters to `zone-default`
- [ ] Deploy 3 new StatefulSets with 0 replicas.
- [ ] Different zone settings for each
- [ ] rollout-operator annotations
- [ ] OnDelete strategy
- [ ] affinity rule should prevent multiple ingesters of different zones on a single node
- [ ] Multi-zone aware PodDisruptionBudget
- [ ] Scale up series limits to accommodate duplicated series during scale up
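A hedged sketch of what one of the three Step 1 StatefulSets could look like (zone-a shown; names, labels, and values are illustrative, not the chart's final API — the `rollout-group` label and `rollout-max-unavailable` annotation are the ones the grafana/rollout-operator watches):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mimir-ingester-zone-a
  labels:
    rollout-group: ingester          # rollout-operator coordinates StatefulSets sharing this label
  annotations:
    rollout-max-unavailable: "7"     # max pods the operator restarts at once in this zone
spec:
  replicas: 0                        # deployed empty; scaled up in Step 2
  serviceName: mimir-ingester-zone-a
  updateStrategy:
    type: OnDelete                   # the rollout-operator deletes pods itself
  selector:
    matchLabels:
      app.kubernetes.io/component: ingester
      zone: zone-a
  template:
    metadata:
      labels:
        app.kubernetes.io/component: ingester
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          # keep ingesters of *other* zones off this node
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values: [ingester]
                  - key: zone
                    operator: NotIn
                    values: [zone-a]
      containers:
        - name: ingester
          image: grafana/mimir:2.0.0   # image/args elided; illustrative only
```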
Scale up & Enable Multi-zone
Step 2
- [ ] Progressively scale zone-aware ingesters up (~21 at a time across all zones, 7 per zone)
- [ ] Wait for TSDB head compaction and at least one hour after scale-up
- [ ] Scale up/out ruler & queriers to accommodate disabling shuffle sharding (TODO: not used in Helm by default)
Step 3
- [ ] Disable ingester shuffle-sharding on the read path (TODO: not used in Helm by default)
- [ ] TEST: rule evaluations are working
Step 4
- [ ] Enable zone-aware replication on the write path (this will reshuffle many series; expect ~40%+ series overhead)
- [ ] TEST: rule evaluations are working
- [ ] TEST: observe series overhead and monitor for issues
- [ ] Wait 12h (for query-store-after & shuffle sharding lookback)
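In Mimir configuration terms, Step 4 maps roughly to the following (a sketch of the relevant Mimir options; how the chart wires them per StatefulSet is TBD):

```yaml
ingester:
  ring:
    zone_awareness_enabled: true         # distributors replicate each series across zones
    instance_availability_zone: zone-a   # set per zone StatefulSet (zone-a/b/c)
```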
Cleanup
Step 5
- [ ] Enable zone-aware replication on the read path
- [ ] Exclude zone-default ingesters from distributors and rulers
- [ ] TEST: zone-default ingesters should have no writes
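Excluding the zone-default ingesters could use the `-distributor.excluded-zones` flag, as the jsonnet multi-zone migration does; an illustrative args fragment for distributors and rulers:

```yaml
args:
  - "-distributor.excluded-zones=zone-default"   # stop routing writes/reads to the old ingesters
```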
Step 6
- [ ] Perform a safe scale down of zone-default ingesters, calling shutdown endpoint and unregistering from the ring
- [ ] Wait for TSDB compaction
Step 7
- [ ] Cleanup default ingester statefulset, pvcs, etc.
- [ ] Revert the limits increase
- [ ] Wait 12h since zone-default exclusion
Step 8
- [ ] Re-enable shuffle sharding on the read path (TODO: not used in Helm by default)
I wonder if we should add a step for simplicity where we enable `ingester.ring.unregister-on-shutdown` and `blocks-storage.tsdb.flush-blocks-on-shutdown` for the zone-default ingesters. Would this let us skip the complicated scale-down process in Step 6?
Proposed configuration for the migration so we have something to think about:
```yaml
migration:
  multizone:
    step: 1 # This value would be incremented from 1-8 throughout the migration
```
Internally, we would need to set the relevant settings using conditions like:

```
{{ if ge .Values.migration.multizone.step 1 }}
# Set configuration from step 1
{{ end }}
{{ if ge .Values.migration.multizone.step 2 }}
# Set configuration from step 2
{{ end }}
# etc
```
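As a concrete, purely illustrative instance of that gating, wrapping one of the Step 4 settings:

```
{{- if ge (int .Values.migration.multizone.step) 4 }}
ingester:
  ring:
    zone_awareness_enabled: true
{{- end }}
```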
I propose that we follow this scheme for the 4.0.0 release, supporting single-zone deployments, multi-zone deployments, and migration from single-zone to multi-zone. Of course, this will need to be accompanied by some quality docs to guide users through the process.
Starting in release 5.0.0 we would drop support for single-zone and remove the migration logic.
@09jvilla @krajorama @dimitarvdimitrov @ortuman Thoughts on the proposal here ^. I believe this is similar to what @krajorama has proposed in the past.
Current state: @ryan-dyer-sp is going to move the PR from the helm-charts repo over here next week and we'll go from there.
> Documenting the migration strategy for existing Helm users:

I'm leaving a few minor comments below, just so we're on the same page.
> Step 1: Deploy 3 new StatefulSets with 0 replicas.
Deploy a multi-zone aware PodDisruptionBudget too.
> Step 2: Scale up/out ruler & queriers to accommodate disabling shuffle sharding (TODO: not used in Helm by default)
This shouldn't be done "n times".
> Step 4: Enable zone-aware replication
Just to be clear: "Enable zone-aware replication on the write path".
> I wonder if we should add a step for simplicity where we enable `ingester.ring.unregister-on-shutdown` and `blocks-storage.tsdb.flush-blocks-on-shutdown` for the zone-default ingesters. Would this let us skip the complicated scale-down process in Step 6?
The problem is that when you revert the change (which requires an ingester rollout), it will unregister ingesters and flush blocks for multi-zone ingesters too. Unregistering from the ring will cause a sudden reshuffle of series. Also, if the rollout-operator is already in place, removing multiple ingesters from the ring at the same time will add additional pressure on the remaining ingesters (potentially causing OOMs).
@pracucci thanks!
> The problem is that when you'll revert the change back (which requires an ingesters rollout), it will unregister ingesters and flush blocks for multi-zone ingesters too.
I was intending to only enable those settings for one stateful set: the zone-default ingesters which are going away. I would not enable these settings on the new multi-zone ingesters. I'm not understanding how this is different than calling the shutdown endpoint and manually removing them from the ring.
One possible configuration would be to use the node name as the zone name, and include `topologySpreadConstraints` in the Helm chart to spread the pods as evenly as possible across all nodes. This allows running multiple pods on each node (fewer nodes than pods) and also avoids the requirement of configuring a predetermined, fixed number of StatefulSets.
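A sketch of that pod-template fragment (standard Kubernetes `topologySpreadConstraints`; the selector labels are illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread across nodes rather than cloud zones
    whenUnsatisfiable: ScheduleAnyway     # prefer even spread, but still schedule if impossible
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: ingester
```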
> The problem is that when you'll revert the change back (which requires an ingesters rollout), it will unregister ingesters and flush blocks for multi-zone ingesters too.

> I was intending to only enable those settings for one stateful set: the zone-default ingesters which are going away. I would not enable these settings on the new multi-zone ingesters. I'm not understanding how this is different than calling the shutdown endpoint and manually removing them from the ring.
Good point. We can just set it on the zone-default ingesters. Then I think your idea is good. I checked the code and I think there's no difference compared to calling the `/shutdown` endpoint manually.
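Applied only to the zone-default StatefulSet, that would look roughly like (Mimir config sketch):

```yaml
ingester:
  ring:
    unregister_on_shutdown: true    # leave the ring when the pod stops
blocks_storage:
  tsdb:
    flush_blocks_on_shutdown: true  # upload the TSDB head as a block before exiting
```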
We should get `topologySpreadConstraints` going in the chart so that multi-zone is built on it already.
~Blocked on engineer availability~
I think we need a "quick" version of the migration path first: shut down the ingesters in the default zone with a flush, and in the next step start the new ingesters. This would mean stopping the write path so the distributors don't OOM. The simplest way I can imagine is to stop nginx/the gateway.
Testing tracked in https://github.com/grafana/mimir-squad/issues/921