
Helm: multi zone ingesters, store gateways and alertmanager

Open krajorama opened this issue 2 years ago • 15 comments

Is your feature request related to a problem? Please describe.

  1. It takes a long time to upgrade many ingesters one-by-one with the Mimir Helm chart.
  2. Mimir Helm chart does not support zone aware replication out of the box.

Describe the solution you'd like

Implement zone aware replication in Helm chart and make it the default (at least for ingesters, TBD for others). Implement upgrade path from previous stable version of the chart. Targeting major version change. Document the migration.

New: It is possible that a user does not have the resources to run double the number of ingesters, but cannot afford downtime either. We probably need to support single zone for a while, but we should deprecate the single-zone setting.

Solution proposal in helm-charts repo: https://github.com/grafana/helm-charts/pull/1205

Describe alternatives you've considered

N/A

Additional context

Prototype exists.

krajorama avatar Jun 03 '22 12:06 krajorama

alertmanager

FYI, Mimir jsonnet doesn't support alertmanager deployment in multi-zone. If the primary purpose of multi-zone is to speed up rollouts, then it's a non-issue for alertmanagers (you typically run a low number of replicas, even in very large clusters).

pracucci avatar Jun 03 '22 13:06 pracucci

Whoever picks this up, we mentioned that this may be a big enough issue that it might need two people. Please post in #wg-helm if you're looking for someone else to help out.

09jvilla avatar Jun 08 '22 16:06 09jvilla

Documenting the migration strategy for existing Helm users:

Preparation

Step 1

  • [ ] Set the zone for all ingesters to zone-default
  • [ ] Deploy 3 new StatefulSets with 0 replicas (see the sketch after this list).
    • [ ] Different zone settings for each
    • [ ] rollout-operator annotations
    • [ ] OnDelete strategy
    • [ ] affinity rule should prevent multiple ingesters of different zones on a single node
    • [ ] Multi-zone aware PodDisruptionBudget
  • [ ] Scale up series limits to accommodate duplicated series during scale up
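
To make the Step 1 items concrete, here is a minimal sketch of one of the three per-zone StatefulSets plus the multi-zone aware PodDisruptionBudget. Object names, label keys and the rollout-operator annotation are illustrative assumptions, not the final rendered chart output.

# Illustrative sketch only; repeated for zone-b and zone-c
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester-zone-a
  labels:
    rollout-group: ingester        # lets rollout-operator treat the three zones as one group
  annotations:
    rollout-max-unavailable: "7"   # how many pods rollout-operator may restart at once in this zone
spec:
  replicas: 0                      # start empty; scaled up in Step 2
  serviceName: ingester-zone-a
  updateStrategy:
    type: OnDelete                 # rollout-operator deletes pods instead of the StatefulSet controller
  selector:
    matchLabels:
      name: ingester-zone-a
  template:
    metadata:
      labels:
        name: ingester-zone-a
        rollout-group: ingester
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname   # no ingester of a different zone on the same node
              labelSelector:
                matchExpressions:
                  - { key: rollout-group, operator: In, values: [ingester] }
                  - { key: zone, operator: NotIn, values: [zone-a] }
      containers:
        - name: ingester
          image: grafana/mimir:2.0.0   # placeholder; the chart sets the real image and arguments
          args: ["-target=ingester", "-ingester.ring.instance-availability-zone=zone-a"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingester-rollout-pdb
spec:
  maxUnavailable: 1                # guards voluntary evictions (e.g. node drains) across all zones
  selector:
    matchLabels:
      rollout-group: ingester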

Scale up & Enable Multi-zone

Step 2

  • [ ] Progressively scale zone-aware ingesters up (~21 at a time across all zones, 7 per zone); see the values sketch after this list
  • [ ] Wait for TSDB head compaction and at least one hour after scale up
  • [ ] Scale up/out ruler & queriers to accommodate disabling shuffle sharding (TODO: not used in Helm by default)
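
A values sketch of one iteration of the progressive scale-up, assuming the chart exposes per-zone replica counts (the zoneAwareReplication keys below are hypothetical):

ingester:
  zoneAwareReplication:      # hypothetical values structure
    zones:
      - name: zone-a
        replicas: 7          # next iterations: 14, 21, ... until the target per-zone size
      - name: zone-b
        replicas: 7
      - name: zone-c
        replicas: 7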

Step 3

  • [ ] Disable ingester shuffle-sharding on the read path (TODO: not used in Helm by default)
  • [ ] TEST: rule evaluations are working

Step 4

  • [ ] Enable zone-aware replication on the write path (this will reshuffle many series, ~40%+ overhead); see the configuration sketch after this list
  • [ ] TEST: rule evaluations are working
  • [ ] TEST: observe series overhead and monitor for issues
  • [ ] Wait 12h (for query-store-after & shuffle sharding lookback)
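
For reference, zone-aware replication in Mimir is driven by the ingester ring options sketched below. Enabling it first only on write-path components (distributors, rulers) and only later on queriers is what separates Step 4 from Step 5; how the chart would scope the setting per component is still open, so treat this as a sketch.

# Mimir ring options involved; applying them to write-path components first is
# an assumption about how the chart would stage the rollout.
ingester:
  ring:
    zone_awareness_enabled: true          # spread each series' replicas across distinct zones
    instance_availability_zone: zone-a    # already set per zone StatefulSet in Step 1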

Cleanup

Step 5

  • [ ] Enable zone-aware replication on the read path
  • [ ] Exclude zone-default ingesters from distributors and rulers (see the sketch after this list)
  • [ ] TEST: zone-default ingesters should have no writes
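
The exclusion could be expressed with an excluded-zones option on the ingester ring configuration of distributors and rulers; the exact option name below is an assumption, so check the Mimir configuration reference.

# Applied to distributors and rulers only; the option name is an assumption.
ingester:
  ring:
    excluded_zones: zone-default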

Step 6

  • [ ] Perform a safe scale-down of the zone-default ingesters, calling the shutdown endpoint and unregistering them from the ring
  • [ ] Wait for TSDB compaction

Step 7

  • [ ] Cleanup default ingester statefulset, pvcs, etc.
  • [ ] Revert the limits increase
  • [ ] Wait 12h since zone-default exclusion

Step 8

  • [ ] Re-enable shuffle sharding on the read path (TODO: not used in Helm by default)

Logiraptor avatar Jun 24 '22 16:06 Logiraptor

I wonder if we should add a step for simplicity where we enable ingester.ring.unregister-on-shutdown and blocks-storage.tsdb.flush-blocks-on-shutdown for the zone-default ingesters. Would this let us skip the complicated scale-down process in Step 6?
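
In structured-config terms, the two CLI flags above map to the following Mimir YAML options (key paths follow Mimir's usual flag-to-YAML convention), which would be applied only to the zone-default StatefulSet:

# Only for the zone-default ingesters, never for the new multi-zone ones.
ingester:
  ring:
    unregister_on_shutdown: true        # leave the ring when the pod terminates
blocks_storage:
  tsdb:
    flush_blocks_on_shutdown: true      # ship the TSDB head to object storage before exiting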

Logiraptor avatar Jun 24 '22 18:06 Logiraptor

Proposed configuration for the migration so we have something to think about:

migration:
  multizone:
    step: 1 # This value would be incremented from 1-8 throughout the migration

Internally, we would need to set the relevant settings using conditions like

{{ if ge .Values.migration.multizone.step 1 }}
# Set configuration from step 1
{{ end }}
{{ if ge .Values.migration.multizone.step 2 }}
# Set configuration from step 2
{{ end }}
# etc
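
As a concrete example of such a gate, the temporary series-limit bump from Step 1 (reverted in Step 7) could be rendered only for the steps that need it; the limit value below is a placeholder:

{{ if and (ge .Values.migration.multizone.step 1) (lt .Values.migration.multizone.step 7) }}
# Steps 1-6: temporarily raise series limits to absorb duplicated series
limits:
  max_global_series_per_user: 300000   # placeholder; roughly double the normal limit
{{ end }}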

I propose that we follow this scheme for the 4.0.0 release, supporting single-zone deployments, multi-zone deployments, and migration from single-zone to multi-zone. Of course, this will need to be accompanied by some quality docs to guide users through the process.

Starting in release 5.0.0 we would drop support for single-zone and remove the migration logic.

Logiraptor avatar Jun 24 '22 18:06 Logiraptor

@09jvilla @krajorama @dimitarvdimitrov @ortuman Thoughts on the proposal here ^. I believe this is similar to what @krajorama has proposed in the past.

Logiraptor avatar Jun 24 '22 18:06 Logiraptor

Current state: @ryan-dyer-sp is going to move the PR from the helm-charts repo over here next week and we'll go from there.

Logiraptor avatar Jun 24 '22 19:06 Logiraptor

Documenting the migration strategy for existing Helm users:

I'm leaving few minor comments below, just to be on the same page.

Step 1: Deploy 3 new StatefulSets with 0 replicas.

Deploy a multi-zone aware PodDisruptionBudget too.

Step 2: Scale up/out ruler & queriers to accommodate disabling shuffle sharding (TODO: not used in Helm by default)

This shouldn't be done "n times", i.e. it is a one-off change, not something to repeat on every scale-up iteration.

Step 4: Enable zone-aware replication

Just to be clear: "Enable zone-aware replication on the write path".

I wonder if we should add a step for simplicity where we enable ingester.ring.unregister-on-shutdown and blocks-storage.tsdb.flush-blocks-on-shutdown for the zone-default ingesters. Would this let us skip the complicated scale-down process in Step 6?

The problem is that when you revert the change (which requires an ingester rollout), it will unregister ingesters and flush blocks for the multi-zone ingesters too. Unregistering from the ring will cause a sudden reshuffle of series. Also, if rollout-operator is already in place, removing multiple ingesters from the ring at the same time will add additional pressure to the remaining ingesters (potentially going OOM).

pracucci avatar Jun 28 '22 14:06 pracucci

@pracucci thanks!

The problem is that when you revert the change (which requires an ingester rollout), it will unregister ingesters and flush blocks for the multi-zone ingesters too.

I was intending to only enable those settings for one stateful set: the zone-default ingesters which are going away. I would not enable these settings on the new multi-zone ingesters. I'm not understanding how this is different than calling the shutdown endpoint and manually removing them from the ring.

Logiraptor avatar Jun 28 '22 14:06 Logiraptor

One possible configuration would be to use the node name as the zone name, and include topologySpreadConstraints in the helm chart to spread the pods as evenly as possible across all nodes.

This would allow running multiple pods on each node (fewer nodes than pods) and avoid the requirement of configuring a predetermined, fixed number of StatefulSets.
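
A sketch of that idea as a pod spec fragment (standard Kubernetes scheduling configuration; the ingester label is illustrative):

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname     # each node effectively acts as its own zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        name: ingester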

starleaffff avatar Jul 04 '22 05:07 starleaffff

The problem is that when you revert the change (which requires an ingester rollout), it will unregister ingesters and flush blocks for the multi-zone ingesters too.

I was intending to only enable those settings for one stateful set: the zone-default ingesters which are going away. I would not enable these settings on the new multi-zone ingesters. I'm not understanding how this is different than calling the shutdown endpoint and manually removing them from the ring.

Good point. We can just set it on the zone-default ingesters. Then I think your idea is good. I checked the code and I think there's no difference compared to calling the /shutdown endpoint manually.

pracucci avatar Jul 04 '22 07:07 pracucci

We should get topologySpreadConstraints going in the chart so that multi-zone is built on it already.

krajorama avatar Jul 29 '22 13:07 krajorama

~Blocked on engineer availability~

krajorama avatar Aug 16 '22 06:08 krajorama

I think we need a "quick" version of the migration path first, which just shuts down the ingesters in the default zone with a flush; in the next step you'd start the new ingesters. This would mean stopping the write path so that the distributors don't OOM. The simplest way I can imagine is to stop nginx/the gateway.
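
A values sketch of that quick path, with the entrypoint scaled to zero while the flush-enabled zone-default ingesters shut down (the nginx and ingester replica keys are assumptions about the chart's values layout):

nginx:
  replicas: 0        # stop the write path at the entrypoint (or the gateway, if that is used)
ingester:
  replicas: 0        # zone-default ingesters shut down, flushing blocks on shutdown

The new per-zone StatefulSets would then be scaled up in the following step before re-enabling the entrypoint.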

krajorama avatar Aug 25 '22 07:08 krajorama

Testing tracked in https://github.com/grafana/mimir-squad/issues/921

krajorama avatar Oct 14 '22 11:10 krajorama