RFC: balance `PRIMARY` hosts across cells
Feature Description
This request is for a "balancer" that ensures `PRIMARY` hosts are evenly balanced across certain/all cells.
Goals
The goal of this balancer would be to:
- Reduce the number of failovers necessary when:
  - A cell needs to be drained
  - A cell is partitioned
- Ensure all cells have an equal probability of cross-cell writes (averaged across many shards)
  - Hopefully 66% (with 3 cells) vs 100%
- Ensure new/un-draining cells receive an equal share of `PRIMARY`s
Ideas
From my perspective this kind of balancer could be asynchronous and best-effort, as the actions performed are simply optimisations and performing them too quickly could cause disruptions
To ensure the balancer causes the least impact possible, I suggest:
- It doesn't perform too many actions at once, to reduce client and topology impact
- It avoids shards that recently had a failover
  - Sorting by oldest `primary_term_start_time` from `topodata.proto` might achieve this
- Topology locks are used to prevent race conditions with `vtorc`/PRS/ERS

Using a good topology locking scheme, I feel this logic could safely live inside `vtorc` (or `vtctld` in theory) as a background goroutine using a ticker. A per-cell or per-keyspace/shard lock could be used depending on concurrency requirements.
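As a very rough illustration of that background loop (names like `shardState`, `listShards`, and `rebalance` are hypothetical placeholders, not existing Vitess APIs), a ticker-based sketch could look something like:

```go
package balancer

import (
	"context"
	"sort"
	"time"
)

// shardState is a hypothetical snapshot of one keyspace/shard; it is not an
// existing Vitess type.
type shardState struct {
	Keyspace, Shard      string
	PrimaryCell          string
	PrimaryTermStartTime time.Time
}

// runBalancer is a rough sketch of the ticker-driven, best-effort loop
// described above. listShards and rebalance are hypothetical placeholders
// for "read the topo" and "issue PRS under a per-keyspace/shard lock".
func runBalancer(
	ctx context.Context,
	interval, cooldown time.Duration,
	listShards func(context.Context) []shardState,
	rebalance func(context.Context, []shardState, int),
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			shards := listShards(ctx)

			// Skip shards that failed over recently, and act on the
			// longest-standing primaries first (oldest primary_term_start_time).
			var candidates []shardState
			for _, s := range shards {
				if time.Since(s.PrimaryTermStartTime) > cooldown {
					candidates = append(candidates, s)
				}
			}
			sort.Slice(candidates, func(i, j int) bool {
				return candidates[i].PrimaryTermStartTime.Before(candidates[j].PrimaryTermStartTime)
			})

			// Cap the number of moves per tick to reduce client and topo impact.
			rebalance(ctx, candidates, 5)
		}
	}
}
```

The real version would take the topology locks described above (per-cell or per-keyspace/shard) before `rebalance` issues any PRS.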
Use Case(s)
Consider a Vitess deployment with:
- 3 x cells with `PRIMARY`s randomly distributed in each
- Each cell having a dedicated local topology (eg: etcd, zookeeper, consul cluster)
- Each cell having equal client traffic
- A requirement to drain single cells periodically
- 100-1000s of shards
- `vtorc`/orchestrator is used for failover
Simplified diagram:
In this deployment, if 1 of the 3 cells is drained, `vtorc`/orchestrator will cause all `PRIMARY` nodes to be located in the 2 remaining cells. When the drained cell comes back online it will have no `PRIMARY` nodes, and writes from this cell will have a higher latency in aggregate.

Now let's say we need to drain the 2nd and 3rd cell (one at a time). When these cells are drained, about 1/2 (depending) of global `PRIMARY` nodes will require failover instead of roughly/hopefully 1/3rd if everything was balanced. This causes more work for the systems backing the topology in those unlucky cells, and also more client disruption to write traffic (although these writes may be buffered in `vtgate`).
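To put rough (hypothetical) numbers on that: with 300 shards balanced across 3 cells, each cell holds ~100 `PRIMARY`s, so draining any single cell triggers ~100 failovers (1/3rd of the total). After one cell is drained and nothing rebalances, the 2 remaining cells hold ~150 each, so draining either of them next triggers ~150 failovers (1/2 of the total), and the restored cell serves no local writes until something moves `PRIMARY`s back into it.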
A topology with `PRIMARY` nodes equally distributed should perform cell-wide failovers more efficiently while providing a more predictable rate of cross-cell writes.
cc @GuptaManan100
For some prior art, Kubernetes spent a long time getting to topology spread constraints. I think they started specifically with balancing in mind https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
More specifically, the original orchestrator has some rules around preferences and anti-flapping that might be relevant here - I'm not sure how much of this remains after turning into vt-orc https://github.com/openark/orchestrator/blob/master/docs/topology-recovery.md#adding-promotion-rules
For our use cases, we have certain keyspaces that are more likely to interact and have materialize workflows running. Most of our reads are primary only, and keeping related primaries in the same cell reduces latency and cross-zone/region egress costs. At the end of the day, this also serves to spread our primaries across cells, but in a very specific manner. Our manually applied preferences look something like:
- prefer to have keyspaces A and B in cell 1
- prefer to have keyspaces C and D in cell 2
If I put on my wishlist hat, having some rules available in vt-orc unlocks some other interesting possibilities. One specific case I have in mind is Cockroach's Follow the Workload. Yes, we can do multi-region sharding for some use cases, but other global keyspaces could benefit from reparenting the primary from the US to Europe to Asia, etc., depending on access patterns, which can significantly reduce write latency.
Overall I think this is a great feature request, and my comment is just meant to add some hopefully helpful context.
Highly interested in this. We already use Kubernetes topology spread constraints, but they lack primary/replica awareness.
We are using some custom scripts to alert us if a rebalance is needed, based on hostname more than on cells.
We have too few cells, and need balancing across nodes/hosts that can each run multiple vttablets.
For example, node A has 2 vttablets, the keyspace/-80 primary and the keyspace/80- replica, while node B has the keyspace/-80 replica and the keyspace/80- primary...
I think this would be a great feature to have. I gave some thought to implementing this in VTOrc, and I have a few concerns in that regard. The way we advertise using VTOrc now is to run as many instances of it as you want, each looking only at a subset of your topology. This feature could run into trouble in this kind of architecture, specifically because no VTOrc instance sees the complete topology!
For example, a very common case would be to have let's say 2 cells, and 3 shards (Let's call them A, B and C). If the user intends to run 3 VTOrc instances, then the best way to run them is to have each VTOrc monitor 2 clusters i.e. the clusters each VTOrc monitors are like (A,B), (B,C) and (C,A). This way even if one VTOrc instance fails, there is another monitoring the cluster.
In this situation let's say we implemented the logic to distribute the workload across cells in VTOrc, we would run into the problem of constant reparents. Let me explain.
Let's say `primary-A` is in `cell-1`, and `primary-B` and `primary-C` are in `cell-2`. It appears to us (the users, who see the full topology) that this situation is perfectly balanced, but the VTOrc instance monitoring (B,C) would see that both of the primaries it manages are in the same cell even though there are 2 cells, so it would conclude they aren't distributed and would go ahead and reparent one of them. Let's say it reparented `primary-B` to now reside in `cell-1`. But now the first VTOrc instance, monitoring (A,B), will see both of its primaries in the same cell and reparent one of them again! This cycle will keep on going.
This kind of cyclic reparenting will be very hard to track in each individual VTOrc instance. The underlying problem in all of this is that VTOrc doesn't see the full topology, but has its scope restricted to only the clusters it is monitoring.
The next logical question is why not have all VTOrc instances manage all the shards. This isn't feasible: the memory consumption of VTOrc, the computation of the problems it needs to solve, and the network calls it makes all increase linearly with the number of clusters it manages. So this way of implementing the balancing in VTOrc won't scale very well.
WDYT? @deepthi @timvaillancourt ☝️
I think we can add this as a command to `vtctldclient` to rebalance the cluster when the command is run. We might also go one step further and add a loop that runs continuously in `vtctld`.
> The way we advertise using VTOrc now is to run as many instances of it as you want, each looking only at a subset of your topology. This feature could run into trouble in this kind of architecture, specifically because no VTOrc instance sees the complete topology!
@GuptaManan100 👋 I hadn't thought about this limitation, thanks for calling it out
> The next logical question is why not have all VTOrc instances manage all the shards. This isn't feasible: the memory consumption of VTOrc, the computation of the problems it needs to solve, and the network calls it makes all increase linearly with the number of clusters it manages. So this way of implementing the balancing in VTOrc won't scale very well.
This was the approach I had in mind. The context on resource usage impact is great to know 👍
> I think we can add this as a command to `vtctldclient` to rebalance the cluster when the command is run. We might also go one step further and add a loop that runs continuously in `vtctld`.
I wouldn't be opposed to this logic living in `vtctld`, however I'm a little concerned about resource usage on `vtctld`s as well, as there is usually a fixed number of nodes that are already running a bit warm at times due to splits/vdiffs. As an example, a cell drain could cause a need for roughly 1200+ reparents (growing) in our deployment today.
Flipping back to `vtorc`, while not guaranteed, let's assume that a deployment running `vtorc` has 100% coverage of all keyspaces/shards across all `vtorc`s. Are there options to do this balancing in a distributed fashion using each `vtorc` 🤔? As an example, if a `vtorc` watches entire keyspaces it could:

- Hold a keyspace-level lock
- Find out how many `PRIMARY`s are in every cell
- Perform PRS/ERS commands to create balance

Of course, several `vtorc`s concurrently balancing without much coordination could create some inaccuracies, but maybe some degree of imbalance is acceptable, or entropy would eventually stop reparents. Perhaps being +/- 5% balanced is good enough. Also, balancing the `PRIMARY`s with the longest term first should avoid single shards flapping. For `vtorc`s that listen to specific shards this approach would be much less effective, if at all, and I don't have a good solution for that.
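To make that per-keyspace pass a bit more concrete, here is a minimal sketch of just the counting/selection step; the `pickMoves` name and its inputs are hypothetical, and taking the keyspace-level lock plus issuing the actual PRS/ERS calls are deliberately left out:

```go
package balancer

// pickMoves is a hypothetical helper: given the cell each shard's PRIMARY
// currently lives in and the set of usable cells, it proposes which shards
// to reparent (shard -> target cell) so that no cell holds more than its
// fair share of primaries.
func pickMoves(shardPrimaryCells map[string]string, cells []string) map[string]string {
	// Count primaries per cell.
	counts := make(map[string]int, len(cells))
	for _, c := range cells {
		counts[c] = 0
	}
	for _, c := range shardPrimaryCells {
		counts[c]++
	}

	// Each cell should hold at most ceil(shards / cells) primaries.
	fair := (len(shardPrimaryCells) + len(cells) - 1) / len(cells)

	moves := make(map[string]string)
	for shard, cell := range shardPrimaryCells {
		if counts[cell] <= fair {
			continue // this cell is not over-represented
		}
		// Move this primary to the cell with the fewest primaries.
		target := ""
		for _, c := range cells {
			if target == "" || counts[c] < counts[target] {
				target = c
			}
		}
		if target != cell && counts[target] < fair {
			moves[shard] = target
			counts[cell]--
			counts[target]++
		}
	}
	return moves
}
```

For example, `pickMoves(map[string]string{"commerce/-80": "cell2", "commerce/80-": "cell2"}, []string{"cell1", "cell2", "cell3"})` would propose moving one of the two primaries out of `cell2`.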
@GuptaManan100 / @deepthi your thoughts are appreciated, thanks for the discussion 🙇
@timvaillancourt Even with some amount of tolerance in VTOrc, I fear flapping will be unavoidable, because whatever percentage of imbalance you set, it would probably be possible to create a configuration of VTOrcs managing subsets such that one of them always reports an imbalance wherever the primaries live.
If we are going to be balancing the primaries at a per-keyspace level, I still think it is better for this to live in `vtctld` as a command that can be called. As far as usage on `vtctld` goes, I think we can just ask the users to provision another `vtctld` just for running this command (if resources ever become an issue).
Also, I don't see too much value in VTOrc checking for imbalance regularly. Let's say the user has run the vtctld command/manually reparented the primaries such that the servers are balanced across cells. Any imbalance in the cluster would probably not be introduced immediately. A few reparents will be required for that to happen. In general this would take time and it wouldn't be worthwhile for VTOrc to keep checking for such an imbalance.
Another idea that I had while typing out this response is that we could potentially just add a flag to PRS and ERS called `--prefer-balanced-cells` (or whatever else we call it), which basically makes them prioritise selecting a primary from a cell such that the configuration remains balanced. What this ensures is that from the point the cluster comes up (no primaries anywhere), each reparent itself makes sure to keep the balance, so we wouldn't need any polling whatsoever. It would then be up to the users whether they want to run with this balancing or not. This also removes the possibility of extra reparents. We wouldn't need polling in VTOrc (or anywhere else) either, and resources would not be that much of an issue, because we would just be accessing more shard records in a reparent operation. This will, however, potentially reduce the concurrency of a reparent operation from per-shard to per-keyspace level (behind the flag, that is).
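Purely to illustrate the idea (this is not the existing reparent code, and `candidate`/`preferBalancedCell` are made-up names), such a flag might boil down to a tie-breaker applied after the usual eligibility checks:

```go
package reparent

// candidate is a hypothetical, trimmed-down view of a reparent candidate.
type candidate struct {
	Alias string // tablet alias
	Cell  string // cell the tablet lives in
}

// preferBalancedCell sketches what a --prefer-balanced-cells style flag
// could do during new-primary selection: among tablets that are already
// eligible by the usual rules (replication position, durability policy,
// etc.), prefer one whose cell currently holds the fewest primaries in
// this keyspace. Assumes eligible is non-empty.
func preferBalancedCell(eligible []candidate, primariesPerCell map[string]int) candidate {
	best := eligible[0]
	for _, c := range eligible[1:] {
		if primariesPerCell[c.Cell] < primariesPerCell[best.Cell] {
			best = c
		}
	}
	return best
}
```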
What do you think of this suggestion?
> If we are going to be balancing the primaries at a per-keyspace level, I still think it is better for this to live in `vtctld` as a command that can be called. As far as usage on `vtctld` goes, I think we can just ask the users to provision another `vtctld` just for running this command (if resources ever become an issue).
@GuptaManan100 thanks for the feedback, I think the `vtctld` approach you mentioned will work well. At some point it might make sense to make this a scheduled task (out of scope), but `crond` could be used to begin with 👍
> Another idea that I had while typing out this response is that we could potentially just add a flag to PRS and ERS called `--prefer-balanced-cells` (or whatever else we call it), which basically makes them prioritise selecting a primary from a cell such that the configuration remains balanced. What this ensures is that from the point the cluster comes up (no primaries anywhere), each reparent itself makes sure to keep the balance, so we wouldn't need any polling whatsoever
I like this idea a lot 👍. I'd prefer to separate this into a new RFC/PR as an "optimisation" to follow this one. The reason being this won't resolve problems with a cell that was partitioned or drained and later comes back online, which this RFC hoped to address. In that scenario no reparents would need to occur, and thus no balancing.
Without this "prefer balanced" in PRS it will make some reparents that will later need to be "rebalanced", which I think it acceptable until that optimisation is added
What do you think about this approach @GuptaManan100? 🙇
Sure, yes that sounds good. Adding a vtctld command as the first step is perfectly reasonable 👍!
I missed this RFC when it first came up. Very interesting discussion. The one thing I will add is that for those running with vitess-operator, that would be a natural place to implement any rebalancing, because it in fact has a global view of the full vitess cluster.