RFC: balance `PRIMARY` hosts across cells
Feature Description
This request is for a "balancer" that ensures `PRIMARY` hosts are evenly balanced across certain/all cells.
Goals
The goal of this balancer would be to:
- Reduce the number of failovers necessary when:
  - A cell needs to be drained
  - A cell is partitioned
- Ensure all cells have an equal probability of cross-cell writes (averaged across many shards)
  - Hopefully 66% (with 3 cells) vs 100%
- Ensure new/un-draining cells receive an equal share of `PRIMARY`s
Ideas
From my perspective this kind of balancer could be asynchronous and best-effort, as the actions performed are simply optimisations and performing them too quickly could cause disruptions
To ensure the balancer causes the least impact possible, I suggest:
- It doesn't perform too many actions at once, to reduce client and topology impact
- It avoids shards that recently had a failover
  - Sorting by oldest `primary_term_start_time` from `topodata.proto` might achieve this
- Topology locks are used to prevent race conditions with `vtorc`/PRS/ERS

Using a good topology locking scheme, I feel this logic could safely live inside `vtorc` (or `vtctld` in theory) as a background goroutine using a ticker. A per-cell or per-keyspace/shard lock could be used depending on concurrency requirements.
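As a very rough illustration of that background loop (names like `shardState`, `listShards`, and `rebalance` are hypothetical placeholders, not existing Vitess APIs), a ticker-based sketch could look something like:

```go
package balancer

import (
	"context"
	"sort"
	"time"
)

// shardState is a hypothetical snapshot of one keyspace/shard; it is not an
// existing Vitess type.
type shardState struct {
	Keyspace, Shard      string
	PrimaryCell          string
	PrimaryTermStartTime time.Time
}

// runBalancer is a rough sketch of the ticker-driven, best-effort loop
// described above. listShards and rebalance are hypothetical placeholders
// for "read the topo" and "issue PRS under a per-keyspace/shard lock".
func runBalancer(
	ctx context.Context,
	interval, cooldown time.Duration,
	listShards func(context.Context) []shardState,
	rebalance func(context.Context, []shardState, int),
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			shards := listShards(ctx)

			// Skip shards that failed over recently, and act on the
			// longest-standing primaries first (oldest primary_term_start_time).
			var candidates []shardState
			for _, s := range shards {
				if time.Since(s.PrimaryTermStartTime) > cooldown {
					candidates = append(candidates, s)
				}
			}
			sort.Slice(candidates, func(i, j int) bool {
				return candidates[i].PrimaryTermStartTime.Before(candidates[j].PrimaryTermStartTime)
			})

			// Cap the number of moves per tick to reduce client and topo impact.
			rebalance(ctx, candidates, 5)
		}
	}
}
```

The real version would take the topology locks described above (per-cell or per-keyspace/shard) before `rebalance` issues any PRS.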
Use Case(s)
Consider a Vitess deployment with:
- 3 x cells with `PRIMARY`s randomly distributed in each
- Each cell having a dedicated local topology (eg: etcd, zookeeper, consul cluster)
- Each cell having equal client traffic
- A requirement to drain single cells periodically
- 100-1000s of shards
- `vtorc`/orchestrator is used for failover
Simplified diagram:
In this deployment, if 1 of the 3 cells is drained, `vtorc`/orchestrator will cause all `PRIMARY` nodes to be located in the 2 remaining cells. When the drained cell comes back online it will have no `PRIMARY` nodes, and writes from this cell will have a higher latency in aggregate.

Now let's say we need to drain the 2nd and 3rd cell (one at a time). When these cells are drained, about 1/2 (depending) of global `PRIMARY` nodes will require failover instead of roughly/hopefully 1/3rd if everything was balanced. This causes more work for the systems backing the topology in those unlucky cells, and also more client disruption to write traffic (although these writes may be buffered in `vtgate`).
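To put rough (hypothetical) numbers on that: with 300 shards balanced across 3 cells, each cell holds ~100 `PRIMARY`s, so draining any single cell triggers ~100 failovers (1/3rd of the total). After one cell is drained and nothing rebalances, the 2 remaining cells hold ~150 each, so draining either of them next triggers ~150 failovers (1/2 of the total), and the restored cell serves no local writes until something moves `PRIMARY`s back into it.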
A topology with `PRIMARY` nodes equally distributed should perform cell-wide failovers more efficiently while providing a more predictable rate of cross-cell writes.
cc @GuptaManan100
For some prior art, Kubernetes spent a long time getting to topology spread constraints. I think they started specifically with balancing in mind https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
More specifically, the original orchestrator has some rules around preferences and anti-flapping that might be relevant here - I'm not sure how much of this remains after turning into vt-orc https://github.com/openark/orchestrator/blob/master/docs/topology-recovery.md#adding-promotion-rules
For our use cases, we have certain keyspaces that are more likely to interact and have materialize workflows running. Most of our reads are primary only, and keeping related primaries in the same cell reduces latency and cross-zone/region egress costs. At the end of the day, this also serves to spread our primaries across cells, but in a very specific manner. Our manually applied preferences look something like:
- prefer to have keyspaces A and B in cell 1
- prefer to have keyspaces C and D in cell 2
If I put on my wishlist hat, having some rules available in vt-orc unlocks some other interesting possibilities. One specific case I have in mind is Cockroach's Follow the Workload. Yes, we can do multi-region sharding for some use cases, but other global keyspaces could benefit from reparenting the primary from the US to Europe to Asia, etc., depending on access patterns, which can significantly reduce write latency.
Overall I think this is a great feature request, and my comment is just meant to add some hopefully helpful context.
Highly interested in this. We already use Kubernetes topology spread constraints, but they lack primary/replica awareness.
We are using some custom scripts to alert us if a rebalance is needed, based on hostname more than on cells.
We have too few cells, and need balancing across nodes/hosts that can each run multiple vttablets.
For example, node A has 2 vttablets, the keyspace/-80 primary and the keyspace/80- replica, while node B has the keyspace/-80 replica and the keyspace/80- primary...
I think this would be a great feature to have. I gave some thought to implementing this in VTOrc, and I have a few concerns in that regard. The way we advertise using VTOrc now is to run as many instances of it as you want, each looking only at a subset of your topology. This feature could run into trouble in this kind of architecture, specifically because no VTOrc instance sees the complete topology!
For example, a very common case would be to have let's say 2 cells, and 3 shards (Let's call them A, B and C). If the user intends to run 3 VTOrc instances, then the best way to run them is to have each VTOrc monitor 2 clusters i.e. the clusters each VTOrc monitors are like (A,B), (B,C) and (C,A). This way even if one VTOrc instance fails, there is another monitoring the cluster.
In this situation let's say we implemented the logic to distribute the workload across cells in VTOrc, we would run into the problem of constant reparents. Let me explain.
Let's say `primary-A` is in `cell-1`, and `primary-B` and `primary-C` are in `cell-2`. It appears to us (the users, who see the full topology) that this situation is perfectly balanced, but the VTOrc instance monitoring (B,C) would see that both of the primaries it manages are in the same cell even though there are 2 cells, so it would conclude they aren't distributed and would go ahead and reparent one of them. Let's say it reparented `primary-B` to now reside in `cell-1`. But now the first VTOrc instance, monitoring (A,B), will see both of its primaries in the same cell and reparent one of them again! This cycle will keep on going.
This kind of cyclic reparenting will be very hard to track in each individual VTOrc instance. The underlying problem in all of this is that VTOrc doesn't see the full topology, but has its scope restricted to only the clusters it is monitoring.
The next logical question is why not have all VTOrc instances manage all the shards. This isn't feasible: the memory consumption of VTOrc, the computation of the problems it needs to solve, and the network calls it makes all increase linearly with the number of clusters it manages. So this way of implementing the balancing in VTOrc won't scale very well.
WDYT? @deepthi @timvaillancourt ☝️
I think we can add this as a command to `vtctldclient` to rebalance the cluster when the command is run. We might also go one step further and add a loop that runs continuously in `vtctld`.
> The way we advertise using VTOrc now is to run as many instances of it as you want, each looking only at a subset of your topology. This feature could run into trouble in this kind of architecture, specifically because no VTOrc instance sees the complete topology!
@GuptaManan100 👋 I hadn't thought about this limitation, thanks for calling it out
> The next logical question is why not have all VTOrc instances manage all the shards. This isn't feasible: the memory consumption of VTOrc, the computation of the problems it needs to solve, and the network calls it makes all increase linearly with the number of clusters it manages. So this way of implementing the balancing in VTOrc won't scale very well.
This was the approach I had in mind. The context on resource usage impact is great to know 👍
> I think we can add this as a command to `vtctldclient` to rebalance the cluster when the command is run. We might also go one step further and add a loop that runs continuously in `vtctld`.
I wouldn't be opposed to this logic living in `vtctld`, however I'm a little concerned about resource usage on `vtctld`s as well, as there is usually a fixed number of nodes that are already running a bit warm at times due to splits/vdiffs. As an example, a cell drain could cause a need for roughly 1200+ reparents (growing) in our deployment today.
Flipping back to `vtorc`, while not guaranteed, let's assume that a deployment running `vtorc` has 100% coverage of all keyspaces/shards across all `vtorc`s. Are there options to do this balancing in a distributed fashion using each `vtorc` 🤔? As an example, if a `vtorc` watches entire keyspaces it could:

- Hold a keyspace-level lock
- Find out how many `PRIMARY`s are in every cell
- Perform PRS/ERS commands to create balance

Of course, several `vtorc`s concurrently balancing without much coordination could create some inaccuracies, but maybe some degree of imbalance is acceptable, or entropy would eventually stop reparents. Perhaps being +/- 5% balanced is good enough. Also, balancing the `PRIMARY`s with the longest term first should avoid single shards flapping. For `vtorc`s that listen to specific shards this approach would be much less effective, if at all, and I don't have a good solution for that.
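To make that per-keyspace pass a bit more concrete, here is a minimal sketch of just the counting/selection step; the `pickMoves` name and its inputs are hypothetical, and taking the keyspace-level lock plus issuing the actual PRS/ERS calls are deliberately left out:

```go
package balancer

// pickMoves is a hypothetical helper: given the cell each shard's PRIMARY
// currently lives in and the set of usable cells, it proposes which shards
// to reparent (shard -> target cell) so that no cell holds more than its
// fair share of primaries.
func pickMoves(shardPrimaryCells map[string]string, cells []string) map[string]string {
	// Count primaries per cell.
	counts := make(map[string]int, len(cells))
	for _, c := range cells {
		counts[c] = 0
	}
	for _, c := range shardPrimaryCells {
		counts[c]++
	}

	// Each cell should hold at most ceil(shards / cells) primaries.
	fair := (len(shardPrimaryCells) + len(cells) - 1) / len(cells)

	moves := make(map[string]string)
	for shard, cell := range shardPrimaryCells {
		if counts[cell] <= fair {
			continue // this cell is not over-represented
		}
		// Move this primary to the cell with the fewest primaries.
		target := ""
		for _, c := range cells {
			if target == "" || counts[c] < counts[target] {
				target = c
			}
		}
		if target != cell && counts[target] < fair {
			moves[shard] = target
			counts[cell]--
			counts[target]++
		}
	}
	return moves
}
```

For example, `pickMoves(map[string]string{"commerce/-80": "cell2", "commerce/80-": "cell2"}, []string{"cell1", "cell2", "cell3"})` would propose moving one of the two primaries out of `cell2`.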
@GuptaManan100 / @deepthi your thoughts are appreciated, thanks for the discussion 🙇
@timvaillancourt Even with some amount of tolerance in VTOrc, I fear flapping will be unavoidable, because whatever percentage of imbalance you set, it would probably be possible to create a configuration of VTOrcs managing subsets such that one of them always reports an imbalance wherever the primaries live.
If we are going to be balancing the primaries at a per-keyspace level, I still think it is better for this to live in `vtctld` as a command that can be called. As far as usage on `vtctld` goes, I think we can just ask the users to provision another `vtctld` just for running this command (if resources ever become an issue).
Also, I don't see too much value in VTOrc checking for imbalance regularly. Let's say the user has run the vtctld command/manually reparented the primaries such that the servers are balanced across cells. Any imbalance in the cluster would probably not be introduced immediately. A few reparents will be required for that to happen. In general this would take time and it wouldn't be worthwhile for VTOrc to keep checking for such an imbalance.
Another idea that I had while typing out this response is that we could potentially just add a flag to PRS and ERS called `--prefer-balanced-cells` (or whatever else we call it), which basically makes them prioritise selecting a primary from a cell such that the configuration remains balanced. What this ensures is that from the point the cluster comes up (no primaries anywhere), each reparent itself makes sure to keep the balance, so we wouldn't need any polling whatsoever. It would then be up to the users whether they want to run with this balancing or not. This also removes the possibility of extra reparents. We wouldn't need polling in VTOrc (or anywhere else) either, and resources would not be that much of an issue, because we would just be accessing more shard records in a reparent operation. This will, however, potentially reduce the concurrency of a reparent operation from per-shard to per-keyspace level (behind the flag, that is).
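Purely to illustrate the idea (this is not the existing reparent code, and `candidate`/`preferBalancedCell` are made-up names), such a flag might boil down to a tie-breaker applied after the usual eligibility checks:

```go
package reparent

// candidate is a hypothetical, trimmed-down view of a reparent candidate.
type candidate struct {
	Alias string // tablet alias
	Cell  string // cell the tablet lives in
}

// preferBalancedCell sketches what a --prefer-balanced-cells style flag
// could do during new-primary selection: among tablets that are already
// eligible by the usual rules (replication position, durability policy,
// etc.), prefer one whose cell currently holds the fewest primaries in
// this keyspace. Assumes eligible is non-empty.
func preferBalancedCell(eligible []candidate, primariesPerCell map[string]int) candidate {
	best := eligible[0]
	for _, c := range eligible[1:] {
		if primariesPerCell[c.Cell] < primariesPerCell[best.Cell] {
			best = c
		}
	}
	return best
}
```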
What do you think of this suggestion?
> If we are going to be balancing the primaries at a per-keyspace level, I still think it is better for this to live in `vtctld` as a command that can be called. As far as usage on `vtctld` goes, I think we can just ask the users to provision another `vtctld` just for running this command (if resources ever become an issue).
@GuptaManan100 thanks for the feedback, I think the `vtctld` approach you mentioned will work well. At some point it might make sense to make this a scheduled task (out of scope), but `crond` could be used to begin with 👍
> Another idea that I had while typing out this response is that we could potentially just add a flag to PRS and ERS called `--prefer-balanced-cells` (or whatever else we call it), which basically makes them prioritise selecting a primary from a cell such that the configuration remains balanced. What this ensures is that from the point the cluster comes up (no primaries anywhere), each reparent itself makes sure to keep the balance, so we wouldn't need any polling whatsoever
I like this idea a lot 👍. I'd prefer to separate this into a new RFC/PR as an "optimisation" to follow this one. The reason being this won't resolve problems with a cell that was partitioned or drained and later comes back online, which this RFC hoped to address. In that scenario no reparents would need to occur, and thus no balancing.
Without this "prefer balanced" in PRS it will make some reparents that will later need to be "rebalanced", which I think it acceptable until that optimisation is added
What do you think about this approach @GuptaManan100? 🙇
Sure, yes that sounds good. Adding a vtctld command as the first step is perfectly reasonable 👍!
I missed this RFC when it first came up. Very interesting discussion. The one thing I will add is that for those running with vitess-operator, that would be a natural place to implement any rebalancing, because it in fact has a global view of the full vitess cluster.