
Datastore rebalancing on node scale up/down

Open dfellis opened this issue 3 years ago • 10 comments

Preface

This is sort of a "mini-RFC". Having the distributed version of datastore automatically rebalance its keys when the cluster changes size (and/or membership) is a given, but the exact implementation is still in question, and both candidate approaches have their pros and cons. This issue is to debate which path to take.

There are essentially two models for this work: push and pull.
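Both models rely on every node independently computing, for any key, the same ordered list of owners (primary first, then secondaries) via rendezvous hashing. Here's a minimal Rust sketch of that helper (hypothetical names, not the actual datastore internals) that the examples below build on:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical scoring function: hash the (key, node) pair so every node
// computes the same score for the same inputs.
fn rendezvous_score(key: &str, node: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    node.hash(&mut hasher);
    hasher.finish()
}

// Order the cluster's nodes by ownership preference for `key`: index 0 is the
// primary, the next few entries are the secondaries that hold backup copies.
fn ownership_order(key: &str, nodes: &[String]) -> Vec<String> {
    let mut ordered = nodes.to_vec();
    ordered.sort_by_key(|node| std::cmp::Reverse(rendezvous_score(key, node)));
    ordered
}
```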

Push

It's dangerous to go alone! Take this

In this approach, each node triggers on changes in the cluster and pushes relevant data to the other nodes.

When a new node comes in, each existing node pushes copies of the keys it owns that it believes the new node should have. The rule here is: if the pushing node was the primary owner before and the new node is now the primary owner, it pushes to the new primary node. If the new node is not the primary but should be one of the secondaries for backup purposes, the node directly above it in the ownership order pushes to it.

When a node disappears: Any data that the now-missing node owned must be pushed by one of the other nodes in the list. If that node was the last secondary, then the second-to-last secondary pushes the data to the new last secondary node. If it was anywhere else in the set, the former last secondary pushes to the new last secondary.
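To make those two rules concrete, here's a rough sketch of the decision a single node would make for one key when the cluster membership changes, building on the hypothetical `ownership_order` helper above (again, none of these names are the real implementation):

```rust
// Sketch of the push rules above, from the point of view of `self_node`, for a
// single `key`. `replicas` is the number of copies kept per key (primary plus
// secondaries). Returns the nodes this node should push its copy of `key` to.
fn push_targets(
    key: &str,
    self_node: &str,
    old_nodes: &[String],
    new_nodes: &[String],
    replicas: usize,
) -> Vec<String> {
    let old_owners = ownership_order(key, old_nodes);
    let new_owners = ownership_order(key, new_nodes);
    let old_set = &old_owners[..replicas.min(old_owners.len())];
    let new_set = &new_owners[..replicas.min(new_owners.len())];
    let mut targets = Vec::new();

    // A newly joined node is now the primary and we were the old primary:
    // push our copy to the new primary.
    if old_set.first().map(String::as_str) == Some(self_node) {
        if let Some(new_primary) = new_set.first() {
            if !old_nodes.contains(new_primary) {
                targets.push(new_primary.clone());
            }
        }
    }

    // A newly joined node landed in a secondary slot: the node directly above
    // it in the new ownership order pushes to it.
    for (i, owner) in new_set.iter().enumerate().skip(1) {
        if !old_nodes.contains(owner)
            && new_set.get(i - 1).map(String::as_str) == Some(self_node)
        {
            targets.push(owner.clone());
        }
    }

    // A replica holder left the cluster: someone must fill in the new last
    // secondary. If the departed node was the old last secondary, the
    // second-to-last pushes; otherwise the former last secondary pushes.
    let departed: Vec<&String> = old_set
        .iter()
        .filter(|n| !new_nodes.contains(*n))
        .collect();
    if let Some(gone) = departed.first() {
        let old_last = old_set.last();
        let pusher = if Some(*gone) == old_last {
            old_set.get(old_set.len().saturating_sub(2))
        } else {
            old_last
        };
        if pusher.map(String::as_str) == Some(self_node) {
            if let Some(new_last) = new_set.last() {
                targets.push(new_last.clone());
            }
        }
    }

    targets
}
```

Even this single-key version shows how the edge cases pile up once joins and leaves land in the same check interval.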

If there's a combination of nodes appearing and disappearing between checks, the specific actions to perform become more complicated, but it should be solvable.

The biggest pro of this solution is that the cluster follows simple rules and rebalances without explicit coordination between the nodes. The biggest con is figuring out the rules of what each node should be doing for each combination of +new/-old nodes in the cluster.

Pull

Show me what you got!

In this approach, when a new node comes online, it polls all other nodes in the cluster asking for data that it should have locally. It receives sets of kv pairs from each other node and uses the value it determines came from the key's primary node (or, if it believes it is the primary itself, from the node it determines to be the first secondary). Similarly, it tells whichever former secondary it determines should no longer own the key to delete the data.

When a node disappears, every node still in the cluster polls every other node for data that the dead node was responsible for and that the polled node also holds. It then takes these sets of kv pairs, uses the value that came from whichever node is deemed the current primary for each key, and tells any node that handed over a kv pair it should no longer be responsible for to delete its copy.
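In both cases the node doing the pulling applies the same rule to whatever comes back. A rough sketch, reusing the hypothetical `ownership_order` helper from above (the real exchange would go over the control port rather than in-memory maps):

```rust
use std::collections::HashMap;

// Stand-in for a peer's response to a pull request: the kv pairs that peer
// holds which it believes the pulling node should now own.
type PeerResponse = HashMap<String, Vec<u8>>;

// Sketch of the pull rebalance from the point of view of `self_node`.
// `responses` maps peer name -> the kv pairs that peer returned. Returns the
// (peer, key) pairs to delete because the peer no longer owns them.
fn pull_rebalance(
    self_node: &str,
    nodes: &[String],
    replicas: usize,
    responses: &HashMap<String, PeerResponse>,
    local: &mut HashMap<String, Vec<u8>>,
) -> Vec<(String, String)> {
    let mut deletions = Vec::new();
    for (peer, kvs) in responses {
        for (key, value) in kvs {
            let owners = ownership_order(key, nodes);
            let owners = &owners[..replicas.min(owners.len())];
            // Trust the value from the primary, or from the first secondary
            // when we are the primary ourselves.
            let authoritative = if owners.first().map(String::as_str) == Some(self_node) {
                owners.get(1)
            } else {
                owners.first()
            };
            if authoritative.map(String::as_str) == Some(peer.as_str()) {
                local.insert(key.clone(), value.clone());
            }
            // If the responding peer is no longer in the replica set for this
            // key, tell it to drop its copy.
            if !owners.iter().any(|n| n == peer) {
                deletions.push((peer.clone(), key.clone()));
            }
        }
    }
    deletions
}
```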

The biggest pro of this solution is that no special logic is needed for larger combinations of added/removed nodes, but the biggest con is the n^2 socket storm (n connections per node, n^2 in total) that occurs when nodes are removed from the pool, along with all of the duplicated data flying around between the nodes. That could be mitigated a bit by making it a multi-part setup where each node first gets only lists of keys from all of the other nodes, then requests the values it wants from specific nodes and the deletion of data from the others, but there would still be (m - 1) * (n - 1) duplicate keys, I believe.
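A sketch of that multi-part variant under the same assumptions: only key names flow in the first phase, and the second phase fetches each value from a single peer while requesting deletions from the rest:

```rust
use std::collections::HashMap;

// Phase 1 has already collected `key_lists`: peer name -> the keys that peer
// offered. This plans phase 2: which peer to fetch each value from and which
// peers should drop their copies. (Hypothetical names, same caveats as above.)
fn plan_pull(
    self_node: &str,
    nodes: &[String],
    replicas: usize,
    key_lists: &HashMap<String, Vec<String>>,
) -> (Vec<(String, String)>, Vec<(String, String)>) {
    let mut fetches = Vec::new();   // (peer, key): request this value from this peer
    let mut deletions = Vec::new(); // (peer, key): ask this peer to delete the key
    for (peer, keys) in key_lists {
        for key in keys {
            let owners = ownership_order(key, nodes);
            let owners = &owners[..replicas.min(owners.len())];
            let authoritative = if owners.first().map(String::as_str) == Some(self_node) {
                owners.get(1)
            } else {
                owners.first()
            };
            if authoritative.map(String::as_str) == Some(peer.as_str()) {
                fetches.push((peer.clone(), key.clone()));
            }
            if !owners.iter().any(|n| n == peer) {
                deletions.push((peer.clone(), key.clone()));
            }
        }
    }
    (fetches, deletions)
}
```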

Conclusion

The feeling I get is that "pull" is the simpler solution to implement first, and perhaps we revisit the less-bandwidth-intensive "push" in the future, but only after confirming that it doesn't get stuck in weird situations where data gets lost. But I would be very appreciative of other perspectives on this. :)

dfellis avatar May 27 '21 22:05 dfellis

Seeing how we have everything in place, I agree that the "pull" option seems like the path to follow, since it's the new node that initializes its control port and sees the current nodes available. Maybe it would be possible to implement something where the new nodes "know" which nodes will be deleted (upgrade, scale-down, or rollback cases) and just pull the data from those nodes?

Another question I have is: could we lose some data with our current steps for dealing with this process? Example of an upgrade:

  1. Create new vms
  2. Add to DNS
  3. Delete old one from DNS
  4. Delete old machine

The daemon will probably need to notify the deploy service between steps 3 and 4 that it is safe to delete the VMs, since all data will have been rebalanced by then?

aguillenv avatar May 28 '21 12:05 aguillenv

@aguillenv yep, that also needs to be tackled, and was going to be the next task I'm going after once this one is done. :)

The current strategy of halving the cluster on scale down in a single shot would break both of these if the cluster size is more than 2x the number of key replicas.
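As a quick illustration (hypothetical node names, reusing the `ownership_order` sketch from the issue description): with 8 nodes, 3 replicas per key, and half the cluster removed in one shot, some keys can lose every replica at once.

```rust
fn main() {
    // 8-node cluster, 3 replicas per key, and the top half removed in one shot.
    let nodes: Vec<String> = (1..=8).map(|i| format!("node{}", i)).collect();
    let removed: Vec<String> = nodes[4..].to_vec();
    let replicas = 3;
    for i in 0..1000 {
        let key = format!("key{}", i);
        let owners = ownership_order(&key, &nodes);
        // If all of a key's replica holders land in the removed half, every
        // copy of that key disappears at the same time.
        if owners[..replicas].iter().all(|n| removed.contains(n)) {
            println!("{} would lose all {} replicas", key, replicas);
        }
    }
}
```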

I have some ideas on that, but I want to save it for the next issue I create. :)

dfellis avatar May 28 '21 17:05 dfellis

I personally think that it might be better to do push - I feel like pull not only requires the extra sockets to grab the data but it also requires more packets for data deletion (unless we want to expose the possibility for inconsistent data by having the pulled node determine when the data is deleted). Less congestion = more bandwidth for users = better ux, in my opinion

cdmistman avatar Jun 01 '21 17:06 cdmistman

> I personally think that it might be better to do push - I feel like pull not only requires the extra sockets to grab the data but it also requires more packets for data deletion (unless we want to expose the possibility for inconsistent data by having the pulled node determine when the data is deleted). Less congestion = more bandwidth for users = better ux, in my opinion

The main issue with it, and why I am leaning towards pull, is the reliability of the results.

> The biggest con is figuring out the rules of what each node should be doing for each combination of +new/-old nodes in the cluster.

Basically, with a large enough combination of added and removed nodes, the rules for determining which nodes to push data or delete instructions to become complicated, especially once you factor in timing. If nodes are going up and down rapidly but each node is sampling cluster state less frequently than the cluster is changing, then the computed push commands may be wrong.

Simplest example I can think of off the top of my head: what if a node goes down and comes back up before being noticed by any other node in the cluster? Now it no longer has any of the data, but none of the other nodes think they need to re-replicate keys to it.

Basically, I'm leaning towards "push" being "too clever" and requiring a lot of testing on edge cases while "pull" burns extra sockets but is self-stabilizing.

dfellis avatar Jun 01 '21 17:06 dfellis

That's solved by an announcement message or a bool flag right? Once a node starts up, it either sets a flag in dns stating it needs all data, or it can send a message over udp multicast to all the other nodes maybe?

cdmistman avatar Jun 01 '21 17:06 cdmistman

> That's solved by an announcement message or a bool flag right? Once a node starts up, it either sets a flag in dns stating it needs all data, or it can send a message over udp multicast to all the other nodes maybe?

But that is the "pull" approach, in a nutshell? The nodes can't independently determine the proper state of the cluster replication so they have to coordinate with each other.

A broadcast message of some sort asking for all keys and then pulling the ones it actually needs.

dfellis avatar Jun 01 '21 18:06 dfellis

yeah, mostly - the big difference being that the new node is just announcing itself, whereas in the pull method it would poll each of the other nodes for the data (unless I'm missing something?). The other nodes are still determining which data to give to the new node, and the new node only has to send 1 packet

cdmistman avatar Jun 01 '21 18:06 cdmistman

> yeah, mostly - the big difference being that the new node is just announcing itself, whereas in the pull method it would poll each of the other nodes for the data (unless I'm missing something?). The other nodes are still determining which data to give to the new node, and the new node only has to send 1 packet

This is true, but there's an underlying assumption here that I'm not entirely comfortable with: that all of the nodes in the cluster have the same "view" of the cluster's membership.

Suppose a new node comes up and an old node dies at the same time. The new node announces itself and all of the other nodes (except the dead one) push keys to it. The dead one does not, so certain keys are not replicated to it that the others will implicitly assume have been.

Eventually the nodes notice that there is a dead node, but it will be very easy for that code to say "new node" already has the relevant data since "dead node" should have pushed it there, so we'll only rebalance this other set of keys and the new node is never pushed the keys it needs.

Then the datacenter-aware reading of data (for reduced latency) assumes that such a key must exist on "new node" and it requests it, and gets back nothing, so then that node acts on a different state than the rest of the cluster and breaks user logic.

To resolve this would require the new node to somehow "know" about keys that were never pushed to it, or the other nodes to somehow not "trust" their own push logic, or some sort of cluster-change state tracking that "double checks" prior state changes when a node fails and its push logic may have failed (how far back does this checking go? which node should be responsible in such a case? what if that node fails, too, and the 2nd removed node doesn't have a node responsible for it when it fails? etc).

Adding all of those things feels brittle and adds back some of the bandwidth consumption again.

dfellis avatar Jun 01 '21 18:06 dfellis

Another alternative is maybe using something like what Riak/Cassandra/ES do, which is having a concept/abstraction of partitions? A node can have 0 or multiple partitions and each partition can take up a different amount of the key space. That way when we add a new node we can rebalance partitions, but we can also leave things as-is to reduce network bandwidth. Down the line we could also optimize partitions to account for uneven traffic across the key space.

This approach probably does not make sense on the first pass. The vanilla "pull" method sounds good to me!

depombo avatar Jun 01 '21 19:06 depombo

> Another alternative is maybe using something like what Riak/Cassandra/ES do, which is having a concept/abstraction of partitions? A node can have 0 or multiple partitions and each partition can take up a different amount of the key space. That way when we add a new node we can rebalance partitions, but we can also leave things as-is to reduce network bandwidth. Down the line we could also optimize partitions to account for uneven traffic across the key space.
>
> This approach probably does not make sense on the first pass. The vanilla "pull" method sounds good to me!

Was very busy today, but wanted to respond to this. Partitions are a key component of Consistent Hashing, not Rendezvous Hashing. It's a more "chatty" algorithm because the various nodes have to constantly update each other on partition assignment, splitting/merging, etc.

It may be that we eventually find ourselves there (because of other things we want to support in the way data is distributed) but if the main objection to the "pull" approach is bandwidth considerations, that won't help here. :)

dfellis avatar Jun 02 '21 04:06 dfellis