Add support for draining nodes from PPs
When trying to remove a node, it would be least disruptive if this node no longer ran any PPs (even more so if it is the current leader). Therefore, it would be beneficial to have a way to tell the cluster controller that a certain set of nodes should no longer be considered as targets for partition processors. Moreover, if these nodes are currently running PPs, those PPs should be gracefully moved from the draining nodes onto other nodes.
One way to solve this problem is to introduce a WorkerState that we store in the NodesConfiguration. The variants could be:
enum WorkerState {
    /// Node should be considered for placing partition processors on it. It can already run a set of partition processors.
    Active,
    /// Node should no longer be considered for placing partition processors on it. The system is trying to move partition processors away from this node. It can still run a set of partition processors.
    Draining,
    /// Node is not running any partition processors.
    Drained,
}
Once a node is marked as WorkerState::Draining, the Scheduler will no longer consider it when ensuring the PartitionReplication. This means that if PartitionReplication::Limit(2) is configured and we mark node n1, which is running the leader for pp 1, as WorkerState::Draining, the Scheduler will add a replacement node to the partition placement to compensate for n1. The PP running on n1 will be stopped once all partition processors required for fulfilling the PartitionReplication have fully caught up. Leadership could already be transferred earlier if there is a caught-up PP running on an active node.
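A minimal sketch of how such a filter in the Scheduler could look; the names here (NodeInfo, placement_candidates, replication_factor) are made up for illustration and don't correspond to the actual Scheduler code:

#[derive(Clone, Copy, PartialEq, Eq)]
enum WorkerState {
    Active,
    Draining,
    Drained,
}

// Hypothetical per-node view; the real NodesConfiguration entry would carry more.
struct NodeInfo {
    id: u32,
    worker_state: WorkerState,
}

/// Pick up to `replication_factor` placement candidates, considering only nodes in
/// WorkerState::Active. Draining nodes keep their already running PPs until the
/// replacements have caught up, but they are not eligible as new placement targets.
fn placement_candidates(nodes: &[NodeInfo], replication_factor: usize) -> Vec<u32> {
    nodes
        .iter()
        .filter(|n| n.worker_state == WorkerState::Active)
        .map(|n| n.id)
        .take(replication_factor)
        .collect()
}

fn main() {
    let nodes = vec![
        NodeInfo { id: 1, worker_state: WorkerState::Draining }, // n1 is being drained
        NodeInfo { id: 2, worker_state: WorkerState::Active },
        NodeInfo { id: 3, worker_state: WorkerState::Active },
    ];
    // With PartitionReplication::Limit(2), only n2 and n3 are candidates.
    assert_eq!(placement_candidates(&nodes, 2), vec![2, 3]);
}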
Once a node is no longer running any partition processors, it is safe to transition its state from WorkerState::Draining to WorkerState::Drained. From the perspective of partition processors, a node in WorkerState::Drained can then safely be removed from the Restate cluster.
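The transition could be guarded by a check like the following (again just an illustration, reusing the WorkerState enum from the sketch above; running_pp_count is a hypothetical count of the PPs still hosted on the node):

fn next_worker_state(current: WorkerState, running_pp_count: usize) -> WorkerState {
    match current {
        // A draining node only becomes Drained once no PPs run on it anymore.
        WorkerState::Draining if running_pp_count == 0 => WorkerState::Drained,
        other => other,
    }
}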
For the decentralized partition processor leader election, where we use rendezvous hashing to create a different partition processor placement for each partition and then use local liveness information to select which node runs the PP leader, the WorkerState could be treated as an additional filter on top of the partition placement. A draining node could stop its leading PP once it sees that another PP has taken over leadership. To decide when to stop follower PPs running on a draining node in the decentralized world, we would either need information on how caught up the replacement nodes are or rely on some heuristics (e.g. time, or seeing that an active node has taken over leadership). Without the information on how caught up nodes are, it is hard to provide a seamless leadership change where we guarantee that the new leader is ready to process requests.
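Roughly, the additional filter could look like this (illustrative only: integer ids and DefaultHasher stand in for the real node/partition types and hash function, and WorkerState is the enum sketched above; the actual placement and liveness handling would look different):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rendezvous hashing over the nodes of a partition, with WorkerState applied as a
/// filter before scoring. Draining/drained nodes are never selected, so a draining
/// node can observe another node taking over leadership and then stop its leading PP.
fn rendezvous_leader(partition_id: u64, nodes: &[(u64, WorkerState)]) -> Option<u64> {
    nodes
        .iter()
        .filter(|(_, state)| matches!(state, WorkerState::Active))
        .max_by_key(|(node_id, _)| {
            // Score each (partition, node) pair; the highest score wins.
            let mut hasher = DefaultHasher::new();
            (partition_id, *node_id).hash(&mut hasher);
            hasher.finish()
        })
        .map(|(node_id, _)| *node_id)
}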
cc @AhmedSoliman
Done via the replica set work (fa3998719461b9ecdd7e9c12794cac06a6fb2189 and other commits).