vespa Cluster controller ZooKeeper writes should be gated with a test-and-set on expected zknode version

Open vekterli opened this issue 7 years ago • 1 comment

Even though the cluster controller's leader election algorithm is built on top of a strongly consistent store (ZooKeeper), leaders today write new state to ZooKeeper blindly, without checking that the target node is still at the version they last observed. This makes split-brain scenarios possible, where multiple cluster controllers believe they are the leader at the same time. In general this should only happen if the deposed leader does not receive an event from the ZooKeeper networking layer that connectivity has been lost before the newly elected leader's grace period runs out, but there is inherently no guarantee that this event arrives in a timely fashion.

We should track zknode versions (and predicate setData calls on these) for at least the following state:

- Current cluster state version
- Per-node persisted wanted states

If a KeeperException.BadVersionException is caught, any cluster controller that believes itself to be the leader should immediately drop leadership and refresh its leader election state.
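
A minimal sketch of the proposed gating, using an in-memory stand-in rather than the real ZooKeeper client so that it runs on its own. All names here (`VersionedNode`, `Controller`, `publish`, and the local `BadVersionException`) are hypothetical and are not taken from the cluster controller code. The stale leader's write is rejected because the version it remembers no longer matches, and it steps down instead of overwriting newer state.

```java
import java.nio.charset.StandardCharsets;

/** Local stand-in for ZooKeeper's KeeperException.BadVersionException (hypothetical). */
class BadVersionException extends Exception {
    BadVersionException(int expected, int actual) {
        super("expected version " + expected + " but node is at " + actual);
    }
}

/** In-memory stand-in for a single versioned node (hypothetical, not ZooKeeper's API). */
class VersionedNode {
    private byte[] data = new byte[0];
    private int version = 0;

    synchronized int version() { return version; }
    synchronized byte[] data() { return data.clone(); }

    /** Test-and-set: writes only if expectedVersion matches; returns the new version. */
    synchronized int setData(byte[] newData, int expectedVersion) throws BadVersionException {
        if (expectedVersion != version) {
            throw new BadVersionException(expectedVersion, version);
        }
        data = newData.clone();
        return ++version;
    }
}

/** Hypothetical controller that remembers the last node version it observed. */
class Controller {
    private final VersionedNode node;
    private int knownVersion;
    private boolean leader;

    Controller(VersionedNode node) {
        this.node = node;
        this.knownVersion = node.version();
        this.leader = true; // assumption: this controller believes it won the election
    }

    boolean isLeader() { return leader; }

    /** Returns true if the write went through; on a version mismatch, drops leadership. */
    boolean publish(String state) {
        if (!leader) {
            return false;
        }
        try {
            knownVersion = node.setData(state.getBytes(StandardCharsets.UTF_8), knownVersion);
            return true;
        } catch (BadVersionException e) {
            leader = false;                 // someone else wrote since we last looked: step down
            knownVersion = node.version();  // a real controller would also refresh election state
            return false;
        }
    }
}
```

With the real client, the corresponding call is `ZooKeeper.setData(path, data, version)`, which throws `KeeperException.BadVersionException` on a mismatch. Passing `-1` as the version matches any version, which is the blind-write behavior this issue wants to move away from.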

Jan 02 '18 12:01 vekterli

First item (Current cluster state version) is already done.

Jun 23 '21 11:06 geirst
