Cluster controller ZooKeeper writes should be gated with a test-and-set on expected zknode version
Even though the cluster controller has a leader election algorithm built on top of a strongly consistent store (ZooKeeper), leaders today blindly write new state to ZooKeeper. This makes split-brain scenarios possible, where multiple cluster controllers believe they are the leader. In general this should only happen if the deposed leader is not notified by the ZooKeeper networking layer that connectivity has been lost before the newly elected leader's grace period runs out, but there is inherently no guarantee that this notification arrives in a timely fashion.
We should track zknode versions (and predicate `setData` calls on these) for at least the following state:
- Current cluster state version
- Per-node persisted wanted states
If a `KeeperException.BadVersion` exception is caught, any cluster controller that believes itself to be the leader should immediately drop leadership and refresh leader election state.
First item (Current cluster state version) is already done.
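
To make the proposal concrete, here is a minimal sketch of a version-gated write and the drop-leadership reaction. It is an in-memory model, not the cluster controller code: `VersionedNode`, `Controller`, `publish` and the local `BadVersionException` are hypothetical stand-ins for a zknode, a cluster controller and ZooKeeper's bad-version exception, used only so the example runs without a ZooKeeper ensemble.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative in-memory model only; all names here are hypothetical. */
final class VersionGatedWrites {

    /** Stand-in for ZooKeeper's bad-version exception. */
    static final class BadVersionException extends Exception {
        BadVersionException(int expected, int actual) {
            super("expected zknode version " + expected + " but found " + actual);
        }
    }

    /** Stand-in for a single zknode: data plus a version bumped on every write. */
    static final class VersionedNode {
        private byte[] data = new byte[0];
        private int version = 0;

        /** Test-and-set write: applied only if the caller's expected version matches. */
        synchronized int setData(byte[] newData, int expectedVersion) throws BadVersionException {
            if (expectedVersion != version) {
                throw new BadVersionException(expectedVersion, version);
            }
            data = newData.clone();
            return ++version;
        }

        synchronized int version() { return version; }

        synchronized String dataAsString() { return new String(data, StandardCharsets.UTF_8); }
    }

    /** Hypothetical controller that remembers the zknode version it last observed. */
    static final class Controller {
        private final VersionedNode node;
        private int lastKnownVersion;
        private boolean leader;

        Controller(VersionedNode node, boolean leader) {
            this.node = node;
            this.lastKnownVersion = node.version();
            this.leader = leader;
        }

        /** Returns true if the write was applied; on a version mismatch, drops leadership. */
        boolean publish(String state) {
            if (!leader) return false;
            try {
                byte[] bytes = state.getBytes(StandardCharsets.UTF_8);
                lastKnownVersion = node.setData(bytes, lastKnownVersion);
                return true;
            } catch (BadVersionException e) {
                // Someone else wrote since we last looked, so we cannot be the sole leader.
                // A real controller would also refresh leader election state here.
                leader = false;
                return false;
            }
        }

        boolean isLeader() { return leader; }
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode();
        Controller deposed = new Controller(node, true);
        Controller elected = new Controller(node, true); // split brain: both believe they lead

        System.out.println(elected.publish("version:2")); // write applied
        System.out.println(deposed.publish("version:1")); // rejected: stale expected version
        System.out.println(deposed.isLeader());
        System.out.println(node.dataAsString());
    }
}
```

With the real client, the equivalent gate is the `version` argument of `ZooKeeper.setData(path, data, version)`. Passing `-1` there matches any version, which amounts to the blind write this issue wants to get rid of.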