atlasdb

Leader ping health check can get stuck in a long loop

Open jeremyk-91 opened this issue 5 years ago • 0 comments

See PDS-120860.

Suppose a timelock cluster has 3 nodes, and node 1 was the leader but was then partitioned off from the others. If node 1 receives no further requests, it can continue to believe it is the leader for every namespace.

Suppose the partition recovers, but node 2 (the new leader) still cannot send messages to node 1. Then, when node 1 contacts nodes 2 and 3 for the healthcheck, it believes there are two leaders: itself and node 2. As part of the healthcheck, it will try to get timestamps from both itself and node 2. Ideally, node 1 should notice immediately that it is no longer the leader and step down across all namespaces at once. Currently this happens one namespace at a time, so the healthcheck takes too long (longer than the deployment platform expects a healthcheck to reasonably take).
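To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. It is not AtlasDB or TimeLock code: the class `LeadershipSketch`, its methods, and the simulated remote check are all hypothetical, and it assumes (as the scenario above suggests) that losing leadership for one namespace implies losing it for all of them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of one node's per-namespace leadership beliefs.
// None of these names come from the AtlasDB codebase.
class LeadershipSketch {
    private final Map<String, Boolean> believesLeader = new LinkedHashMap<>();
    private int remoteChecks = 0; // counts simulated round trips to other nodes

    LeadershipSketch(List<String> namespaces) {
        for (String namespace : namespaces) {
            believesLeader.put(namespace, true); // stale belief after the partition
        }
    }

    // Simulated remote call: another node always reports a newer leader.
    private boolean remoteSaysStillLeader(String namespace) {
        remoteChecks++;
        return false;
    }

    // Behaviour described in the issue: each namespace pays for its own check.
    public void stepDownOneAtATime() {
        for (Map.Entry<String, Boolean> entry : believesLeader.entrySet()) {
            if (!remoteSaysStillLeader(entry.getKey())) {
                entry.setValue(false);
            }
        }
    }

    // Desired behaviour: one check, then step down for every namespace.
    // Assumes leadership is shared across namespaces on a node.
    public void stepDownEverywhere() {
        if (believesLeader.isEmpty()) {
            return;
        }
        String anyNamespace = believesLeader.keySet().iterator().next();
        if (!remoteSaysStillLeader(anyNamespace)) {
            believesLeader.replaceAll((namespace, wasLeader) -> false);
        }
    }

    public int remoteChecks() {
        return remoteChecks;
    }

    public boolean isLeaderAnywhere() {
        return believesLeader.containsValue(true);
    }
}
```

In this sketch, both strategies end with the node no longer believing it is the leader anywhere, but the first costs one remote check per namespace while the second costs a single check regardless of how many namespaces exist. With many namespaces and slow or timing-out remote calls, that per-namespace cost is what would push the healthcheck past its deadline.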

jeremyk-91 • Jun 04 '20 16:06