atlasdb
Leader ping health check can get stuck in a long loop
See PDS-120860.
Suppose a timelock cluster has 3 nodes, and node 1 was the leader but then got partitioned off. If node 1 doesn't get any more requests, it could still believe it is the leader for every namespace.
Suppose the partition recovers, but node 2 (the new leader) still cannot send messages to node 1. Then, when node 1 contacts nodes 2 and 3 for the health check, it believes there are two leaders (itself and node 2), and it tries to get timestamps from both itself and node 2 as part of the health check. Ideally node 1 should notice immediately that it is not the leader and step down across all namespaces. Currently it steps down one namespace at a time, so the health check takes longer than the deployment platform expects a health check to take.
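To make the cost difference concrete, here is a minimal sketch in Java. It is not the actual timelock code: `LeadershipSketch`, `quorumConfirmsLeadership`, and both health check methods are hypothetical names, and the quorum round trip is replaced by a stub that always reports that leadership was lost. It only illustrates why stepping down per namespace costs one remote check per namespace, while stepping down everywhere on the first negative answer costs one.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of one node's per-namespace leadership beliefs.
final class LeadershipSketch {
    private final Map<String, Boolean> believesLeader = new LinkedHashMap<>();
    private int remoteChecks = 0;

    LeadershipSketch(List<String> namespaces) {
        // The stale node starts out believing it leads every namespace.
        for (String namespace : namespaces) {
            believesLeader.put(namespace, true);
        }
    }

    // Stand-in for a (slow) quorum round trip. Assumption: the node has
    // already lost leadership, so the quorum always answers "no".
    private boolean quorumConfirmsLeadership(String namespace) {
        remoteChecks++;
        return false;
    }

    // Behaviour described in the issue: each namespace is checked and
    // stepped down separately. Returns the number of remote checks made.
    int healthCheckOneAtATime() {
        int before = remoteChecks;
        for (Map.Entry<String, Boolean> entry : believesLeader.entrySet()) {
            if (entry.getValue() && !quorumConfirmsLeadership(entry.getKey())) {
                entry.setValue(false);
            }
        }
        return remoteChecks - before;
    }

    // Desired behaviour: the first negative answer clears every namespace.
    int healthCheckStepDownEverywhere() {
        int before = remoteChecks;
        boolean lostLeadership = false;
        for (Map.Entry<String, Boolean> entry : believesLeader.entrySet()) {
            if (entry.getValue() && !quorumConfirmsLeadership(entry.getKey())) {
                lostLeadership = true;
                break;
            }
        }
        if (lostLeadership) {
            believesLeader.replaceAll((namespace, value) -> false);
        }
        return remoteChecks - before;
    }

    boolean believesLeaderAnywhere() {
        return believesLeader.containsValue(true);
    }

    public static void main(String[] args) {
        List<String> namespaces = Arrays.asList("ns1", "ns2", "ns3");
        System.out.println("one at a time: "
                + new LeadershipSketch(namespaces).healthCheckOneAtATime());
        System.out.println("step down everywhere: "
                + new LeadershipSketch(namespaces).healthCheckStepDownEverywhere());
    }
}
```

With N namespaces, the first method makes N remote checks before the node stops believing it is the leader anywhere, while the second makes a single check. If each check involves a timeout against an unreachable peer, that difference is what pushes the health check past the platform's deadline.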