atlasdb
Timelock nodes should propose leadership in the presence of consistent slowness.
Internal reference PDS-109404.
If a node is slow for whatever reason (in this instance, disk latencies were elevated), requests become slow: getFreshTimestamps takes from 500ms to over a second at times, compared to a norm of about 15-20ms. As far as leadership is concerned, Timelock is still "healthy": responses are slow, but not slow enough for the other nodes to trigger an election.
leaderPingResponseWait is the time that followers wait for a response from the leader before proposing leadership. If it is too low, we might get frequent elections in the presence of network blips. If it is too high, it is only useful when the stack is dead or unavailable, not when the stack is performing poorly.
It might make sense to add a second heuristic: if pings from a follower to the leader keep taking longer than some limit for a sustained period (say 5 minutes), the follower proposes to become the leader. Alternatively, from the leader side: if the leader realises that certain requests, e.g. getFreshTimestamps, are consistently above an unacceptable bound, it steps down.
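A minimal sketch of what the follower-side heuristic could look like, to make the proposal concrete. The class name, method names, and the 200ms threshold used in the usage note are hypothetical and are not existing Timelock or AtlasDB APIs; only the 5 minute window comes from the suggestion above. The sketch treats slowness as "sustained" only while every observed ping exceeds the threshold, so a single fast ping resets the clock.

```java
import java.time.Duration;

/**
 * Hypothetical sketch, not part of Timelock: tracks how long leader pings
 * have been continuously slower than a threshold.
 */
final class SustainedSlownessDetector {
    private final long slowThresholdMillis;
    private final long sustainedWindowMillis;
    // Time of the first ping in the current unbroken run of slow pings,
    // or null if the most recent ping was fast (or none was recorded).
    private Long slowSinceMillis = null;

    SustainedSlownessDetector(Duration slowThreshold, Duration sustainedWindow) {
        this.slowThresholdMillis = slowThreshold.toMillis();
        this.sustainedWindowMillis = sustainedWindow.toMillis();
    }

    /** Records one ping latency observed at the given time. */
    synchronized void recordPing(long latencyMillis, long nowMillis) {
        if (latencyMillis > slowThresholdMillis) {
            if (slowSinceMillis == null) {
                slowSinceMillis = nowMillis; // a new run of slow pings starts
            }
        } else {
            slowSinceMillis = null; // any fast ping ends the run
        }
    }

    /** True once pings have been slow for at least the sustained window. */
    synchronized boolean shouldProposeLeadership(long nowMillis) {
        return slowSinceMillis != null
                && nowMillis - slowSinceMillis >= sustainedWindowMillis;
    }
}
```

A follower might construct this as `new SustainedSlownessDetector(Duration.ofMillis(200), Duration.ofMinutes(5))`, call `recordPing` after each leader ping, and propose leadership when `shouldProposeLeadership` returns true. The same shape could presumably be reused on the leader side by recording getFreshTimestamps latencies and stepping down instead of proposing. Resetting on a single fast ping is deliberately conservative; a percentile over a sliding window would be less sensitive to one-off fast responses.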