atlasdb

Timelock nodes should propose leadership in the presence of consistent slowness.

Open felixdesouza opened this issue 5 years ago • 0 comments

Internal reference PDS-109404.

If a node is slow for whatever reason (in this instance, disk latencies were up), requests become slow: getFreshTimestamps, for example, takes 500ms to over a second at times, compared to a norm of about 15-20ms. As far as leadership is concerned, Timelock is still "healthy": responses are slow, but not slow enough for the other nodes to trigger an election.

leaderPingResponseWait is the time that followers will wait for a response from the leader before proposing leadership. If it is too low, we may get frequent elections in the presence of network blips. If it is too high, it only helps when the stack is dead or unavailable, not when the stack is performing poorly.

It might make sense to have a second heuristic: if pings from the followers to the leader consistently take longer than some limit over a sustained period (say 5 minutes), the follower proposes leadership. Alternatively, from the leader's side, if it realises that certain requests, e.g. getFreshTimestamps, are consistently above an unacceptable bound, it steps down.
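
To make the proposed heuristic concrete, here is a rough sketch of what the follower-side check could look like. Everything in it is hypothetical: `SustainedSlownessDetector`, its methods, and the thresholds are illustrative names and values, not existing Timelock or AtlasDB APIs. It treats a single fast ping as enough to reset the streak, which is one possible reading of "consistently slow".

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical detector (not part of AtlasDB): reports sustained slowness
 * once every observed ping has exceeded latencyLimit for at least
 * sustainedFor. A single ping within the limit resets the streak.
 */
final class SustainedSlownessDetector {
    private final Duration latencyLimit;
    private final Duration sustainedFor;
    // Completion time of the first slow ping in the current streak;
    // null when the most recent ping was within the limit.
    private Instant slowSince;

    SustainedSlownessDetector(Duration latencyLimit, Duration sustainedFor) {
        this.latencyLimit = latencyLimit;
        this.sustainedFor = sustainedFor;
    }

    /** Record one ping round trip to the leader. */
    synchronized void recordPing(Instant completedAt, Duration latency) {
        if (latency.compareTo(latencyLimit) > 0) {
            if (slowSince == null) {
                slowSince = completedAt; // start of a new slow streak
            }
        } else {
            slowSince = null; // a fast ping ends the streak
        }
    }

    /** True when the slow streak has lasted at least sustainedFor. */
    synchronized boolean shouldPropose(Instant now) {
        return slowSince != null
                && Duration.between(slowSince, now).compareTo(sustainedFor) >= 0;
    }
}
```

A follower would call `recordPing` after each leader ping and consult `shouldPropose` alongside the existing leaderPingResponseWait check. The same shape could plausibly be reused on the leader side by feeding it getFreshTimestamps latencies and stepping down when it fires. The latency limit would need to sit well above the 15-20ms norm to avoid reacting to ordinary jitter.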

felixdesouza, Jan 27 '20 16:01