go-spacemesh
Current ticker used for round intervals may skip rounds
Round events (ticks) that are not handled (pulled) in time, i.e. before the next tick, cause the ticker to drop the tick. Hence, we might miss the round that follows the unhandled tick. By definition, if it takes more than a round duration to read the next tick (because of message handling), the node is faulty, and the cause is a bug that should be fixed. On the other hand, if this happens to a node even once, it will stay out of sync with the rounds because it missed one of the ticks. It is possible to implement a wrapper around the ticker, or some variation of it, that sends the id of the round that should be started. That way, when a node misses one round, it jumps straight to the following round instead of getting out of sync.
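A minimal sketch of the wrapper idea, not the project's implementation. The names roundFromElapsed and roundTicks are hypothetical, and the sketch assumes rounds have a fixed duration measured from a known start time. Instead of counting ticks, it derives the round id from elapsed time and keeps only the newest id in a one-slot buffer, so a slow reader sees the latest round rather than a stale one:

```go
package main

import "time"

// roundFromElapsed maps the time elapsed since the first round started to a
// zero-based round index. Hypothetical helper, not part of go-spacemesh.
func roundFromElapsed(elapsed, roundDuration time.Duration) uint32 {
	if elapsed < 0 || roundDuration <= 0 {
		return 0
	}
	return uint32(elapsed / roundDuration)
}

// roundTicks wraps a time.Ticker and emits round ids instead of bare ticks.
// The output channel holds at most one value; an unread id is replaced by the
// newer one, so the reader never has to count ticks to know the round.
// The channel is closed once done is closed.
func roundTicks(start time.Time, roundDuration time.Duration, done <-chan struct{}) <-chan uint32 {
	out := make(chan uint32, 1)
	go func() {
		defer close(out)
		if roundDuration <= 0 {
			return // time.NewTicker panics on non-positive durations
		}
		ticker := time.NewTicker(roundDuration)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case now := <-ticker.C:
				// Discard a stale, unread round id if there is one.
				select {
				case <-out:
				default:
				}
				// This goroutine is the only sender, so the slot is free here.
				out <- roundFromElapsed(now.Sub(start), roundDuration)
			}
		}
	}()
	return out
}
```

A reader that receives id r from the channel can set its local round to r directly, even if it was busy for several round durations in between.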
- What is the priority of this issue?
- Should suggest a concrete solution.
@barakshani @antonlerner what do you think should be the priority of this issue? In my opinion, it is of medium importance but not important at all for the test net (hence should not be labeled "MUST").
a. We are now working on improving the run time of one round. After these improvements this will be a less severe issue, so I think we shouldn't handle it until testnet.
I can think of two solutions: (1) Wait on multiple rounds at a time. Maybe always wait on the next 2 rounds, each in its own goroutine, and cancel the waiting goroutines if that consensus process terminates. This is relatively cheap because Go won't try to schedule the waiting goroutines again until the channel notifies. (2) Add another function to the hare clock that lets you query the current round instead of trying to wait until a certain round has passed. Then, in the main loop of hare, we can poll instead of "waiting" for a round that never comes.
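A rough sketch of option (1), under the assumption that each end-of-round channel is signalled by being closed (so it can be read repeatedly). The helpers elapsedRounds and awaitNext are hypothetical names. The caller holds the channels for the next few rounds and, on waking up, checks without blocking how many of them have already fired:

```go
package main

// elapsedRounds reports how many of the given end-of-round channels are
// already closed, counting from the first and stopping at the first channel
// that is still open. It never blocks.
func elapsedRounds(ends []<-chan struct{}) int {
	n := 0
	for _, ch := range ends {
		select {
		case <-ch:
			n++
		default:
			return n
		}
	}
	return n
}

// awaitNext blocks until the first round in ends is over or cancel is closed.
// On a round end it returns how many consecutive rounds have already ended,
// so the caller can advance by that many rounds at once.
func awaitNext(ends []<-chan struct{}, cancel <-chan struct{}) (int, bool) {
	if len(ends) == 0 {
		return 0, false
	}
	select {
	case <-ends[0]:
		return elapsedRounds(ends), true
	case <-cancel:
		return 0, false
	}
}
```

If awaitNext returns 2, the process was too slow to observe the end of the first round on time and can skip ahead by two rounds instead of waiting on a round that is already over.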
The API of the hare clock only lets you AwaitEndOfRound(round uint32) instead of checking the current round.
https://github.com/spacemeshos/go-spacemesh/blob/0cb3732b0796ed442366f39ef8cf3909a0c2b48d/hare/clock.go#L40-L51
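A sketch of what the polling variant could look like. CurrentRound is an assumed addition, not part of the existing clock API, and roundSource, fixedRound and catchUp are hypothetical names used only for illustration. A real clock would presumably derive the current round from the layer start time and the round duration:

```go
package main

// roundSource is a hypothetical, narrowed view of the hare clock that exposes
// the current round. The existing clock does not offer this method.
type roundSource interface {
	CurrentRound() uint32
}

// fixedRound is a stub roundSource for illustration and tests.
type fixedRound uint32

// CurrentRound returns the fixed round value.
func (f fixedRound) CurrentRound() uint32 { return uint32(f) }

// catchUp compares the locally tracked round with the clock's round.
// It returns the round the process should be in and how many rounds were
// skipped entirely (0 for a normal one-step advance or no advance at all).
func catchUp(local uint32, src roundSource) (round, skipped uint32) {
	now := src.CurrentRound()
	if now <= local {
		return local, 0
	}
	return now, now - local - 1
}
```

For example, catchUp(2, fixedRound(5)) returns round 5 with 2 skipped rounds, so the main loop could jump to round 5 and only then start waiting for the end of that round.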
To elaborate on the original issue:
The consensus process only moves forward if case <-endOfRound: // next round event is selected fast enough. Otherwise we end up out of sync, because proc.advanceToNextRound(ctx) needs to be called in time for the next channel from proc.clock.AwaitEndOfRound(k) to be an accurate notifier. If it is called late, we always end up waiting for a round that has already ended, or that started much earlier than we were aware.
https://github.com/spacemeshos/go-spacemesh/blob/0cb3732b0796ed442366f39ef8cf3909a0c2b48d/hare/algorithm.go#L319-L358