go-spacemesh icon indicating copy to clipboard operation
go-spacemesh copied to clipboard

implementation/protocol doesn't handle well short network splits

Open dshulyak opened this issue 2 years ago • 2 comments

  • consider node that goes offline for less then one layer, but for enough time to miss most of the hare
  • eventually it will fail and terminate with empty layer. local opinion within hdist = empty layer
  • tortoise can't verify such layer using verifying mode, since everyone else votes differently from the local opinion
  • at this point node is effectively in a fork with the rest of the network
  • it will stay in a fork for hdist layers. or until sync fixes local opinion (right now it doesn't because it considers this layer as terminated) and triggers rerun
  • hdist is planned to be 200 layers with 5 min each. node is effectively useless for almost a whole day, so it rather weird to just wait
  • rerun is not really cheap even with verifying tortoise, and this way sync will have to trigger it every time when it sees different layer hash

dshulyak avatar Jun 30 '22 08:06 dshulyak

(internal) slack discussion is here

lrettig avatar Jul 01 '22 20:07 lrettig

Tal's proposed solution:

"hare failed" and "detected offline" are two different things When the hare fails, we vote for an empty layer. When we detect we're offline, we treat it as if the hare is still in progress until we receive a termination certificate (or time out due to too many layers passing)

lrettig avatar Jul 01 '22 20:07 lrettig

It is expected that tortoise terminates within zdist, for this distance tortoise will vote abstain for the layer that wasn't terminated. if during this distance tortoise haven't received an opinion for the layer (empty or for a particular block) it will start voting for an empty layer just because of the distance.

problem that i described in the original message exists because we do not reconsider local opinion if block certificate was received after zdist. if tortoise would reconsider local opinion, and fix "goodness" of the ballots it will remain operational using verifying tortoise. therefore the implementation plan is:

  • add OnHareOutput(LayerID, BlockID) method, that can be called at any time when syncer receives certificate
  • if OnHareOutput called with a block that is still within hdist verifying tortoise will do a cheap rerun (cheap because it will use only in-memory data)
  • during this rerun "goodness" of the ballots will be recomputed, in order for verifying tortoise to make progress
  • as a side effect of the first action tortoise will also start voting supporting the layer

dshulyak avatar Aug 31 '22 08:08 dshulyak

i think the proposed impl make sense.

two additional points:

  • i'm also considering passing block validity from syncer->tortoise in a similar fashion. e.g. OnBlockValidity() or something
  • isn't it true that we still need to support voting against an abstain? does it make sense to add diff that support an empty block id in that case?

countvonzero avatar Aug 31 '22 14:08 countvonzero

i'm also considering passing block validity from syncer->tortoise in a similar fashion. e.g. OnBlockValidity() or something

i am not sure that it is needed. opinion on blocks outside hdist should be saved as validity by syncer

isn't it true that we still need to support voting against an abstain? does it make sense to add diff that support an empty block id in that case?

is it the same question https://github.com/spacemeshos/go-spacemesh/pull/3513#issuecomment-1233249403 ? i don't think that such thing as support for empty block id can exist

dshulyak avatar Sep 01 '22 05:09 dshulyak

i'm also considering passing block validity from syncer->tortoise in a similar fashion. e.g. OnBlockValidity() or something

i am not sure that it is needed. opinion on blocks outside hdist should be saved as validity by syncer

question here is that how does tortoise refreshes its opinions in memory. here you proposed OnHareOutput to be passed via memory (while certificate is saved in db), but block validity outside hdist should be saved in db and not passed via memory. what is your rationale here?

isn't it true that we still need to support voting against an abstain? does it make sense to add diff that support an empty block id in that case?

is it the same question #3513 (comment) ? i don't think that such thing as support for empty block id can exist

yes it is the same. it's a dumb suggestion on my part.

countvonzero avatar Sep 01 '22 21:09 countvonzero

question here is that how does tortoise refreshes its opinions in memory.

tortoise doesn't update state before processed layer by itself, and always needs to be notified the only exception is rerun, because then everything re-downloaded from scratch

i thought that sync never needs to update validity for the layer after calling HandleIncomingLayer. but if there are some conditions when it is needed it probably makes sense to add one more handler

dshulyak avatar Sep 02 '22 04:09 dshulyak