nimbus-eth2 icon indicating copy to clipboard operation
nimbus-eth2 copied to clipboard

consistent peer scoring for missing non-finalized parent

Open etan-status opened this issue 2 years ago • 7 comments

When the sync queue processes results for a blocks by range request, and the requested range contained some slots that are already finalized, BlockError.MissingParent currently leads to PeerScoreBadBlocks even when the error occurs on a non-finalized slot in the requested range. This patch changes the scoring in that case to PeerScoreMissingBlocks for consistency with range requests solely covering non-finalized slots, and, likewise, rewinds the sync queue to the next rewindSlot.

etan-status avatar Feb 12 '22 19:02 etan-status

Unit Test Results

       12 files  ±0       860 suites  ±0   1h 13m 22s :stopwatch: - 5m 3s   1 914 tests ±0    1 766 :heavy_check_mark: ±0  148 :zzz: ±0  0 :x: ±0  10 373 runs  ±0  10 167 :heavy_check_mark: ±0  206 :zzz: ±0  0 :x: ±0 

Results for commit 84354a9d. ± Comparison against base commit 615f366d.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Feb 12 '22 21:02 github-actions[bot]

@cheatfate, shall we merge this?

zah avatar Feb 15 '22 21:02 zah

No i dont think so. This PR legend is about different item and this change actually breaks proper behavior.

cheatfate avatar Feb 15 '22 21:02 cheatfate

My thoughts behind it:

  • req.slot is the startSlot of beaconBlocksByRange P2P request (start/count/step).
  • failSlot is the first one that reported MissingParent when applying them in-order (ascending for Forward, and descending for Backward).
  • safeSlot is the finalized head epoch start slot.

If the range request from req.slot through req.slot+count*step-1 contains safeSlot, e.g., clock advanced, safeSlot advanced because of that, and req.slot got finalized, then, in forward case if error happens after safeSlot (i.e. in non-finalized part), req.slot <= safeSlot < failSlot <= req.slot+count*step-1, this currently leads to PeerScoreBadBlocks, no?

And for backward case, the existing logic seems a bit asymmetric as well, because req.slot refers to the lowest slot, i.e. the slot farthest away from safeSlot.

But yeah, it could definitely also be that this is not a problem as is, and/or that the proposed change is incorrect.

etan-status avatar Feb 15 '22 22:02 etan-status

if req.slot == safeSlot we are in trouble. It means that we have broken logic or database in some broken state, or we have pretty bad peer which returns incorrect response. if req.slot <> safeSlot depends on type of syncing (Forward/Backward) means that we just met legal case when peer's response are with gaps.

cheatfate avatar Feb 15 '22 23:02 cheatfate

Neither req.slot nor safeSlot depend on the peer though, req.slot == safeSlot just means that the request we sent out had its startSlot at our finalized head (at time when we received the response).

So, if we send such a request, and peer sends us 30 good blocks, then 1 with missing parent, because maybe it is still doing some backfill sync and does not have one of the intermediate blocks available, we treat it as PeerScoreBadBlocks instead of PeerScoreMissingBlocks. I understand that if the peer sends us blocks < safeSlot that do not apply, that this is indeed a bad peer. But for future blocks, we don't know if their block is bad, it's just that they may lack some data locally.

If the intention is indeed that behaviour, I still think the backwards sync condition needs an adjustment, checking for safeSlot >= req.slot + req.count * req.step to have the equivalent result. safeSlot is at the highest slot of the requested range, and req.slot is at the lowest one.

EDIT: In that example, the 30 good blocks could actually be enough to advance safeSlot due to a finality change, so safeSlot could even be > req.slot regardless of time advancing.

etan-status avatar Feb 15 '22 23:02 etan-status

  • Rebase to unstable
  • Extend tests (only passes with the change from this PR)

etan-status avatar Jun 01 '22 10:06 etan-status