nimbus-eth2
consistent peer scoring for missing non-finalized parent
When the sync queue processes results for a blocks by range request, and the requested range contained some slots that are already finalized, `BlockError.MissingParent` currently leads to `PeerScoreBadBlocks` even when the error occurs on a non-finalized slot in the requested range. This patch changes the scoring in that case to `PeerScoreMissingBlocks` for consistency with range requests solely covering non-finalized slots, and, likewise, rewinds the sync queue to the next `rewindSlot`.
Unit Test Results
12 files ±0  860 suites ±0  1h 13m 22s ⏱️ −5m 3s
1 914 tests ±0  1 766 ✔️ ±0  148 💤 ±0  0 ❌ ±0
10 373 runs ±0  10 167 ✔️ ±0  206 💤 ±0  0 ❌ ±0
Results for commit 84354a9d. ± Comparison against base commit 615f366d.
@cheatfate, shall we merge this?
No, I don't think so. This PR's description is about a different item, and this change actually breaks proper behavior.
My thoughts behind it:

- `req.slot` is the `startSlot` of the `beaconBlocksByRange` P2P request (start/count/step).
- `failSlot` is the first one that reported `MissingParent` when applying them in order (ascending for Forward, descending for Backward).
- `safeSlot` is the finalized head epoch start slot.
If the range request from `req.slot` through `req.slot + count*step - 1` contains `safeSlot` (e.g., the clock advanced, `safeSlot` advanced because of that, and `req.slot` got finalized), then in the forward case, if the error happens after `safeSlot` (i.e., in the non-finalized part), so that `req.slot <= safeSlot < failSlot <= req.slot + count*step - 1`, this currently leads to `PeerScoreBadBlocks`, no?
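The inequality above can be checked with concrete (hypothetical) numbers. The sketch below is a toy Python model of the scoring rule this PR proposes, purely for illustration; the actual implementation is Nim code in the sync queue, and the function name and numbers here are made up:

```python
def classify_missing_parent(req_slot: int, count: int, step: int,
                            safe_slot: int, fail_slot: int) -> str:
    """Toy classifier for the rule proposed in this PR (illustration only).

    A MissingParent at or below the finalized head (safe_slot) means the
    peer served blocks that cannot possibly apply -> bad blocks.  A
    MissingParent strictly above safe_slot may just mean the peer lacks
    some non-finalized data locally -> missing blocks.
    """
    last_slot = req_slot + count * step - 1
    assert req_slot <= fail_slot <= last_slot, "failSlot must lie in the range"
    if fail_slot <= safe_slot:
        return "PeerScoreBadBlocks"
    return "PeerScoreMissingBlocks"

# Range [100, 163] straddles the finalized head at slot 128; the error
# happens at slot 150, i.e. in the non-finalized part of the range:
#   req.slot (100) <= safeSlot (128) < failSlot (150) <= 163
print(classify_missing_parent(100, 64, 1, safe_slot=128, fail_slot=150))
# -> PeerScoreMissingBlocks
```

Under the proposed rule this scenario scores as missing blocks, whereas (per the description above) the current code scores it as bad blocks because the range contains finalized slots.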
And for the backward case, the existing logic seems a bit asymmetric as well, because `req.slot` refers to the lowest slot, i.e., the slot farthest away from `safeSlot`.
But yeah, it could definitely also be that this is not a problem as is, and/or that the proposed change is incorrect.
If `req.slot == safeSlot`, we are in trouble. It means that we have broken logic, a database in some broken state, or a pretty bad peer which returns an incorrect response. If `req.slot <> safeSlot` (depending on the type of syncing, Forward/Backward), it means that we just hit the legal case where the peer's response has gaps.
Neither `req.slot` nor `safeSlot` depends on the peer though; `req.slot == safeSlot` just means that the request we sent out had its `startSlot` at our finalized head (at the time when we received the response). So, if we send such a request, and the peer sends us 30 good blocks, then 1 with a missing parent, because maybe it is still doing some backfill sync and does not have one of the intermediate blocks available, we treat it as `PeerScoreBadBlocks` instead of `PeerScoreMissingBlocks`.
I understand that if the peer sends us blocks < `safeSlot` that do not apply, this is indeed a bad peer. But for future blocks, we don't know if their block is bad; it's just that they may lack some data locally.
If the intention is indeed that behaviour, I still think the backwards sync condition needs an adjustment, checking for `safeSlot >= req.slot + req.count * req.step` to have the equivalent result: `safeSlot` is at the highest slot of the requested range, and `req.slot` is at the lowest one.
EDIT: In that example, the 30 good blocks could actually be enough to advance `safeSlot` due to a finality change, so `safeSlot` could even be > `req.slot` regardless of time advancing.
- Rebase to `unstable`
- Extend tests (only passes with the change from this PR)