Downscore and retry custody failures
## Issue Addressed
Partially addresses
- https://github.com/sigp/lighthouse/issues/6258
## Proposed Changes
TBD
About 1000 of the added lines are new unit tests :)
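To illustrate the behaviour the title describes while the detailed change list is still TBD, here is a minimal, hypothetical sketch of a downscore-and-retry decision on a custody failure. None of the names below (`CustodyRequest`, `on_custody_failure`, `Action`) are existing Lighthouse types; they only show the intended flow: penalise the peer that failed to serve its custody columns, then retry the column with another candidate peer.

```rust
// Hypothetical sketch only, not the actual Lighthouse sync types.
use std::collections::HashMap;

type PeerId = u64;
type ColumnIndex = u64;

#[derive(Default)]
struct CustodyRequest {
    /// How many times each column has failed so far (illustrative).
    failures: HashMap<ColumnIndex, usize>,
}

enum Action {
    /// Downscore the failing peer and retry the column with `retry_peer`.
    DownscoreAndRetry {
        failing_peer: PeerId,
        retry_peer: PeerId,
        column: ColumnIndex,
    },
    /// No other candidate peer available; surface the failure to the caller.
    Fail {
        failing_peer: PeerId,
        column: ColumnIndex,
    },
}

impl CustodyRequest {
    /// Called when `failing_peer` did not return a column it should custody.
    fn on_custody_failure(
        &mut self,
        failing_peer: PeerId,
        column: ColumnIndex,
        candidate_peers: &[PeerId],
    ) -> Action {
        *self.failures.entry(column).or_default() += 1;
        // Retry with any other candidate expected to custody this column.
        match candidate_peers.iter().copied().find(|&p| p != failing_peer) {
            Some(retry_peer) => Action::DownscoreAndRetry {
                failing_peer,
                retry_peer,
                column,
            },
            None => Action::Fail {
                failing_peer,
                column,
            },
        }
    }
}
```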
## Questions / TODO
- Should custody_by_range requests try all possible peers before giving up? i.e. should they ignore the failures counter for custody failures? Add a test for this behaviour.
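For reference, a toy sketch of the "try all possible peers" policy asked about above, using a hypothetical `ColumnRetryState` rather than the real sync types: retries are bounded by the set of untried custody peers instead of a failure counter, and the included unit test asserts that every custody peer is attempted once before giving up.

```rust
// Hypothetical sketch: retry selection that ignores the failure counter and
// only gives up once every custody peer for the column has been tried.
use std::collections::HashSet;

type PeerId = u64;

#[derive(Default)]
struct ColumnRetryState {
    /// Peers already attempted for this column.
    tried: HashSet<PeerId>,
}

impl ColumnRetryState {
    /// Returns the next untried custody peer, or `None` if all were attempted.
    fn next_peer(&mut self, custody_peers: &[PeerId]) -> Option<PeerId> {
        let next = custody_peers
            .iter()
            .copied()
            .find(|p| !self.tried.contains(p))?;
        self.tried.insert(next);
        Some(next)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn exhausts_all_custody_peers_before_giving_up() {
        let peers = vec![1, 2, 3];
        let mut state = ColumnRetryState::default();
        let mut attempted = vec![];
        while let Some(p) = state.next_peer(&peers) {
            attempted.push(p);
        }
        attempted.sort();
        assert_eq!(attempted, peers);
    }
}
```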
## Tests to add
- Peer does not send columns at all for a specific index
- We find no one serving columns for a specific index; how do we recover?
- Randomised tests where most peers are faulty and we have to keep cycling through them. Use randomness and run the tests with different fault percentages => build a simulator (see the sketch after this list)
- Permanently fail a single column many times, but then resolve the rest such that we can reconstruct.
- Send non-matching data columns
- Test downscoring of custody failures
- Test the fallback mechanism of peer selection
- Test a syncing chain reaching 0 peers, both for forward sync and backwards sync
- Test forwards sync on Fulu with blocks without data
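A possible starting point for the randomised fault simulator mentioned above. All names are illustrative assumptions; a real test would drive the sync test harness and Lighthouse peer types instead. A configurable fraction of peers is faulty, and the test checks that cycling through peers still recovers the column at several fault levels.

```rust
// Minimal fault-injection simulator sketch (assumes the `rand` crate, 0.8 API).
use rand::Rng;

#[derive(Clone, Copy)]
struct Peer {
    faulty: bool,
}

/// A request to a faulty peer fails; honest peers respond correctly.
fn request_column(peer: Peer) -> bool {
    !peer.faulty
}

/// Simulate fetching one column by cycling through peers until one succeeds.
fn fetch_with_retries(peers: &[Peer]) -> bool {
    peers.iter().any(|&p| request_column(p))
}

/// One simulation run: mark each peer faulty with probability `fault_pct`.
fn simulate(fault_pct: f64, num_peers: usize, rng: &mut impl Rng) -> bool {
    let peers: Vec<Peer> = (0..num_peers)
        .map(|_| Peer { faulty: rng.gen::<f64>() < fault_pct })
        .collect();
    fetch_with_retries(&peers)
}

#[cfg(test)]
mod tests {
    use super::*;
    use rand::rngs::StdRng;
    use rand::SeedableRng;

    #[test]
    fn sync_survives_moderate_fault_rates() {
        let mut rng = StdRng::seed_from_u64(42);
        // With 16 peers, the chance that a single run has every peer faulty at
        // a 50% fault rate is about 0.0015%, so this is effectively stable.
        for fault_pct in [0.1, 0.3, 0.5] {
            let ok = (0..100).all(|_| simulate(fault_pct, 16, &mut rng));
            assert!(ok, "sync failed at fault rate {fault_pct}");
        }
    }
}
```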
## Future TODOs
- Ensure uniform use of `subscribe_all_data_column_subnets` to determine `is_supernode`
- Ensure consistent naming in sync network context handlers `on_xxx_response`, suggestion -> https://github.com/sigp/lighthouse/pull/7510#discussion_r2107334518
- Cleanup `add_peer_*` functions in sync tests. Best to do in another PR as it touches a lot of lookup sync tests (unrelated)
- Track `requests_per_peer` to ensure that eventually we fetch data from all peers in the chain
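For the last item, a hypothetical shape of a `requests_per_peer` counter that tests could assert on; the struct and field names are assumptions for illustration, not existing Lighthouse fields.

```rust
// Illustrative only: per-chain bookkeeping that a test could inspect to check
// that requests are eventually spread across all peers in the chain.
use std::collections::HashMap;

type PeerId = u64;

#[derive(Default)]
struct SyncingChainStats {
    requests_per_peer: HashMap<PeerId, usize>,
}

impl SyncingChainStats {
    /// Record that a request was sent to `peer`.
    fn on_request_sent(&mut self, peer: PeerId) {
        *self.requests_per_peer.entry(peer).or_default() += 1;
    }

    /// True once every known peer has served at least one request.
    fn all_peers_used(&self, known_peers: &[PeerId]) -> bool {
        known_peers
            .iter()
            .all(|p| self.requests_per_peer.get(p).copied().unwrap_or(0) > 0)
    }
}
```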
I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling to review this in its current state, with potentially missing pieces and a number of outstanding TODOs. I find it quite difficult to understand the changes, assumptions, and intentions, and what the working solution will eventually look like.
I think it might be useful to go through the plan and code together with @pawanjay176, or to re-review this once it's complete. What do you think?
I've just merged latest peerdas-devnet-7 into this branch and CI should pass now. Will do another round of review tomorrow.
I've triggered a CI workflow to run the latest checkpoint sync test on this PR: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
Looks like it's struggling with backfill - the link below is a test run on Sepolia - both nodes transitioned to Synced without completing backfill. The logs are also available on the CI workflow:
https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155087
On PeerDAS, supernode struggles to sync to head, but fullnode did complete backfilling 1000 slots in 506 seconds: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155070
Logs are available on the workflow summary page: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
@dapplion Would you mind taking a look at the failing CI tests in the PR?
Not sure if this is ready for review? It's crashing on mainnet with:
Jun 16 10:40:58.395 CRIT Task panic. This is a bug! location: "/Users/pawan/ethereum/lighthouse/beacon_node/network/src/sync/network_context/block_components_by_range.rs:461:18", msg_id: "TODO: don't do matching here: MissingBlobs", backtrace: 0: std::backtrace::Backtrace::create
Hi @dapplion, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time.
- Superseded by https://github.com/sigp/lighthouse/issues/7678