Downscore and retry custody failures
## Issue Addressed
Partially addresses
- https://github.com/sigp/lighthouse/issues/6258
## Proposed Changes
TBD
About 1000 of the added lines are new unit tests :)
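To illustrate the behaviour the title describes while the detailed change list is still TBD, here is a minimal, hypothetical sketch of a downscore-and-retry decision on a custody failure. None of the names below (`CustodyRequest`, `on_custody_failure`, `Action`) are existing Lighthouse types; they only show the intended flow: penalise the peer that failed to serve its custody columns, then retry the column with another candidate peer.

```rust
// Hypothetical sketch only, not the actual Lighthouse sync types.
use std::collections::HashMap;

type PeerId = u64;
type ColumnIndex = u64;

#[derive(Default)]
struct CustodyRequest {
    /// How many times each column has failed so far (illustrative).
    failures: HashMap<ColumnIndex, usize>,
}

enum Action {
    /// Downscore the failing peer and retry the column with `retry_peer`.
    DownscoreAndRetry {
        failing_peer: PeerId,
        retry_peer: PeerId,
        column: ColumnIndex,
    },
    /// No other candidate peer available; surface the failure to the caller.
    Fail {
        failing_peer: PeerId,
        column: ColumnIndex,
    },
}

impl CustodyRequest {
    /// Called when `failing_peer` did not return a column it should custody.
    fn on_custody_failure(
        &mut self,
        failing_peer: PeerId,
        column: ColumnIndex,
        candidate_peers: &[PeerId],
    ) -> Action {
        *self.failures.entry(column).or_default() += 1;
        // Retry with any other candidate expected to custody this column.
        match candidate_peers.iter().copied().find(|&p| p != failing_peer) {
            Some(retry_peer) => Action::DownscoreAndRetry {
                failing_peer,
                retry_peer,
                column,
            },
            None => Action::Fail {
                failing_peer,
                column,
            },
        }
    }
}
```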
## Questions / TODO
- Should custody_by_range requests try all possible peers before giving up? i.e. should they ignore the failures counter for custody failures? Add a test for this behaviour.
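For reference, a toy sketch of the "try all possible peers" policy asked about above, using a hypothetical `ColumnRetryState` rather than the real sync types: retries are bounded by the set of untried custody peers instead of a failure counter, and the included unit test asserts that every custody peer is attempted once before giving up.

```rust
// Hypothetical sketch: retry selection that ignores the failure counter and
// only gives up once every custody peer for the column has been tried.
use std::collections::HashSet;

type PeerId = u64;

#[derive(Default)]
struct ColumnRetryState {
    /// Peers already attempted for this column.
    tried: HashSet<PeerId>,
}

impl ColumnRetryState {
    /// Returns the next untried custody peer, or `None` if all were attempted.
    fn next_peer(&mut self, custody_peers: &[PeerId]) -> Option<PeerId> {
        let next = custody_peers
            .iter()
            .copied()
            .find(|p| !self.tried.contains(p))?;
        self.tried.insert(next);
        Some(next)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn exhausts_all_custody_peers_before_giving_up() {
        let peers = vec![1, 2, 3];
        let mut state = ColumnRetryState::default();
        let mut attempted = vec![];
        while let Some(p) = state.next_peer(&peers) {
            attempted.push(p);
        }
        attempted.sort();
        assert_eq!(attempted, peers);
    }
}
```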
## Tests to add
- Peer does not send columns at all for a specific index
- We find no one serving columns for a specific index; how do we recover?
- Randomised tests where most peers are faulty and we have to keep cycling through them. Use randomness and run the tests with different fault percentages => build a simulator (see the sketch after this list)
- Permanently fail a single column many times, but then resolve the rest such that we can reconstruct.
- Send non-matching data columns
- Test downscoring of custody failures
- Test the fallback mechanism of peer selection
- Test a syncing chain reaching 0 peers, both for forward sync and backwards sync
- Test forwards sync on Fulu with blocks without data
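A possible starting point for the randomised fault simulator mentioned above. All names are illustrative assumptions; a real test would drive the sync test harness and Lighthouse peer types instead. A configurable fraction of peers is faulty, and the test checks that cycling through peers still recovers the column at several fault levels.

```rust
// Minimal fault-injection simulator sketch (assumes the `rand` crate, 0.8 API).
use rand::Rng;

#[derive(Clone, Copy)]
struct Peer {
    faulty: bool,
}

/// A request to a faulty peer fails; honest peers respond correctly.
fn request_column(peer: Peer) -> bool {
    !peer.faulty
}

/// Simulate fetching one column by cycling through peers until one succeeds.
fn fetch_with_retries(peers: &[Peer]) -> bool {
    peers.iter().any(|&p| request_column(p))
}

/// One simulation run: mark each peer faulty with probability `fault_pct`.
fn simulate(fault_pct: f64, num_peers: usize, rng: &mut impl Rng) -> bool {
    let peers: Vec<Peer> = (0..num_peers)
        .map(|_| Peer { faulty: rng.gen::<f64>() < fault_pct })
        .collect();
    fetch_with_retries(&peers)
}

#[cfg(test)]
mod tests {
    use super::*;
    use rand::rngs::StdRng;
    use rand::SeedableRng;

    #[test]
    fn sync_survives_moderate_fault_rates() {
        let mut rng = StdRng::seed_from_u64(42);
        // With 16 peers, the chance that a single run has every peer faulty at
        // a 50% fault rate is about 0.0015%, so this is effectively stable.
        for fault_pct in [0.1, 0.3, 0.5] {
            let ok = (0..100).all(|_| simulate(fault_pct, 16, &mut rng));
            assert!(ok, "sync failed at fault rate {fault_pct}");
        }
    }
}
```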
## Future TODOs
- Ensure uniform use of `subscribe_all_data_column_subnets` to determine `is_supernode`
- Ensure consistent naming in sync network context handlers `on_xxx_response`, suggestion -> https://github.com/sigp/lighthouse/pull/7510#discussion_r2107334518
- Cleanup `add_peer_*` functions in sync tests. Best to do in another PR as it touches a lot of lookup sync tests (unrelated)
- Track `requests_per_peer` to ensure that eventually we fetch data from all peers in the chain
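For the last item, a hypothetical shape of a `requests_per_peer` counter that tests could assert on; the struct and field names are assumptions for illustration, not existing Lighthouse fields.

```rust
// Illustrative only: per-chain bookkeeping that a test could inspect to check
// that requests are eventually spread across all peers in the chain.
use std::collections::HashMap;

type PeerId = u64;

#[derive(Default)]
struct SyncingChainStats {
    requests_per_peer: HashMap<PeerId, usize>,
}

impl SyncingChainStats {
    /// Record that a request was sent to `peer`.
    fn on_request_sent(&mut self, peer: PeerId) {
        *self.requests_per_peer.entry(peer).or_default() += 1;
    }

    /// True once every known peer has served at least one request.
    fn all_peers_used(&self, known_peers: &[PeerId]) -> bool {
        known_peers
            .iter()
            .all(|p| self.requests_per_peer.get(p).copied().unwrap_or(0) > 0)
    }
}
```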
I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling to review this in its current state, with potentially missing pieces and a number of outstanding TODOs. I find it quite difficult to understand the changes, assumptions, and intentions, and what the working solution will eventually look like.
I think it might be useful to go through the plan and code together with @pawanjay176, or to re-review this once it's complete. What do you think?
I've just merged latest peerdas-devnet-7 into this branch and CI should pass now. Will do another round of review tomorrow.
I've triggered a CI workflow to run the latest checkpoint sync test on this PR: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
Looks like it's struggling with backfill - the link below is a test run on Sepolia - both nodes transitioned to Synced without completing backfill. The logs are also available on the CI workflow:
https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155087
On PeerDAS, supernode struggles to sync to head, but fullnode did complete backfilling 1000 slots in 506 seconds: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155070
Logs are available on the workflow summary page: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
@dapplion Would you mind taking a look at the failing CI tests in the PR?
Not sure if this is ready for review? It's crashing on mainnet with:
Jun 16 10:40:58.395 CRIT Task panic. This is a bug! location: "/Users/pawan/ethereum/lighthouse/beacon_node/network/src/sync/network_context/block_components_by_range.rs:461:18", msg_id: "TODO: don't do matching here: MissingBlobs", backtrace: 0: std::backtrace::Backtrace::create
Hi @dapplion, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time.
- Superseded by https://github.com/sigp/lighthouse/issues/7678