
Downscore and retry custody failures

dapplion opened this issue 7 months ago • 4 comments

Issue Addressed

Partially addresses

  • https://github.com/sigp/lighthouse/issues/6258

Proposed Changes

TBD

1000 lines are for new unit tests :)

Questions / TODO

  • Should custody_by_range requests try all possible peers before giving up? i.e. should they ignore the failure counter for custody failures? Add a test for this behaviour.
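To make the question concrete, here is a minimal sketch (not Lighthouse's actual implementation; `RequestState`, `next_peer`, and the failure counters are hypothetical names) of a retry policy where custody failures do not count toward the generic failure limit, so a `custody_by_range` request keeps cycling peers until every candidate has been tried:

```rust
// Hypothetical retry policy: custody failures only exclude the failing
// peer, while other failures still count toward a hard retry cap.

const MAX_GENERIC_FAILURES: usize = 5;

#[derive(Default)]
#[allow(dead_code)]
struct RequestState {
    generic_failures: usize,
    tried_peers: Vec<u64>, // peer ids already attempted
}

#[allow(dead_code)]
impl RequestState {
    /// Returns the next peer to try, or None if the request should give up.
    fn next_peer(&mut self, candidates: &[u64]) -> Option<u64> {
        if self.generic_failures >= MAX_GENERIC_FAILURES {
            return None; // non-custody failures still cap retries
        }
        // Custody failures only exclude the failing peer; try the rest.
        candidates
            .iter()
            .copied()
            .find(|p| !self.tried_peers.contains(p))
            .map(|p| {
                self.tried_peers.push(p);
                p
            })
    }

    fn on_custody_failure(&mut self) {
        // Intentionally does NOT bump generic_failures.
    }

    fn on_generic_failure(&mut self) {
        self.generic_failures += 1;
    }
}

fn main() {
    // Only peer 3 custodies the column; peers 1 and 2 fail on custody.
    let peers = [1u64, 2, 3];
    let mut req = RequestState::default();
    let mut served_by = None;
    while let Some(peer) = req.next_peer(&peers) {
        if peer == 3 {
            served_by = Some(peer);
            break;
        }
        req.on_custody_failure();
    }
    // The request survives two custody failures and reaches the last peer.
    assert_eq!(served_by, Some(3));
    println!("served_by={:?}", served_by.unwrap());
}
```

Under this policy the only way a request dies is exhausting all candidate peers (or hitting the cap via non-custody errors), which matches the "try all possible peers before giving up" option in the question above.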

Tests to add

  • Peer does not send columns at all on specific index
  • We find no-one serving columns on a specific index; how do we recover?
  • Random tests where most peers are faulty and we have to keep cycling them. Use randomness and run the tests with different levels of fault % → build a simulator
  • Permanently fail a single column many times, but then resolve the rest such that we can reconstruct.
  • Send non-matching data columns
  • Test downscoring of custody failures
  • Test the fallback mechanism of peer selection
  • Test a syncing chain reaching 0 peers, both for forward sync and backwards sync
  • Test forwards sync on Fulu with blocks without data
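The simulator suggested in the list above could look roughly like this: peers fail with probability `fault_pct`, and the fetch logic keeps cycling peers per column until the column is served or all peers were tried. This is an illustrative sketch, not Lighthouse's test harness; a tiny deterministic LCG stands in for a real RNG so runs are reproducible without extra crates:

```rust
// Minimal randomized-fault simulator: fetch one column by cycling peers,
// each of which fails independently with probability `fault_pct`.

struct Lcg(u64);

impl Lcg {
    /// Deterministic pseudo-random f64 in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Try to fetch one column; returns the number of attempts needed,
/// or None if no peer served it.
fn fetch_column(rng: &mut Lcg, num_peers: usize, fault_pct: f64) -> Option<usize> {
    for attempt in 1..=num_peers {
        if rng.next_f64() >= fault_pct {
            return Some(attempt); // this peer served the column
        }
        // custody failure: cycle to the next peer
    }
    None // every peer failed for this column
}

fn main() {
    // Sanity checks at the extremes.
    let mut rng = Lcg(42);
    assert_eq!(fetch_column(&mut rng, 8, 0.0), Some(1)); // no faults: first peer serves
    assert_eq!(fetch_column(&mut rng, 8, 1.0), None); // all faulty: request gives up

    // Sweep fault levels, as the test idea above suggests.
    for &fault_pct in &[0.1, 0.5, 0.9] {
        let (mut ok, mut failed) = (0u32, 0u32);
        for _ in 0..1000 {
            match fetch_column(&mut rng, 8, fault_pct) {
                Some(_) => ok += 1,
                None => failed += 1,
            }
        }
        println!("fault_pct={fault_pct}: ok={ok} failed={failed}");
    }
}
```

Running the sweep at several fault percentages is what exercises the peer-cycling path; the interesting assertions in a real test would be on downscoring and on eventual recovery, not just on success counts.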

Future TODOs

  • Ensure uniform use of subscribe_all_data_column_subnets to determine is_supernode
  • Ensure consistent naming in sync network context handlers on_xxx_response, suggestion -> https://github.com/sigp/lighthouse/pull/7510#discussion_r2107334518
  • Cleanup add_peer_* functions in sync tests. Best to do in another PR as it touches a lot of lookup sync tests (unrelated)
  • Track requests_per_peer to ensure that eventually we fetch data from all peers in the chain
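One hedged sketch of the `requests_per_peer` idea from the last TODO (the `PeerBalancer` type and its methods are hypothetical, not Lighthouse API): track a per-peer request count and always pick the least-used candidate, so over time every peer in the chain gets asked for data.

```rust
use std::collections::HashMap;

// Least-used peer selection: spreads requests evenly across candidates.
#[derive(Default)]
struct PeerBalancer {
    requests_per_peer: HashMap<u64, usize>,
}

impl PeerBalancer {
    /// Choose the candidate with the fewest prior requests
    /// (ties broken by peer id) and record the selection.
    fn select(&mut self, candidates: &[u64]) -> Option<u64> {
        let peer = candidates
            .iter()
            .copied()
            .min_by_key(|p| (self.requests_per_peer.get(p).copied().unwrap_or(0), *p))?;
        *self.requests_per_peer.entry(peer).or_insert(0) += 1;
        Some(peer)
    }
}

fn main() {
    let mut balancer = PeerBalancer::default();
    let peers = [10u64, 20, 30];
    let picks: Vec<_> = (0..6).filter_map(|_| balancer.select(&peers)).collect();
    // Over 6 selections, every peer is used exactly twice.
    for p in &peers {
        assert_eq!(picks.iter().filter(|&&x| x == *p).count(), 2);
    }
    println!("picks={picks:?}");
}
```

In a real sync test, asserting that `requests_per_peer` ends up non-zero for every connected peer would verify the "eventually we fetch data from all peers in the chain" property.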

dapplion avatar May 22 '25 05:05 dapplion

I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling to review this in its current state, with potentially missing pieces and a bunch of outstanding TODOs. I find it quite difficult to understand the changes, assumptions, and intentions, and what the working solution would eventually look like.

I think it might be useful to go through the plan and code together with @pawanjay176, or re-review this once this is complete. What do you think?

jimmygchen avatar May 27 '25 08:05 jimmygchen

I've just merged latest peerdas-devnet-7 into this branch and CI should pass now. Will do another round of review tomorrow.

I've triggered a CI workflow to run the latest checkpoint sync test on this PR: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885

jimmygchen avatar Jun 03 '25 08:06 jimmygchen

Looks like it's struggling with backfill - the link below is a test run on Sepolia - both nodes transitioned to Synced without completing backfill. The logs are also available on the CI workflow: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155087

On PeerDAS, the supernode struggles to sync to head, but the fullnode did complete backfilling 1000 slots in 506 seconds: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155070

Logs are available on the workflow summary page: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885

jimmygchen avatar Jun 03 '25 13:06 jimmygchen

@dapplion Would you mind taking a look at the failing CI tests in the PR?

jimmygchen avatar Jun 04 '25 06:06 jimmygchen

Not sure if this is ready for review? It's crashing on mainnet with:

Jun 16 10:40:58.395 CRIT  Task panic. This is a bug!    location: "/Users/pawan/ethereum/lighthouse/beacon_node/network/src/sync/network_context/block_components_by_range.rs:461:18", msg_id: "TODO: don't do matching here: MissingBlobs", backtrace:    0: std::backtrace::Backtrace::create

pawanjay176 avatar Jun 16 '25 08:06 pawanjay176

Hi @dapplion, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time.

mergify[bot] avatar Jul 12 '25 17:07 mergify[bot]

  • Superseded by https://github.com/sigp/lighthouse/issues/7678

dapplion avatar Jul 13 '25 10:07 dapplion