lighthouse icon indicating copy to clipboard operation
lighthouse copied to clipboard

Sync lookup requests blocks from peers that may not have it

Open dapplion opened this issue 1 year ago • 1 comments

Sync lookup NoResponseReturned downscoring issue

Creating a child lookup with a peer from UnknownParentBlock is not okay since that peer may not know about the block. Consider the following sequence of events:

  • Peer receives blob of block before block
  • Peer fowards blob of block to us
  • We don't know the parent of block and trigger UnknownParentBlob(parent, block, peer)
  • Create a lookup for parent
  • Create a lookup for block
  • Lookup for block requests block to peer
  • Peer respons with empty
  • We penalize peer with NoResponseReturned

Instead we should:

  1. Wait for a gossip attestation and request that peer for the block, or an unknown parent event for its child block
  2. Don't create a current lookup. Cache child components somewhere else, or not at all
  3. Allow peers to return empty. This will cause "zombie" lookups where no peer has the block but we keep the lookup around somehow

Solution 1

Create a child lookup with 0 peers and cache the child components there. Scenarios:

  • Block has no blobs, so after the parent finishes processing the child block is imported immediately
  • Block has blobs, but lookup still has 0 peers. What to do? The lookup will fail and the cached block will be dropped

Example A:

  • UnknownParentBlock(parent, block)
  • Create lookup for block
  • Resolve parent
  • Lookup for block dropped because NoPeers

Example B:

  • UnknownParentBlock(parent, block)
  • Create lookup for block
  • UnknownBlockFromAttestation(block)
  • Resolve parent
  • Lookup for block continues with new peer

Solution 2

Don't create current block lookup on UnknownParentBlock events, instead cache child components somewhere else. They can be cached in the sync network context. Then when the lookup wants to request blocks, immediately reply with the cached blocks.

This is a bit ugly because it forces the network context into sending sync events. It introduces a cache of blocks that should be pruned with some timeout

We can also choose to not cache child components at all


Thoughts? Any better solution?

dapplion avatar May 03 '24 08:05 dapplion

I think in this case, we should be creating the lookup only for the parent and not for the block referenced by the blob we received over gossip. We can wait for the current block components to arrive over gossip or wait for an attestation referencing the block to start a request for the current block.

This scenario should be quite rare in the 4844 world. This is because we haven't heard about a valid block(the parent block in your scenario), or an attestation referencing the valid parent block for the entire duration of its slot and are hearing about it only when its suddenly referenced by the current block. If we start getting attestations for the current block, we just retrieve it over rpc and take the minor hit in bandwidth. I feel handling this case explicitly will add code complexity which isn't worth maintaining.

pawanjay176 avatar May 03 '24 21:05 pawanjay176

Fixed with

  • https://github.com/sigp/lighthouse/pull/5724

dapplion avatar May 24 '24 18:05 dapplion