lighthouse
lighthouse copied to clipboard
Improving blob propagation post-PeerDAS with Decentralized Blob Building
Issue Addressed
Built on top of #5829, an optimization came up by @michaelsproul, to fetches blobs from the EL to reduce the delay to block import.
This PR goes further to publish the blobs to the network, which helps improve the resiliency of the network, by having nodes with more resources contribute to blob propagation. This experimental solution is an attempt to solve the self building proposer bandwidth issue discussed on R&D Discord and described in @dankrad's post here.
The benefits of this proposals are:
- Reduces block import latency: nodes can retrieve blobs from EL without waiting for them from gossip, hence making blocks attestable earlier.
- Improves blob propagation and network resiliency: blob propagation work from is spread out from 1 node to the entire network, which reduces the likelihood of missed block due to delays in propagtion.
- Allows scaling without sacrificing decentralization: nodes with more resources will participate in blob building and propagation, allowing nodes with limited bandwidth to continue to produce block post-PeerDAS.
Proposed Changes
- Deneb:
fetch_blobs_and_publishis triggered after a node has processed a gossip / rpc block and is still missing blob components. Once the node fetches the blob from EL, it then publishes the remaining blobs that hasn't seen on gossip to the network. - PeerDAS: Same trigger as above, however only supernodes will publish data columns that are unseen on gossip to the network.
Next steps:
- To maintain low bandwidth for smaller stakers (single validator BN), we could allow some optimisation on block publish behaviour for these nodes only. There are some strategies proposed by @cskiraly to bring the outbound bandwidth requirements for a 32 blobs block to the same level as Deneb (6 blobs). However this wouldn't be recommended for nodes with enough bandwidth.
- Collect some realistic metrics for a network with 32 blobs per block.
Challenges:
- Current KZG libraries (
c-kzg-4844andrust-eth-kzg) may struggle with constructing large number of cells and proofs at once due to the current memory allocation approach. - Even if we are able reduce the bandwidth usage on the CL side, the bandwidth challenge remains on the EL side, as the node still need to pull the blob transactions into its mempool, to a lesser extent though, because:
- it's dealing with raw blobs (4096kb for 32 blobs) rather than erasure coded blobs
- it's pulled-based (
eth/68) hence doesn't incur the same gossip amplification cost (8x) on the CL.
TODO before merging
- [x] Wait for spec to be agreed on by EL clients
- https://github.com/ethereum/consensus-specs/pull/3864
- https://github.com/ethereum/execution-apis/pull/559
Reference:
- @michaelsproul's original PR targeting
unstablehttps://github.com/sigp/lighthouse/pull/5829 - R&D Discord discussion thread
- How to help self builders with blobs by @dankrad
- Fetch blobs from EL pool by @dapplion
Some early testing results:
- Proposers withholding all blobs propose blocks with blobs with 100% success rate
- No outbound bandwidth spike for the full nodes with limited upload
Bandwidth-limited fullnode (cl-01) vs supernode (cl-02):
(Thanks to @KatyaRyazantseva for the dashboard above☺️ )
This shows EL inbound traffic (fetch blos from peers) isn't too bad for MAX 6 blobs The outbound traffic for EL is less relevant here because it includes sending blobs to CL.

Next steps:
- Add more metrics
- Blocks made available via EL blobs
- Number of blobs / data columns from EL blobs published
- EL blob fetch timing
- Compute cells and proof time
- Make
MAX_BLOBS_PER_BLOCKconfigurable - Try 32 blobs per block
- EL gas constant update
- Potential update on derived configs
- Potential batching of KZG computation to avoid overflow
- Add
--limit-blob-publish(single validator only), which allows for lower mesh peers for data columns topics and withholding certain amount of data columns
This was originally intended experimental but it's been pretty stable, 800 epochs on devnet and 100% participation, and I think we can merge this. I'll address the review comments, thanks @dapplion for the review 🙏
Add --limit-blob-publish (single validator only), which allows for lower mesh peers for data columns topics and withholding certain amount of data columns
Bring the attack to prod
@jimmygchen Any issues with the kzg libraries?
Bring the attack to prod
🙈 Trying to not mention the flag I use for testing. Luckily git blame on that line shows someone else's name haha
Just realized we need the execution-apis PR is merged before merge this - although there's probably no harm getting this one merged as soon as the inclusion is confirmed, as the fetch_blobs function handles this gracefully.
@jimmygchen Any issues with the kzg libraries?
Nope all good so far! I was just flagging the potential challenges that I could imagine when we increase blob count, but I haven't actually run into any issues yet. Will keep you updated, thanks ☺️
engine_getBlobsV1 now implemented in
- Nethermind : https://github.com/NethermindEth/nethermind/pull/7322 (merged)
- Reth: https://github.com/paradigmxyz/reth/pull/9723 (open)
I think we should review blob publishing before merging this to avoid adding excessive bandwidth usage for Deneb.
Be great to have https://github.com/sigp/lighthouse/pull/6403 merged first, as there will be some more conflicts that needs to be resolved, and a bit easier in this order. We also need to prioritise block publishing, could do it in this PR.
Remaining TODOs
- [x] prioritise block publishing (https://github.com/sigp/lighthouse/pull/6268/commits/0a2c6f76c790e2117ed4a225ef312795eb26f240)
- [x] send
IDONTWANTon publish- #6513
I'm observing a few things from testing:
The node sometimes end up publishing all blobs, if the pending component is not found in availability cache after the block has been made available (cached_blob_indexes returns None below):
https://github.com/sigp/lighthouse/blob/0f32b73ebc1505b3495d3719e158ac4630d7020b/beacon_node/beacon_chain/src/fetch_blobs.rs#L153-L165
The evidence is that we also see a blob processing error DuplicateFullyImported when trying to process the already imported blobs:
Oct 30 10:06:02.780 DEBG Fetching blobs from the EL, num_expected_blobs: 6, block_root: 0x8ae0291f630ab24311fa859a2e86c6fb8995fb2df420516ecbbe3a8eaa339ebf, service: fetch_engine_blobs, service: beacon, module: beacon_chain::fetch_blobs:78
Oct 30 10:06:02.968 DEBG Processing engine blobs, num_fetched_blobs: 6, block_root: 0x8ae0291f630ab24311fa859a2e86c6fb8995fb2df420516ecbbe3a8eaa339ebf, service: fetch_engine_blobs, service: beacon, module: beacon_chain::fetch_blobs:170
Oct 30 10:06:02.968 ERRO Error fetching or processing blobs from EL, block_hash: 0x8ae0291f630ab24311fa859a2e86c6fb8995fb2df420516ecbbe3a8eaa339ebf, error: BlobProcessingError(DuplicateFullyImported(0x8ae0291f630ab24311fa859a2e86c6fb8995fb2df420516ecbbe3a8eaa339ebf)), module: network::network_beacon_processor:937
Oct 30 10:06:03.871 DEBG Batch blob publication complete, block_root: 0x8ae0291f630ab24311fa859a2e86c6fb8995fb2df420516ecbbe3a8eaa339ebf, published_count: 6, blob_count: 6, batch_interval: 300, service: beacon, module: network::network_beacon_processor:1066
Possible solution:
- ~~Check fork choice again before publishing and processing engine blobs~~
- Downgrade error
BlockError::DuplicateFullyImportedto a debug log, as this race can sometimes happens
UPDATE: I think a better solution would be gossip verifying the blobs prior to publishing - this is consistent with our blob publishing logic and if the blob has already been observed, then we would get GossipBlobError::RepeatBlob and not publish and process the duplicate blob.
@dapplion @michaelsproul I've made a few more changes and have tested this locally. This is ready for review now:
- Fix invalid inclusion proof for EL blobs
- Only publish blobs triggered from gossip block and not RPC blocks
I'll deploy this to some testnet nodes.
Some findings from mainnet testing from a small sample size:
- EL
getBlobsonly gets triggered once roughly every ~5-10 mins. This is likely because relays publish blobs before blocks. This may change in the future though when blob count increases, or if nodes prioritise block publishing. - Interestingly, the pattern is very inconsistent (sometimes
getBlobscould get triggered for 5-6 consecutive blocks), and it seems to be happen more onultrasoundrelay and non builder blocks. (perhaps they don't prioritise sending blobs as much as other relays?) - A large percentage of them are blocks with 5 or 6 blobs - which is what we expect, as proposer would have to disseminate more blob data to its peers and more likely to have delays.
- With the gradual publication optimisation, we only publish about 1 blob roughly every 10 mins, so it should have minimal impact on outbound bandwidth.
@mergify queue
queue
✅ The pull request has been merged automatically
The pull request has been merged automatically at 5f053b0b6dfb7dba8f4455b6fea1d3e6777cea99
One more thing I was thinking about - we may not want to always fetch blobs on rpc response, as it only makes sense / works if we're close to the head slot, but I guess process_rpc_block only happens if we're close to the head (block lookup and head sync), so I don't think there's anything we need to address here, but just wanted to keep a note.
https://github.com/sigp/lighthouse/blob/8060e86275f169b26e16b41e54a7e2e50fd33774/beacon_node/network/src/network_beacon_processor/sync_methods.rs#L192-L199