lodestar icon indicating copy to clipboard operation
lodestar copied to clipboard

[fulu] Sync stuck due to peers returning 0 block for beacon_block_by_range request

Open twoeths opened this issue 5 months ago • 9 comments

Describe the bug

sync was stuck with this log

2025-08-08 11:07:09.405	
Aug-08 04:07:09.405[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=8, status=AwaitingProcessing, blocks=0, peer=16...mAswsc



2025-08-08 11:07:09.177	
Aug-08 04:07:09.176[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=8, status=AwaitingDownload, peer=16...mAswsc
2025-08-08 11:07:09.177	
Aug-08 04:07:09.176[sync]          verbose: Batch download error id=Finalized-0, startEpoch=8, status=Downloading, peer=16...aDqVMg, code=REQUEST_ERROR_TTFB_TIMEOUT
2025-08-08 11:07:04.143	
Aug-08 04:07:04.143[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=8, status=AwaitingDownload, peer=16...aDqVMg
2025-08-08 11:07:04.143	
Aug-08 04:07:04.143[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=6
2025-08-08 11:07:03.947	
Aug-08 04:07:03.947[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=5, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...dRN2bf
2025-08-08 11:07:03.941	
Aug-08 04:07:03.941[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=6, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...dRN2bf
2025-08-08 11:07:03.581	
Aug-08 04:07:03.581[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=7, status=AwaitingProcessing, blocks=0, peer=16...mAswsc
2025-08-08 11:07:03.344	
Aug-08 04:07:03.343[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=5, status=AwaitingDownload, peer=16...dRN2bf
2025-08-08 11:07:03.344	
Aug-08 04:07:03.343[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=7, status=AwaitingDownload, peer=16...mAswsc
2025-08-08 11:07:03.344	
Aug-08 04:07:03.343[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=6, status=AwaitingDownload, peer=16...dRN2bf
2025-08-08 11:07:03.343	
Aug-08 04:07:03.343[sync]          verbose: Batch validation error id=Finalized-0, startEpoch=6, status=AwaitingValidation
2025-08-08 11:07:03.343	
Aug-08 04:07:03.343[sync]          verbose: Batch validation error id=Finalized-0, startEpoch=5, status=AwaitingValidation
2025-08-08 11:07:03.343	
Aug-08 04:07:03.343[sync]          verbose: Batch process error id=Finalized-0, startEpoch=7, status=Processing, code=BLOCK_ERROR_PARENT_UNKNOWN, parentRoot=0x907ff88e87331029dba28ecb50b0dcb8d073e51b069369dd2b357a9610cd3b1e
2025-08-08 11:07:03.308	
Aug-08 04:07:03.308[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=7, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...dRN2bf
2025-08-08 11:07:02.660	
Aug-08 04:07:02.660[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=7, status=AwaitingDownload, peer=16...dRN2bf
2025-08-08 11:07:02.660	
Aug-08 04:07:02.659[sync]          verbose: Batch download error id=Finalized-0, startEpoch=7, status=Downloading, peer=16...aDqVMg, code=REQUEST_ERROR_TTFB_TIMEOUT
2025-08-08 11:07:02.609	
Aug-08 04:07:02.609[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=6, status=AwaitingProcessing, blocks=0, peer=16...mAswsc
2025-08-08 11:07:02.400	
Aug-08 04:07:02.388[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=6, status=AwaitingDownload, peer=16...mAswsc
2025-08-08 11:07:02.400	
Aug-08 04:07:02.388[sync]          verbose: Batch download error id=Finalized-0, startEpoch=6, status=Downloading, peer=16...aDqVMg, code=REQUEST_ERROR_TTFB_TIMEOUT
2025-08-08 11:06:57.626	
Aug-08 04:06:57.626[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=7, status=AwaitingDownload, peer=16...aDqVMg
2025-08-08 11:06:57.626	
Aug-08 04:06:57.625[sync]          verbose: Batch download error id=Finalized-0, startEpoch=7, status=Downloading, peer=16...fBGqqd, code=REQUEST_ERROR_RESP_TIMEOUT
2025-08-08 11:06:57.537	
Aug-08 04:06:57.537[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=7, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:57.537	
Aug-08 04:06:57.537[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=5
2025-08-08 11:06:57.357	
Aug-08 04:06:57.357[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=6, status=AwaitingDownload, peer=16...aDqVMg
2025-08-08 11:06:57.357	
Aug-08 04:06:57.357[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=4
2025-08-08 11:06:57.323	
Aug-08 04:06:57.323[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=5, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...fBGqqd
2025-08-08 11:06:57.226	
Aug-08 04:06:57.226[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=5, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:57.226	
Aug-08 04:06:57.225[sync]          verbose: Batch download error id=Finalized-0, startEpoch=5, status=Downloading, peer=16...aDqVMg, code=REQUEST_ERROR_TTFB_TIMEOUT
2025-08-08 11:06:57.146	
Aug-08 04:06:57.146[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=4, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...fBGqqd
2025-08-08 11:06:57.017	
Aug-08 04:06:57.016[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=4, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:57.017	
Aug-08 04:06:57.016[sync]          verbose: Batch download error id=Finalized-0, startEpoch=4, status=Downloading, peer=16...aDqVMg, code=REQUEST_ERROR_TTFB_TIMEOUT
2025-08-08 11:06:52.191	
Aug-08 04:06:52.190[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=5, status=AwaitingDownload, peer=16...aDqVMg
2025-08-08 11:06:52.190	
Aug-08 04:06:52.190[sync]          verbose: Batch download error id=Finalized-0, startEpoch=5, status=Downloading, peer=16...fBGqqd, code=REQUEST_ERROR_RESP_TIMEOUT
2025-08-08 11:06:52.097	
Aug-08 04:06:52.097[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=5, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:52.097	
Aug-08 04:06:52.097[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=3
2025-08-08 11:06:51.986	
Aug-08 04:06:51.985[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=4, status=AwaitingDownload, peer=16...aDqVMg
2025-08-08 11:06:51.985	
Aug-08 04:06:51.985[sync]          verbose: Batch download error id=Finalized-0, startEpoch=4, status=Downloading, peer=16...fBGqqd, code=REQUEST_ERROR_RESP_TIMEOUT
2025-08-08 11:06:51.858	
Aug-08 04:06:51.858[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=4, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:51.858	
Aug-08 04:06:51.858[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=2
2025-08-08 11:06:51.650	
Aug-08 04:06:51.650[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=2, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...fBGqqd
2025-08-08 11:06:51.520	
Aug-08 04:06:51.519[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=2, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:51.520	
Aug-08 04:06:51.519[sync]          verbose: Batch download error id=Finalized-0, startEpoch=2, status=Downloading, peer=16...aDqVMg, code=REQUEST_ERROR_TTFB_TIMEOUT
2025-08-08 11:06:50.420	
Aug-08 04:06:50.420[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=3, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...fBGqqd
2025-08-08 11:06:50.292	
Aug-08 04:06:50.292[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=3, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:50.292	
Aug-08 04:06:50.292[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=1
2025-08-08 11:06:49.856	
Aug-08 04:06:49.855[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=0, status=AwaitingProcessing, blocks=31, blobs=0, dataColumns=0, peer=16...fBGqqd
2025-08-08 11:06:49.725	
Aug-08 04:06:49.725[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=0, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:49.725	
Aug-08 04:06:49.724[sync]          verbose: Batch download error id=Finalized-0, startEpoch=0, status=Downloading, peer=16...aDqVMg, code=REQUEST_ERROR_TTFB_TIMEOUT
2025-08-08 11:06:46.488	
Aug-08 04:06:46.488[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=1, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...fBGqqd
2025-08-08 11:06:46.486	
Aug-08 04:06:46.486[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=2, status=AwaitingDownload, peer=16...aDqVMg
2025-08-08 11:06:46.357	
Aug-08 04:06:46.357[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=1, status=AwaitingDownload, peer=16...fBGqqd
2025-08-08 11:06:44.685	
Aug-08 04:06:44.685[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=0, status=AwaitingDownload, peer=16...aDqVMg
2025-08-08 10:58:16.750	
Aug-08 03:58:16.750[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=4, status=AwaitingProcessing, blocks=0, peer=16...VbgfFx
2025-08-08 10:58:16.750	
Aug-08 03:58:16.750[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=3, status=AwaitingProcessing, blocks=0, peer=16...VbgfFx
2025-08-08 10:58:16.697	
Aug-08 03:58:16.697[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=4, status=AwaitingDownload, peer=16...VbgfFx
2025-08-08 10:58:16.697	
Aug-08 03:58:16.697[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=3, status=AwaitingDownload, peer=16...VbgfFx
2025-08-08 10:58:16.697	
Aug-08 03:58:16.697[sync]          verbose: Batch download error id=Finalized-0, startEpoch=3, status=Downloading, peer=16...VbgfFx, code=REQUEST_ERROR_RESP_TIMEOUT
2025-08-08 10:58:16.627	
Aug-08 03:58:16.627[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=2
2025-08-08 10:58:16.613	
Aug-08 03:58:16.613[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=3, status=AwaitingDownload, peer=16...VbgfFx
2025-08-08 10:58:16.613	
Aug-08 03:58:16.613[sync]          verbose: Batch download error id=Finalized-0, startEpoch=3, status=Downloading, peer=16...VbgfFx, code=REQUEST_ERROR_RESP_TIMEOUT
2025-08-08 10:58:16.527	
Aug-08 03:58:16.527[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=3, status=AwaitingDownload, peer=16...VbgfFx
2025-08-08 10:58:16.527	
Aug-08 03:58:16.527[sync]          verbose: Batch download error id=Finalized-0, startEpoch=3, status=Downloading, peer=16...VbgfFx, code=REQUEST_ERROR_RESP_TIMEOUT
2025-08-08 10:58:16.435	
Aug-08 03:58:16.435[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=3, status=AwaitingDownload, peer=16...VbgfFx
2025-08-08 10:58:16.435	
Aug-08 03:58:16.435[sync]          verbose: Advanced chain id=Finalized-0, lastEpochWithProcessBlocks=1
2025-08-08 10:58:16.110	
Aug-08 03:58:16.110[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=2, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...VbgfFx
2025-08-08 10:58:16.105	
Aug-08 03:58:16.105[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=1, status=AwaitingProcessing, blocks=32, blobs=0, dataColumns=0, peer=16...VbgfFx
2025-08-08 10:58:16.012	
Aug-08 03:58:16.012[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=2, status=AwaitingDownload, peer=16...VbgfFx
2025-08-08 10:58:16.012	
Aug-08 03:58:16.012[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=1, status=AwaitingDownload, peer=16...VbgfFx
2025-08-08 10:58:16.012	
Aug-08 03:58:16.011[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=0, status=AwaitingProcessing, blocks=31, blobs=0, dataColumns=0, peer=16...VbgfFx
2025-08-08 10:58:15.879	
Aug-08 03:58:15.879[sync]          verbose: Downloading batch id=Finalized-0, startEpoch=0, status=AwaitingDownload, peer=16...VbgfFx

Expected behavior

Sync should not stuck

Steps to reproduce

No response

Additional context

No response

Operating system

Linux

Lodestar version or commit hash

peerDAS

twoeths avatar Aug 08 '25 04:08 twoeths

in this case batch with startEpoch=7 was downloaded before batches with startEpoch = 5 and startEpoch = 6

we should only process a batch if we processed the parent block like in UnknownBlockSync or earlier batches processed

if not, leave it in AwaitingProcessing state

twoeths avatar Aug 08 '25 04:08 twoeths

export function getNextBatchToProcess(batches: Batch[]): Batch | null {
  for (const batch of batches) {
    switch (batch.state.status) {
      // If an AwaitingProcessing batch exists it can only be preceeded by AwaitingValidation
      case BatchStatus.AwaitingValidation:
        break;

      case BatchStatus.AwaitingProcessing:
        return batch;

      // There MUST be no AwaitingProcessing state after AwaitingDownload, Downloading, Processing
      case BatchStatus.AwaitingDownload:
      case BatchStatus.Downloading:
      case BatchStatus.Processing:
        return null;
    }
  }
  // Exhausted batches
  return null;
}

it should already works well, the problem is another lodestar node returned 0 block which caused this issue see #8149

  |   | 2025-08-08 11:07:09.405 | Aug-08 04:07:09.405[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=8, status=AwaitingProcessing, blocks=0, peer=16...mAswsc |  
-- | -- | -- | -- | --
  |   | 2025-08-08 11:07:03.581 | Aug-08 04:07:03.581[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=7, status=AwaitingProcessing, blocks=0, peer=16...mAswsc |  
  |   | 2025-08-08 11:07:02.609 | Aug-08 04:07:02.609[sync]            debug: Downloaded batch id=Finalized-0, startEpoch=6, status=AwaitingProcessing, blocks=0, peer=16...mAswsc

so batch with startEpoch = 6 was processed but no movement, I think we should abort the whole chain to try another chain in this case because the current chain is blocked

it means set MAX_BATCH_PROCESSING_ATTEMPTS to 1

twoeths avatar Aug 08 '25 07:08 twoeths

the 0 block batch happens with a lot of clients

  • lighthouse

verbose: Batch download error id=Finalized-2, startEpoch=0, status=Downloading, peer=16...BQ8fLQ - peerId=16Uiu2HAmFsDPzqKFfMuiDTBiRBNALSao7RF2p56TaD4FneBQ8fLQ peerClient=Lighthouse-cgc:4 returns no blocks or allBlobSidecars, blocks=32, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":0,"count":32,"step":1} |  
-- | --
  |   | 2025-08-08 15:46:51.098 | Error: peerId=16Uiu2HAkz9r249RxF7Bt8TNvn1SBgAQ5NimhQs9Hqjho4WAydFWW peerClient=Lighthouse-cgc:9 returns no blocks or allBlobSidecars, blocks=32, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":0,"count":32,"step":1}
  • lodestar
verbose: Batch download error id=Finalized-3, startEpoch=1, status=Downloading, peer=16...3s9tzk - peerId=16Uiu2HAkzLKQvYFcpAWvqtH3wi1oNGvbRK7Zs4uLbT3ScL3s9tzk peerClient=Lodestar-cgc:9 returns no blocks or allBlobSidecars, blocks=0, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":32,"count":32,"step":1} |  
-- | --
  |   | 2025-08-08 15:47:27.659 | Error: peerId=16Uiu2HAmTwQPAwzTr6hTgBmKNecCfH6kP3Kbzxj36ZRyyQ46L6gf peerClient=Grandine-cgc:9 returns no blocks or allBlobSidecars, blocks=31, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":0,"count":32,"step":1}
  • teku
verbose: Batch download error id=Finalized-2, startEpoch=117, status=Downloading, peer=16...aDqVMg - peerId=16Uiu2HAm4MA5vRt6hKoDButucwmUu7ozzJ7FMAuDCovQc6aDqVMg peerClient=Teku-cgc:128 returns no blocks allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":3744,"count":32,"step":1} |  
-- | --
  |   | 2025-08-08 15:51:07.305 | Error: peerId=16Uiu2HAm5DeH9EEx9HiAhTndqiipZgJWjVefjU6gEBTG85oSbYDd peerClient=Lighthouse-cgc:9 returns no blocks or allBlobSidecars, blocks=21, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":0,"count":32,"step":1}
  • nimbus
Aug-08 08:51:07.117[sync]          verbose: Batch download error id=Finalized-12, startEpoch=1, status=Downloading, peer=16...nBiz6e - peerId=16Uiu2HAmMo4wfCrhELRvHxcnM84wjKuNpZjYZSXHqB1LtfnBiz6e peerClient=Nimbus-cgc:128 returns no blocks or allBlobSidecars, blocks=32, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":32,"count":32,"step":1} |  
-- | --
  |   | 2025-08-08 15:51:06.995 | Error: peerId=16Uiu2HAm9CkczPnteM79gszVh2R3XhcgxZvRE5aqjcDpVNQ3V4NH peerClient=Nimbus-cgc:4 returns no blocks or allBlobSidecars, blocks=0, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":64,"count":32,"step":1}
  • grandine
verbose: Batch download error id=Finalized-8, startEpoch=0, status=Downloading, peer=16...gjzNA8 - peerId=16Uiu2HAmMDB5uUePTpN7737m78ehePfWPtBL9qMGdH8kCygjzNA8 peerClient=Grandine-cgc:128 returns no blocks or allBlobSidecars, blocks=31, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":0,"count":32,"step":1} |  
-- | --
  |   | 2025-08-08 15:50:00.299 | Error: peerId=16Uiu2HAmFsDPzqKFfMuiDTBiRBNALSao7RF2p56TaD4FneBQ8fLQ peerClient=Lighthouse-cgc:128 returns no blocks or allBlobSidecars, blocks=32, allBlobSidecars=0 for BeaconBlocksByRangeRequest {"startSlot":0,"count":32,"step":1}

looks like prysm does not have this issue

twoeths avatar Aug 08 '25 09:08 twoeths

prysm also return 0 block

Error: peerId=16Uiu2HAm4bkvXdyzkh1wjRPXSXJmYkze2Xr9ZcXnCjm5PsBExJi3 peerClient=Prysm-cgc:128 returns no blocks dataColumnSidecars=3584 for BeaconBlocksByRangeRequest {"startSlot":22304,"count":32,"step":1}

twoeths avatar Aug 08 '25 10:08 twoeths

Open this issue back up. Is possible that we may not have been respecting earliestAvaliableSlot or something else. Its possible for a ByRange request to have 0 blocks in it during a period where chain liveness is an issue an there is an epoch with 0 blocks.

Reopening so we can revisit this when looking at sync again.

matthewkeil avatar Sep 02 '25 18:09 matthewkeil

we already check earliestAvailableSlot in #8150

twoeths avatar Sep 03 '25 02:09 twoeths

Its possible for a ByRange request to have 0 blocks in it during a period where chain liveness is an issue an there is an epoch with 0 blocks.

as far as I know, there is no epoch with 0 blocks on any networks (if yes please specify @matthewkeil )

the issue is some clients return 0 blocks incorrectly (including lodestar) which made the sync stuck on devnets because we trust the result. With throwing an error, SyncChain will have a chance to retry with other peers serving us data

also instead of returning 0 blocks, peer should throw RESOURCE_UNAVAILABLE instead according to the spec, if not we throw error instead as in #8150

twoeths avatar Sep 12 '25 06:09 twoeths

as far as I know, there is no epoch with 0 blocks on any networks (if yes please specify @matthewkeil )

on holesky this was the case, we had ~800 slots without any blocks in a row

nflaig avatar Sep 12 '25 09:09 nflaig

Its possible for a ByRange request to have 0 blocks in it during a period where chain liveness is an issue an there is an epoch with 0 blocks.

Exactly as @nflaig mentioned. In a period without liveness it's possible to have an epoch with 0 blocks

matthewkeil avatar Sep 12 '25 12:09 matthewkeil