Historical block download doesn't realize it's finished
Description
After historical block download completes, it continues showing:
May 25 02:16:22.500 INFO Downloading historical blocks est_time: --, speed: 19.25 slots/sec, distance: 0 slots (--), service: slot_notifier
And /eth/v1/node/health returns 206.
After restarting Lighthouse, it returns 200, and the historical block log no longer shows.
This happened while the execution client was still syncing, even though Lighthouse itself was done. I'm unsure whether it's related to the execution sync state at all.
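For anyone else hitting this, the stuck state is easy to watch for by polling the health endpoint. A minimal sketch (mine, not from Lighthouse; assumes the HTTP API is enabled on its default port 5052 and the `reqwest` crate with the blocking feature):

```rust
// Minimal sketch: poll the standard beacon node API health endpoint.
// Assumes the HTTP API is enabled on the default port 5052 and that the
// `reqwest` crate (with the "blocking" feature) is available.
use std::{thread, time::Duration};

fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    loop {
        // Per the standard Eth Beacon Node API: 200 = synced,
        // 206 = still syncing (this is what stays stuck here).
        let status = client
            .get("http://localhost:5052/eth/v1/node/health")
            .send()?
            .status()
            .as_u16();
        println!("/eth/v1/node/health -> {status}");
        thread::sleep(Duration::from_secs(12)); // one mainnet slot
    }
}
```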
Version
4.2.0
Thanks for the report.
Will be fixed in the next release :)
Unfortunately this didn't make it into our v4.3.0 release :( We're being pressured to get v4.3.0 out for a Gnosis upgrade and we just couldn't get this one done in time. We're keen to get it fixed in the following release, though!
Closed by #4465
@jmcph4 @michaelsproul I'm observing the same behavior as documented in #4346 with 4.4.1 after a fresh checkpoint sync:
Sep 12 06:59:05 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:05.332 ERRO Backfill sync failed error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:05 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:05.332 INFO Sync state updated new_state: Synced, old_state: Syncing Historical Blocks, service: sync
Sep 12 06:59:05 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:05.809 INFO Sync state updated new_state: Syncing Historical Blocks, old_state: Synced, service: sync
Sep 12 06:59:11 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:11.828 INFO New block received root: 0x5b9add3620e8d053d78f271745c09fe31e9f9dc547d63979f4f24f09c5889742, slot: 7305894
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.001 INFO Synced slot: 7305894, block: 0x5b9a…9742, epoch: 228309, finalized_epoch: 228307, finalized_root: 0x4ef8…9bd4, exec_hash: 0xfb04…701f (verified), peers: 88, service: slot_notifier
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.321 ERRO Backfill out of sync with beacon chain service: sync
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.321 ERRO Backfill sync failed error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.321 INFO Sync state updated new_state: Synced, old_state: Syncing Historical Blocks, service: sync
Sep 12 06:59:24 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:24.621 INFO New block received root: 0x7fc1263c50f2e0bd8ea4af84f23cf52e95802f8acca65b65c00985b8105bf415, slot: 7305895
Sep 12 06:59:29 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:29.001 INFO Historical block download complete service: slot_notifier
Sep 12 06:59:29 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:29.001 INFO Synced slot: 7305895, block: 0x7fc1…f415, epoch: 228309, finalized_epoch: 228307, finalized_root: 0x4ef8…9bd4, exec_hash: 0x056d…1af3 (verified), peers: 88, service: slot_notifier
Sep 12 06:59:30 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:30.907 INFO Sync state updated new_state: Syncing Historical Blocks, old_state: Synced, service: sync
Sep 12 06:59:37 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:37.267 INFO New block received root: 0xd1669139a3b6e85943a6c17a7f7d0a4115c750e9e37c9437823f0356c93ebb22, slot: 7305896
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.001 INFO Downloading historical blocks est_time: --, distance: 1 slots (0 mins), service: slot_notifier
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.001 INFO Synced slot: 7305896, block: 0xd166…bb22, epoch: 228309, finalized_epoch: 228307, finalized_root: 0x4ef8…9bd4, exec_hash: 0xbcc0…61f1 (verified), peers: 86, service: slot_notifier
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.316 ERRO Backfill out of sync with beacon chain service: sync
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.316 ERRO Backfill sync failed error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.317 ERRO Backfill sync failed error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.317 INFO Sync state updated new_state: Synced, old_state: Syncing Historical Blocks, service: sync
Do you have any recommendations on how to get unstuck? Thanks!
> Do you have any recommendations on how to get unstuck? Thanks!
Restarting the node fixed it previously, so try that.
Would you also mind sharing some debug logs so we can investigate and fix the root cause? You can find debug logs in $datadir/beacon/logs/beacon.log. If you compress those and DM them to me on Discord (@sproul) or send them via email (michael at sigmaprime.io) that would be great. Thanks!
I'm reopening this because there are still regular reports of users hitting this with Lighthouse v4.5.0.
I think a good way to debug it would be to allow artificially shortening the backfill distance, so that we can quickly iterate on the completion of backfill sync. I suspect this will repro the bug quite frequently.
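To be concrete, I imagine something like this (entirely hypothetical; the flag and names are invented for illustration and are not existing Lighthouse options):

```rust
// Entirely hypothetical sketch: `debug_backfill_epochs` (and a flag like
// `--debug-backfill-epochs` feeding it) are invented for illustration; they
// are not existing Lighthouse options. The idea: clamp the backfill target
// a few epochs behind the checkpoint anchor so the "backfill complete"
// transition is reached within minutes of a checkpoint sync.
fn backfill_target_epoch(anchor_epoch: u64, debug_backfill_epochs: Option<u64>) -> u64 {
    match debug_backfill_epochs {
        Some(n) => anchor_epoch.saturating_sub(n), // e.g. n = 8: finish quickly
        None => 0,                                 // normal behaviour: back to genesis
    }
}

fn main() {
    // A node anchored at epoch 228_307 with the hypothetical flag set to 8
    // would stop backfilling at epoch 228_299.
    assert_eq!(backfill_target_epoch(228_307, Some(8)), 228_299);
    assert_eq!(backfill_target_epoch(228_307, None), 0);
}
```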
Logs from a user on Discord:
Dec 11 06:53:56.867 ERRO Backfill out of sync with beacon chain, service: sync, module: network::sync::backfill_sync:1151
Dec 11 06:53:56.867 ERRO Backfill sync failed, error: InvalidSyncState("Batch not found for current processing target 206898"), service: sync, module: network::sync::backfill_sync:492
Dec 11 06:53:56.867 ERRO Backfill sync failed, error: InvalidSyncState("Batch not found for current processing target 206898"), service: sync, module: network::sync::manager:573
Unfortunately I don't have logs from this user triggering the bug in the first place; these logs were collected to debug another issue.
Saw this on v5.0.0-b5bae6e on Goerli. A restart resolved it.
consensus-1 | Mar 06 11:10:42.002 ERRO Backfill out of sync with beacon chain service: sync
consensus-1 | Mar 06 11:10:42.002 ERRO Backfill sync failed error: InvalidSyncState("Batch not found for current processing target 209216"), service: sync
consensus-1 | Mar 06 11:10:42.002 ERRO Backfill sync failed error: InvalidSyncState("Batch not found for current processing target 209216"), service: sync
If someone wants to work on this: the problem is that a batch can be missing its epoch boundary blocks. Backfill believes it has blocks back to slot x, which belongs to epoch Y, since epoch Y is covered by a downloaded batch; but it doesn't, because the peer never provided that block. This is the "out of sync" issue.
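To illustrate the mismatch with a toy model (assumed semantics, not actual Lighthouse code):

```rust
// Toy model of the mismatch (assumed semantics, not actual Lighthouse code).
// Backfill records progress per *batch* (an epoch range), but a peer may
// legally omit blocks, so "processed the batch for epoch Y" does not imply
// "we hold a block at epoch Y's boundary".
const SLOTS_PER_EPOCH: u64 = 32;

struct BackfillProgress {
    /// What backfill believes: "we hold everything back to this slot".
    assumed_oldest_slot: u64,
    /// What the store actually holds after importing the batch's blocks.
    actual_oldest_block_slot: u64,
}

fn import_batch(batch_start_epoch: u64, received_block_slots: &[u64]) -> BackfillProgress {
    BackfillProgress {
        // Progress is advanced to the batch boundary unconditionally...
        assumed_oldest_slot: batch_start_epoch * SLOTS_PER_EPOCH,
        // ...but the peer may not have sent a block at (or near) that slot.
        actual_oldest_block_slot: received_block_slots.iter().copied().min().unwrap_or(u64::MAX),
    }
}

fn main() {
    // Batch for epoch 100 (slots 3200..3232), but the peer omitted the
    // blocks in the first few slots of the epoch.
    let p = import_batch(100, &[3207, 3215, 3230]);
    // Backfill now believes it covers down to slot 3200, while the chain
    // only verifiably extends to 3207 -> "Backfill out of sync".
    assert!(p.actual_oldest_block_slot > p.assumed_oldest_slot);
}
```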
The second issue is that, in trying to resolve this "out of sync" state, backfill gets reset to a different epoch. However, the reset epoch is not calculated correctly, since it does not align with the previous batches: say batches start at epochs 10, 8, 6 and the target gets reset to 7 or 5. This is the "batch not found" error.
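A toy sketch of the alignment constraint (again, assumed semantics): with 2-epoch batches anchored at epoch 10, valid batch start epochs are 10, 8, 6, ..., so a reset target has to be snapped onto that grid:

```rust
// Toy sketch of the alignment constraint (assumed semantics). Resetting
// the processing target to an epoch that is off the batch grid means no
// batch with that start epoch exists, hence
// InvalidSyncState("Batch not found for current processing target ...").
const EPOCHS_PER_BATCH: u64 = 2;

/// Snap a desired reset epoch onto the batch grid defined by `anchor_epoch`.
/// Assumes `desired_epoch <= anchor_epoch` (backfill only moves backwards).
fn align_to_batch_grid(anchor_epoch: u64, desired_epoch: u64) -> u64 {
    let off_grid = (anchor_epoch - desired_epoch) % EPOCHS_PER_BATCH;
    // Round up towards the anchor: re-processing a little more is safe.
    desired_epoch + off_grid
}

fn main() {
    // Anchor at epoch 10: batches start at 10, 8, 6, ...
    assert_eq!(align_to_batch_grid(10, 7), 8); // 7 is off-grid, snap to 8
    assert_eq!(align_to_batch_grid(10, 6), 6); // 6 is already on the grid
}
```

Snapping towards the anchor re-processes slightly more data, which is the safe direction.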
Also, post-Deneb, having single-epoch batches will most likely mask the issue (the second one, at least).
Due to personal life circumstances I don't have the time to tackle this myself.
> If someone wants to work on this: the problem is that a batch can be missing its epoch boundary blocks. Backfill believes it has blocks back to slot x, which belongs to epoch Y, since epoch Y is covered by a downloaded batch; but it doesn't, because the peer never provided that block. This is the "out of sync" issue.
Hmm, we cannot know whether the epoch boundary block is missing because that slot is actually empty on the canonical chain, or because the peer that sent the last batch omitted it, right? That would mean that the (IMO) correct behaviour here would be to continue requesting batches until we find the parent block of the last received block (proving that the slot we are backfilling to is actually empty), or to determine that we should redownload the last received batch, as it might be missing that block.
I considered accepting that the last batch might be missing blocks as an alternative, but I don't think that is a good idea, as it would effectively mean that sync accepts any number of empty batches at the end of the process.
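If I understand my own proposal correctly, the completion check would look roughly like this (types heavily simplified, not Lighthouse's actual ones):

```rust
// Rough sketch of the proposed completion check (types heavily simplified,
// not Lighthouse's actual ones): only declare backfill complete once the
// oldest imported block links via `parent_root` to the block we are
// backfilling towards, so an omitted boundary block can never be mistaken
// for genuinely empty slots.
#[derive(Clone, Copy, PartialEq, Eq)]
struct Hash256([u8; 32]);

struct SignedBlock {
    slot: u64,
    parent_root: Hash256,
}

enum BatchVerdict {
    /// Oldest block's parent is the backfill target: provably complete;
    /// any skipped slots in between are genuinely empty.
    Complete,
    /// No link to the target yet: request earlier batches, or redownload
    /// this batch from another peer in case it withheld the boundary block.
    KeepDownloading,
}

fn verify_tail(oldest_block: &SignedBlock, backfill_target_root: Hash256) -> BatchVerdict {
    if oldest_block.parent_root == backfill_target_root {
        BatchVerdict::Complete
    } else {
        BatchVerdict::KeepDownloading
    }
}

fn main() {
    let target = Hash256([0u8; 32]);
    let oldest = SignedBlock { slot: 32, parent_root: target };
    assert!(matches!(verify_tail(&oldest, target), BatchVerdict::Complete));
}
```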
In case my findings and conclusions are correct, I would like to attempt to fix this :)
> The second issue is that, in trying to resolve this "out of sync" state, backfill gets reset to a different epoch. However, the reset epoch is not calculated correctly, since it does not align with the previous batches: say batches start at epochs 10, 8, 6 and the target gets reset to 7 or 5. This is the "batch not found" error.
>
> Also, post-Deneb, having single-epoch batches will most likely mask the issue (the second one, at least).
@yorickdowne's latest example seems to come from post-deneb Goerli, so I am not sure if that is the actual cause, but I have not found an alternative explanation yet.