
Historical block download doesn't realize it's finished

Open yorickdowne opened this issue 2 years ago • 10 comments

Description

After historical block download completes, it continues showing:

May 25 02:16:22.500 INFO Downloading historical blocks est_time: --, speed: 19.25 slots/sec, distance: 0 slots (--), service: slot_notifier

And /eth/v1/node/health returns 206.

After restarting Lighthouse, the endpoint returns 200 and the historical block log line no longer appears.

This is with the execution client not yet done syncing, while Lighthouse is done. I'm unsure whether it's related to execution sync state at all.
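For reference, a minimal sketch of polling that endpoint from a monitoring script, assuming the default Lighthouse HTTP port 5052 and the `reqwest` crate with its "blocking" feature enabled (per the standard beacon node API, 200 means ready and 206 means syncing):

```rust
// Poll the standard beacon node health endpoint. This is a monitoring
// helper sketch, not part of Lighthouse itself.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "http://localhost:5052/eth/v1/node/health"; // default Lighthouse HTTP port
    let status = reqwest::blocking::get(url)?.status().as_u16();
    match status {
        200 => println!("node reports synced"),
        206 => println!("node reports syncing (e.g. backfill still in progress)"),
        other => println!("unexpected status: {other}"),
    }
    Ok(())
}
```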

Version

4.2.0

yorickdowne avatar May 25 '23 02:05 yorickdowne

Thanks for the report.

Will be fixed in the next release :)

AgeManning avatar May 25 '23 04:05 AgeManning

Unfortunately this didn't make it into our v4.3.0 release :( We're being pressured to get v4.3.0 out for a Gnosis upgrade and we just couldn't get this one done in time. We're keen to get it fixed in the following release, though!

paulhauner avatar Jun 30 '23 05:06 paulhauner

Closed by #4465

jmcph4 avatar Jul 07 '23 04:07 jmcph4

@jmcph4 @michaelsproul I'm observing the same behavior as documented in #4346 with 4.4.1 after a fresh checkpoint sync:

Sep 12 06:59:05 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:05.332 ERRO Backfill sync failed                    error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:05 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:05.332 INFO Sync state updated                      new_state: Synced, old_state: Syncing Historical Blocks, service: sync
Sep 12 06:59:05 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:05.809 INFO Sync state updated                      new_state: Syncing Historical Blocks, old_state: Synced, service: sync
Sep 12 06:59:11 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:11.828 INFO New block received                      root: 0x5b9add3620e8d053d78f271745c09fe31e9f9dc547d63979f4f24f09c5889742, slot: 7305894
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.001 INFO Synced                                  slot: 7305894, block: 0x5b9a…9742, epoch: 228309, finalized_epoch: 228307, finalized_root: 0x4ef8…9bd4, exec_hash: 0xfb04…701f (verified), peers: 88, service: slot_notifier
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.321 ERRO Backfill out of sync with beacon chain  service: sync
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.321 ERRO Backfill sync failed                    error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:17 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:17.321 INFO Sync state updated                      new_state: Synced, old_state: Syncing Historical Blocks, service: sync
Sep 12 06:59:24 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:24.621 INFO New block received                      root: 0x7fc1263c50f2e0bd8ea4af84f23cf52e95802f8acca65b65c00985b8105bf415, slot: 7305895
Sep 12 06:59:29 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:29.001 INFO Historical block download complete      service: slot_notifier
Sep 12 06:59:29 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:29.001 INFO Synced                                  slot: 7305895, block: 0x7fc1…f415, epoch: 228309, finalized_epoch: 228307, finalized_root: 0x4ef8…9bd4, exec_hash: 0x056d…1af3 (verified), peers: 88, service: slot_notifier
Sep 12 06:59:30 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:30.907 INFO Sync state updated                      new_state: Syncing Historical Blocks, old_state: Synced, service: sync
Sep 12 06:59:37 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:37.267 INFO New block received                      root: 0xd1669139a3b6e85943a6c17a7f7d0a4115c750e9e37c9437823f0356c93ebb22, slot: 7305896
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.001 INFO Downloading historical blocks           est_time: --, distance: 1 slots (0 mins), service: slot_notifier
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.001 INFO Synced                                  slot: 7305896, block: 0xd166…bb22, epoch: 228309, finalized_epoch: 228307, finalized_root: 0x4ef8…9bd4, exec_hash: 0xbcc0…61f1 (verified), peers: 86, service: slot_notifier
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.316 ERRO Backfill out of sync with beacon chain  service: sync
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.316 ERRO Backfill sync failed                    error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.317 ERRO Backfill sync failed                    error: InvalidSyncState("Batch not found for current processing target 195168"), service: sync
Sep 12 06:59:41 v2202209182301199785 lighthouse[3450]: Sep 12 04:59:41.317 INFO Sync state updated                      new_state: Synced, old_state: Syncing Historical Blocks, service: sync

Do you have any recommendations on how to get unstuck? Thanks!

elsesiy avatar Sep 12 '23 05:09 elsesiy

Do you have any recommendations on how to get unstuck? Thanks!

Restarting the node fixed it previously; try that.

Would you also mind sharing some debug logs so we can investigate and fix the root cause? You can find debug logs in $datadir/beacon/logs/beacon.log. If you compress those and DM them to me on Discord (@sproul) or send them via email (michael at sigmaprime.io) that would be great. Thanks!

michaelsproul avatar Sep 12 '23 05:09 michaelsproul

I'm reopening this because there are still regular reports of users hitting this with Lighthouse v4.5.0.

I think a good way to debug it would be to allow artificially shortening the backfill distance, so that we can quickly iterate on the completion of backfill sync. I suspect this will repro the bug quite frequently.
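A rough sketch of what such a debug knob could look like (the function and its `debug_limit` parameter are hypothetical, standing in for an invented CLI flag, not an existing Lighthouse option):

```rust
// Hypothetical sketch: clamp how far backfill runs so its completion path
// can be exercised quickly. `debug_limit` stands in for an invented flag;
// it is not an existing Lighthouse option.
fn backfill_target_epoch(checkpoint_epoch: u64, genesis_epoch: u64, debug_limit: Option<u64>) -> u64 {
    match debug_limit {
        // Stop only `limit` epochs below the checkpoint instead of at genesis,
        // so backfill finishes within minutes instead of hours.
        Some(limit) => checkpoint_epoch.saturating_sub(limit).max(genesis_epoch),
        None => genesis_epoch,
    }
}

fn main() {
    // Backfill that would normally cover ~228k epochs finishes after just 16.
    assert_eq!(backfill_target_epoch(228_309, 0, Some(16)), 228_293);
    println!("shortened target: {}", backfill_target_epoch(228_309, 0, Some(16)));
}
```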

Logs from a user on Discord:

Dec 11 06:53:56.867 ERRO Backfill out of sync with beacon chain, service: sync, module: network::sync::backfill_sync:1151
Dec 11 06:53:56.867 ERRO Backfill sync failed, error: InvalidSyncState("Batch not found for current processing target 206898"), service: sync, module: network::sync::backfill_sync:492
Dec 11 06:53:56.867 ERRO Backfill sync failed, error: InvalidSyncState("Batch not found for current processing target 206898"), service: sync, module: network::sync::manager:573

michaelsproul avatar Dec 12 '23 04:12 michaelsproul

Unfortunately I don't have logs from this user triggering the bug in the first place, I got these logs to debug another issue.

michaelsproul avatar Dec 12 '23 04:12 michaelsproul

Saw this on v5.0.0-b5bae6e on Goerli. A restart resolved it.

consensus-1  | Mar 06 11:10:42.002 ERRO Backfill out of sync with beacon chain  service: sync
consensus-1  | Mar 06 11:10:42.002 ERRO Backfill sync failed                    error: InvalidSyncState("Batch not found for current processing target 209216"), service: sync
consensus-1  | Mar 06 11:10:42.002 ERRO Backfill sync failed                    error: InvalidSyncState("Batch not found for current processing target 209216"), service: sync

yorickdowne avatar Mar 06 '24 11:03 yorickdowne

If someone wants to work on this: the first problem is that a batch can be missing its epoch boundary block. The node believes it has synced back to slot x in epoch Y because epoch Y belongs to a downloaded batch, but the peer never actually provided that block. This is the issue behind "Backfill out of sync with beacon chain".

The second issue is that, in trying to recover from being "out of sync", backfill gets reset to a different epoch. That epoch is not calculated correctly, because it does not align with the existing batch boundaries. Say batches cover epochs 10, 8, 6 and the reset lands on epoch 7 or 5; this is the "Batch not found" error.
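A toy model of that misalignment (names are illustrative, not Lighthouse's actual types): batch keys descend from the backfill start in fixed steps, so a reset epoch off that grid never matches a stored batch.

```rust
use std::collections::HashMap;

// Pre-Deneb, batches spanned two epochs; keys lie on a grid anchored at the
// backfill start: 10, 8, 6, ...
const EPOCHS_PER_BATCH: u64 = 2;

fn main() {
    let backfill_start: u64 = 10;
    let batches: HashMap<u64, &str> = (0..3)
        .map(|i| (backfill_start - i * EPOCHS_PER_BATCH, "batch data"))
        .collect();

    for reset_epoch in [7u64, 5] {
        // A reset that uses an off-grid epoch as the lookup key reproduces
        // `InvalidSyncState("Batch not found for current processing target ...")`.
        match batches.get(&reset_epoch) {
            Some(_) => println!("epoch {reset_epoch}: batch found"),
            None => println!("epoch {reset_epoch}: batch not found (misaligned reset)"),
        }
        // One way to snap the reset epoch back onto the grid:
        let aligned =
            backfill_start - ((backfill_start - reset_epoch) / EPOCHS_PER_BATCH) * EPOCHS_PER_BATCH;
        println!("  aligned key would be {aligned}");
    }
}
```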

Also, post-Deneb, single-epoch batches will most likely mask the issue (the second one, at least).

Due to personal life circumstances I don't have the time to tackle this myself.

divagant-martian avatar Mar 06 '24 14:03 divagant-martian

If someone wants to work on this: the first problem is that a batch can be missing its epoch boundary block. The node believes it has synced back to slot x in epoch Y because epoch Y belongs to a downloaded batch, but the peer never actually provided that block. This is the issue behind "Backfill out of sync with beacon chain".

Hmm, we cannot know whether the epoch boundary block is missing because that slot is genuinely empty in the canonical chain, or because the peer that sent the last batch omitted it, right? The (IMO) correct behaviour would therefore be to keep requesting batches until we find the parent block of the last received block (proving that the slot we are backfilling to is actually empty), or to determine that we should re-download the last batch, as it might be missing that block.
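A sketch of that connectivity check (hypothetical types, not Lighthouse's actual ones): a batch only provably extends the chain downwards if its newest block is the parent of the oldest block we already hold.

```rust
struct BlockSummary {
    root: [u8; 32],
    parent_root: [u8; 32],
}

/// `batch` is sorted by slot, ascending; `oldest_known` is the earliest block
/// we held before this batch arrived.
fn batch_connects(batch: &[BlockSummary], oldest_known: &BlockSummary) -> bool {
    match batch.last() {
        // The batch connects only if our oldest block points at its newest one;
        // otherwise we cannot distinguish a genuinely empty boundary slot from
        // a block the serving peer omitted.
        Some(newest_in_batch) => newest_in_batch.root == oldest_known.parent_root,
        // An empty batch proves nothing on its own: keep requesting earlier
        // ranges (or re-request from another peer) until the parent is found.
        None => false,
    }
}

fn main() {
    let parent = BlockSummary { root: [1; 32], parent_root: [0; 32] };
    let child = BlockSummary { root: [2; 32], parent_root: [1; 32] };
    // The single-block batch [parent] connects to the block we already hold.
    assert!(batch_connects(&[parent], &child));
}
```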

I considered accepting that the last batch might be missing blocks as an alternative, but I don't think that is a good idea: it would mean the sync accepts any number of empty batches at the end of the process.

If my findings and conclusions are correct, I would like to attempt to fix this :)

The second issue is that, in trying to recover from being "out of sync", backfill gets reset to a different epoch. That epoch is not calculated correctly, because it does not align with the existing batch boundaries. Say batches cover epochs 10, 8, 6 and the reset lands on epoch 7 or 5; this is the "Batch not found" error.

Also, post-Deneb, single-epoch batches will most likely mask the issue (the second one, at least).

@yorickdowne's latest example seems to come from post-Deneb Goerli, so I am not sure that this is the actual cause, but I have not found an alternative explanation yet.

dknopik avatar Mar 31 '24 23:03 dknopik