
Sync stuck on execution engine delayed restart

karalabe opened this issue on Jul 29, 2022 · 5 comments

Description

I had a Geth + Lighthouse combo running (in sync) on the Sepolia testnet. I stopped Geth for a couple of hours but left Lighthouse running (and printing errors). A few hours later I restarted Geth, but Lighthouse (or the combo) seems to be stuck in a bad state: neither progresses, and I assume each is waiting for the other?

On the Geth side, we have a clean startup:

$ geth --sepolia console
INFO [07-29|19:01:55.695] Starting Geth on Sepolia testnet... 
[...]
INFO [07-29|19:01:55.914] Loaded most recent local header          number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local full block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local fast block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local finalized block number=1,599,146 hash=e7c859..adfb33 td=17,000,018,015,853,232 age=2h53m7s
[...]
// And stuck here apart from occasional messages; logging the engine API calls, I only see the occasional transition config exchange call.

On the Lighthouse side, continuing from a previous stream of errors due to Geth being offline:

Jul 29 16:01:46.284 ERRO Error during execution engine upcheck   error: Reqwest(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8551), path: "/", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })) }), service: exec
Jul 29 16:01:54.001 INFO Searching for peers                     current_slot: 281409, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:01:57.308 INFO Execution engine online                 service: exec
Jul 29 16:01:57.309 INFO Issuing forkchoiceUpdated               forkchoice_state: ForkChoiceState { head_block_hash: ExecutionBlockHash(0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27), safe_block_hash: ExecutionBlockHash(0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d), finalized_block_hash: ExecutionBlockHash(0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33) }, service: exec
Jul 29 16:02:06.001 INFO Searching for peers                     current_slot: 281410, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:18.000 INFO Searching for peers                     current_slot: 281411, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:30.000 INFO Searching for peers                     current_slot: 281412, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:42.001 INFO Searching for peers                     current_slot: 281413, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:54.001 INFO Searching for peers                     current_slot: 281414, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:03:06.001 INFO Searching for peers                     current_slot: 281415, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:03:18.001 INFO Searching for peers                     current_slot: 281416, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 3, service: slot_notifier
Jul 29 16:03:30.001 INFO Searching for peers
[...]
// Repeating these indefinitely, with progressing slot numbers, and the occasional

Jul 29 17:13:55.829 INFO Sync state updated                      new_state: Evaluating known peers, old_state: Stalled, service: sync
Jul 29 17:13:55.829 INFO Sync state updated                      new_state: Stalled, old_state: Evaluating known peers, service: sync

You can find the full stream of Geth and Lighthouse logs across various restarts at https://gist.github.com/karalabe/6a87ebc2c2f525e2690f13e74d9c311b

Version

stable

Present Behaviour

Based on the logs, the only actual update ever sent to the execution engine seems to be this one:

Jul 29 16:01:57.309 INFO Issuing forkchoiceUpdated               forkchoice_state: ForkChoiceState { head_block_hash: ExecutionBlockHash(0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27), safe_block_hash: ExecutionBlockHash(0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d), finalized_block_hash: ExecutionBlockHash(0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33) }, service: exec

This forkchoice update is a no-op for Geth, since it's already on this exact block, as signaled by the startup logs:

INFO [07-29|19:01:55.914] Loaded most recent local header          number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local full block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local fast block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local finalized block number=1,599,146 hash=e7c859..adfb33 td=17,000,018,015,853,232 age=2h53m7s

No further NewPayload or ForkchoiceUpdated calls are sent from Lighthouse to Geth, so Geth cannot do anything. Lighthouse does not seem to want to do anything either, apparently waiting for Geth to act?
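
For reference, the logged ForkChoiceState corresponds on the wire to an engine_forkchoiceUpdatedV1 JSON-RPC call against Geth's authenticated Engine API. Below is a minimal sketch (not Lighthouse code) of replaying that same call by hand; it assumes the PyJWT and requests packages, and the JWT secret path is a placeholder for wherever your node keeps it:

# Minimal sketch, not Lighthouse code: replay the logged forkchoiceUpdated call by
# hand against Geth's authenticated Engine API. Assumes PyJWT and requests are
# installed; JWT_SECRET_PATH is a placeholder.
import time
import jwt        # PyJWT
import requests

ENGINE_URL = "http://127.0.0.1:8551"       # Geth's authenticated Engine API endpoint
JWT_SECRET_PATH = "/path/to/jwtsecret"     # placeholder path

def engine_call(method, params):
    secret = bytes.fromhex(open(JWT_SECRET_PATH).read().strip().removeprefix("0x"))
    token = jwt.encode({"iat": int(time.time())}, secret, algorithm="HS256")
    resp = requests.post(
        ENGINE_URL,
        json={"jsonrpc": "2.0", "id": 1, "method": method, "params": params},
        headers={"Authorization": f"Bearer {token}"},
    )
    return resp.json()

# The exact forkchoice state Lighthouse logged at 16:01:57.309:
state = {
    "headBlockHash":      "0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27",
    "safeBlockHash":      "0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d",
    "finalizedBlockHash": "0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33",
}
# Geth is already on this exact block, so the expected payloadStatus is VALID and
# nothing moves: the call is a no-op, matching the behaviour described above.
print(engine_call("engine_forkchoiceUpdatedV1", [state, None]))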

Expected Behaviour

Lighthouse should detect that Geth and itself are in sync and should send the next block to Geth.
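
Roughly, once the two are in sync, the per-block exchange the consensus client drives looks like the following sketch (it reuses the hypothetical engine_call helper from the sketch above; the payload fields are placeholders, not a real Sepolia block):

# Sketch of the per-block Engine API sequence a consensus client normally drives
# once it is importing new beacon blocks. Placeholder values, not real Sepolia data.
execution_payload = {
    "parentHash": "0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27",
    "blockHash":  "0x0000000000000000000000000000000000000000000000000000000000000000",
    # ... remaining ExecutionPayloadV1 fields elided ...
}

# 1. Hand the execution payload of the next block to the EL for validation/import.
engine_call("engine_newPayloadV1", [execution_payload])

# 2. Advance the fork choice so the EL adopts that block as its new head.
engine_call("engine_forkchoiceUpdatedV1", [{
    "headBlockHash":      execution_payload["blockHash"],
    "safeBlockHash":      "0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d",
    "finalizedBlockHash": "0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33",
}, None])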

Steps to resolve

A) I left it in this state for a few hours; it did not resolve by itself.

B) I restarted Geth after a number of hours; Lighthouse now rejects Geth as being too far behind (well duh, it was waiting for you, dumdum):

Jul 29 17:23:19.103 CRIT Could not connect to a suitable eth1 node. Please ensure that you have an eth1 http server running locally on http://localhost:8545 or specify one or more (remote) endpoints using `--eth1-endpoints <COMMA-SEPARATED-SERVER-ADDRESSES>`. Also ensure that `eth` and `net` apis are enabled on the eth1 http server, warning: BLOCK PROPOSALS WILL FAIL WITHOUT VALID, SYNCED ETH1 CONNECTION, service: eth1_rpc
Jul 29 17:23:19.103 ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8551/, auth=true => EndpointError(FarBehind)", retry_millis: 60000, service: eth1_rpc

C) Restarting Lighthouse starts sync properly:

Jul 29 17:25:58.323 INFO Sync state updated                      new_state: Syncing Finalized Chain, old_state: Stalled, service: sync
Jul 29 17:26:06.001 INFO Syncing                                 est_time: --, distance: 966 slots (3 hrs 13 mins), peers: 2, service: slot_notifier

i.e. it seems Lighthouse went into some unnatural state, since after a restart it instantly started syncing without issues against the same Geth execution client that it had connected to multiple times before and always rejected.

karalabe · Jul 29 '22 17:07

@karalabe We shouldn't be losing peers even if the execution engine is offline. Could you share the debug logs for Lighthouse as well? They should be in $datadir/beacon/logs/beacon.log

pawanjay176 · Jul 29 '22 19:07

Okay, I am able to reproduce the issue locally; I will check it out.

pawanjay176 · Jul 29 '22 20:07

Thanks for the report! The issue is that we are trying to sync even when the execution layer is offline, which leads to a bunch of peers getting banned. We had a PR that removed all peer scoring on execution engine errors, but it looks like we didn't cover all the cases.

We ideally shouldn't be syncing when the execution layer is offline; we have a WIP PR that is trying to fix that (https://github.com/sigp/lighthouse/pull/3094), which was blocked on some other issues. I think there are no more blockers for that PR now, so we should hopefully have that fix out soon.
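
As an illustration only (not Lighthouse's actual implementation), the idea is to gate range sync on execution-engine availability, so batches are neither processed nor failed, and peers never penalised, while the EE is offline. In the sketch below, ee_is_online, next_batch and process_batch are hypothetical callables standing in for the engine upcheck, the sync batch queue and block import:

# Illustrative sketch only, not Lighthouse's implementation.
import time

def sync_loop(ee_is_online, next_batch, process_batch):
    while True:
        if not ee_is_online():
            time.sleep(12)          # wait roughly a slot and re-check; don't fail the batch
            continue
        batch = next_batch()
        if batch is not None:
            process_batch(batch)    # only import when the EE can actually execute payloads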

pawanjay176 · Jul 29 '22 23:07

Thanks for reporting @karalabe and diagnosing @pawanjay176 :pray:

Lighthouse now rejects Geth as being too far behind

FYI, Lighthouse is not rejecting Geth as a CL here. It's failing to sync the deposit contract logs, which is a totally separate process from anything merge-related. The issue was LH losing all its peers due to mishandling of the offline EL; a restart allowed it to find peers again.


@pawanjay176 my understanding is that you fixed the peer downscoring issue in #3384 and we're tracking the "pause sync whilst EL offline" issue in #3032 and #3094. If that's the case then I think we're safe to close this.

paulhauner · Aug 01 '22 00:08

@paulhauner the peer downscoring issue is something specific to sync, where we ban all peers in a chain if batch processing of that range of blocks fails multiple times (which is what happened when the EE was offline for a while). #3384 fixed part of the problem, but not all of it.
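
Sketching the distinction (hypothetical code, not Lighthouse's actual implementation): a batch failure caused by invalid blocks justifies penalising the peers that served it, whereas a local execution-engine error says nothing about the peers and should only trigger a retry:

# Hypothetical sketch, not Lighthouse's actual code.
class InvalidBlock(Exception): ...
class ExecutionEngineError(Exception): ...

def handle_batch_failure(batch, peers, error, ban_peer, retry_batch):
    if isinstance(error, InvalidBlock):
        for peer in peers:
            ban_peer(peer)          # peers fed us bad blocks; downscoring is justified
    elif isinstance(error, ExecutionEngineError):
        retry_batch(batch)          # local EE problem; keep the peers and retry later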

I think it's okay to keep this issue open while we fix this in #3094 and other smaller PRs.

pawanjay176 · Aug 02 '22 02:08

This should be solved after #3439 and #3032. Feel free to reopen if the issue persists.

divagant-martian · Sep 07 '22 16:09