
Exec Layer node getLogs requests stop after EL restart

Open jakubgs opened this issue 11 months ago • 8 comments

Describe the bug
While working on a Geth upgrade for a production Beacon Node running v24.2.0-742f15, weird behavior was observed when the Geth node was restarted.

The getLogs requests from the beacon node stopped:

(screenshot)

And the eth1_chain_len metric stopped rising:

(screenshot)

When the BN was restarted, the getLogs requests resumed. This behavior was reproduced when Geth was restarted again:

(screenshot)

Additional context
Logs from the fleet can be provided upon request.

jakubgs · Mar 08 '24 12:03

This node is outdated and must be updated to Nimbus >= v24.2.1 before Deneb (5-day deadline).

etan-status · Mar 08 '24 12:03

Good point. We just finished upgrading EL nodes, will do BN next.

jakubgs · Mar 08 '24 12:03

Searching the logs, Failed to obtain the latest block from the EL is only logged once. After that, there are still periodic syncEth1Chain tick entries that start new eth_getBlockByNumber('latest', false) requests, but these subsequent requests never seem to finish, time out, or fail.

    debug "syncEth1Chain tick"

    if bnStatus == BeaconNodeStatus.Stopping:
      await m.stop()
      return

    if m.eth1Chain.hasConsensusViolation:
      raise newException(CorruptDataProvider, "Eth1 chain contradicts Eth2 consensus")

    let latestBlock = try:
      raiseIfNil connection.trackedRequestWithTimeout(
        "getBlockByNumber",
        rpcClient.eth_getBlockByNumber(blockId("latest"), false),
        web3RequestsTimeout)
    except CatchableError as err:
      warn "Failed to obtain the latest block from the EL", err = err.msg
      raise err
  • bnStatus == BeaconNodeStatus.Stopping cannot be the cause, because it is only set on exit
  • m.eth1Chain.hasConsensusViolation cannot be the cause, because every location that sets it to true also produces a unique log line
  • connection.trackedRequestWithTimeout is still repeatedly hit, as shown by the Sending message to RPC server log line immediately after each syncEth1Chain tick log

Maybe something weird in trackedRequestWithTimeout makes it get stuck, possibly related to:

  • https://github.com/status-im/nim-chronos/issues/516

etan-status · Mar 08 '24 13:03

Do you have the eth1_latest_head metric from that time as well? And also eth1_synced_head?

etan-status · Mar 08 '24 13:03

Extended the logs here as well, to get a better understanding of what's going on:

  • https://github.com/status-im/nimbus-eth2/pull/6049

etan-status · Mar 08 '24 13:03

Do you have eth1_latest_head metric from that time as well?

Not on the dashboard, but it can be generated:

(screenshot)

And also eth1_synced_head

(screenshot)

jakubgs · Mar 08 '24 14:03

Multi-EL is documented officially without saying it's unstable/beta: https://nimbus.guide/eth1.html#running-multiple-execution-clients
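For reference, a sketch of such a setup, assuming the --el and --jwt-secret options of recent nimbus_beacon_node builds; flag names, endpoints, and paths are placeholders and should be verified against the linked guide and your Nimbus version:

```shell
# Hypothetical multi-EL invocation (illustrative only).
# Each --el adds one execution-layer connection; the beacon node can
# use the remaining ELs if one becomes unavailable.
nimbus_beacon_node \
  --network=mainnet \
  --el=http://127.0.0.1:8551 \
  --el=http://10.0.0.2:8551 \
  --jwt-secret=/path/to/jwt.hex
```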

@etan-status, @tersec, @jakubgs, maybe we can try running multi-EL again?

BTW, we are running multi-EL on nimbus.mainnet: https://github.com/status-im/infra-nimbus/blob/42cfc2958a39cc6ec316f1f46d45189cf9ff995f/ansible/group_vars/nimbus.mainnet.yml#L81

yakimant · Aug 20 '24 10:08

As far as I know multi-EL setups work fine on Holesky, but I didn't research it that closely. The question is more for Nimbus devs.

jakubgs · Aug 21 '24 17:08