
Exec Layer node getLogs requests stop after EL restart

Open jakubgs opened this issue 11 months ago • 8 comments

Describe the bug
While working on a Geth upgrade for a production Beacon Node running v24.2.0-742f15, weird behavior was observed when the Geth node was restarted.

The getLogs requests from the beacon node stopped:

(screenshot)

And the eth1_chain_len metric stopped rising:

(screenshot)

When the BN was restarted, the getLogs requests resumed. This behavior was reproduced when Geth was restarted again:

(screenshot)

Additional context
Logs from the fleet can be provided upon request.

jakubgs · Mar 08 '24 12:03

This node is outdated and must be updated to Nimbus >= v24.2.1 before Deneb (5-day deadline).

etan-status · Mar 08 '24 12:03

Good point. We just finished upgrading EL nodes, will do BN next.

jakubgs · Mar 08 '24 12:03

Searching the logs, Failed to obtain the latest block from the EL is only logged once. After that, there are still periodic syncEth1Chain tick entries that start new eth_getBlockByNumber('latest', false) requests, but these subsequent requests never seem to finish, time out, or fail.

    debug "syncEth1Chain tick"

    if bnStatus == BeaconNodeStatus.Stopping:
      await m.stop()
      return

    if m.eth1Chain.hasConsensusViolation:
      raise newException(CorruptDataProvider, "Eth1 chain contradicts Eth2 consensus")

    let latestBlock = try:
      raiseIfNil connection.trackedRequestWithTimeout(
        "getBlockByNumber",
        rpcClient.eth_getBlockByNumber(blockId("latest"), false),
        web3RequestsTimeout)
    except CatchableError as err:
      warn "Failed to obtain the latest block from the EL", err = err.msg
      raise err
  • bnStatus == BeaconNodeStatus.Stopping cannot be the cause, because it is only set on exit
  • m.eth1Chain.hasConsensusViolation cannot be the cause, because every location that sets it to true also produces a unique log line
  • connection.trackedRequestWithTimeout is still repeatedly hit, as shown by the Sending message to RPC server log line immediately after each syncEth1Chain tick log

Maybe something weird in trackedRequestWithTimeout makes it get stuck, possibly related to:

  • https://github.com/status-im/nim-chronos/issues/516

etan-status · Mar 08 '24 13:03

Do you have the eth1_latest_head metric from that time as well? And also eth1_synced_head?

etan-status · Mar 08 '24 13:03

Extended the logs here as well, to get a better understanding of what's going on:

  • https://github.com/status-im/nimbus-eth2/pull/6049

etan-status · Mar 08 '24 13:03

Do you have eth1_latest_head metric from that time as well?

Not on the dashboard, but it can be generated:

(screenshot)

And also eth1_synced_head

(screenshot)

jakubgs · Mar 08 '24 14:03

Multi-EL is documented officially without saying it's unstable/beta: https://nimbus.guide/eth1.html#running-multiple-execution-clients
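For reference, a sketch of such a setup, assuming the --el and --jwt-secret options of recent nimbus_beacon_node builds; flag names, endpoints, and paths are placeholders and should be verified against the linked guide and your Nimbus version:

```shell
# Hypothetical multi-EL invocation (illustrative only).
# Each --el adds one execution-layer connection; the beacon node can
# use the remaining ELs if one becomes unavailable.
nimbus_beacon_node \
  --network=mainnet \
  --el=http://127.0.0.1:8551 \
  --el=http://10.0.0.2:8551 \
  --jwt-secret=/path/to/jwt.hex
```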

@etan-status, @tersec, @jakubgs, maybe we can try running multi-EL again?

BTW, we are running multi-EL on nimbus.mainnet: https://github.com/status-im/infra-nimbus/blob/42cfc2958a39cc6ec316f1f46d45189cf9ff995f/ansible/group_vars/nimbus.mainnet.yml#L81

yakimant · Aug 20 '24 10:08

As far as I know multi-EL setups work fine on Holesky, but I didn't research it that closely. The question is more for Nimbus devs.

jakubgs · Aug 21 '24 17:08