nimbus-eth2
Exec Layer node getLogs requests stop after EL restart
Describe the bug

While working on a Geth upgrade for a production Beacon Node running v24.2.0-742f15, a weird behavior was discovered when the Geth node was restarted: the `getLogs` requests from the beacon node stopped, and the `eth1_chain_len` metric stopped increasing. When the BN was restarted, the `getLogs` requests resumed. This behavior was reproduced when Geth was restarted again.

Additional context

Logs can be provided upon request from the fleet.
This node is outdated and must be updated to Nimbus >= v24.2.1 before Deneb (5-day deadline).
Good point. We just finished upgrading EL nodes, will do BN next.
Searching the logs, `Failed to obtain the latest block from the EL` is only logged once. After that, there are still periodic `syncEth1Chain tick` logs that start new `eth_getBlockByNumber('latest', false)` requests, but these subsequent requests never seem to finish, time out, or fail.
debug "syncEth1Chain tick"
if bnStatus == BeaconNodeStatus.Stopping:
await m.stop()
return
if m.eth1Chain.hasConsensusViolation:
raise newException(CorruptDataProvider, "Eth1 chain contradicts Eth2 consensus")
let latestBlock = try:
raiseIfNil connection.trackedRequestWithTimeout(
"getBlockByNumber",
rpcClient.eth_getBlockByNumber(blockId("latest"), false),
web3RequestsTimeout)
except CatchableError as err:
warn "Failed to obtain the latest block from the EL", err = err.msg
raise err
- `bnStatus == BeaconNodeStatus.Stopping` cannot be the cause, because that is only set on exit
- `m.eth1Chain.hasConsensusViolation` cannot be the cause, because the locations that set it to `true` produce a unique log line
- `connection.trackedRequestWithTimeout` is still repeatedly hit, as seen by the `Sending message to RPC server` log line immediately after each `syncEth1Chain tick` log
Maybe something weird in `trackedRequestWithTimeout`, related to `or`, that makes it get stuck:

- https://github.com/status-im/nim-chronos/issues/516
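To make that suspicion easier to reason about, here is a minimal, self-contained sketch of the timeout pattern, assuming the wrapper ultimately races the request future against a timer (as chronos's `wait`/`or` combinators do). `fakeRequest` and `requestWithTimeout` are hypothetical stand-ins rather than nimbus-eth2 code: if the underlying request future were never completed and never cancelled after the EL restart, and the timeout machinery itself got stuck, the caller would hang exactly as observed.

```nim
import chronos

proc fakeRequest(): Future[int] {.async.} =
  # Hypothetical stand-in for rpcClient.eth_getBlockByNumber: a future
  # that never completes, simulating an EL connection that silently died.
  let never = newFuture[int]("fakeRequest.never")
  discard await never

proc requestWithTimeout(timeout: Duration): Future[int] {.async.} =
  # chronos `wait` races `fut` against a timer and raises
  # AsyncTimeoutError if the timer fires first. If the timeout path
  # never triggered, the caller would never return, matching the
  # "requests never finish / time out / fail" symptom above.
  let fut = fakeRequest()
  return await fut.wait(timeout)

when isMainModule:
  try:
    discard waitFor requestWithTimeout(2.seconds)
  except AsyncTimeoutError:
    echo "request timed out (the healthy case)"
```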
Do you have the `eth1_latest_head` metric from that time as well? And also `eth1_synced_head`?
Extended the logs here as well, to get a better understanding of what's going on:

- https://github.com/status-im/nimbus-eth2/pull/6049
> Do you have the `eth1_latest_head` metric from that time as well?

Not on the dashboard, but it can be generated.

> And also `eth1_synced_head`?
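As background, a rough sketch of how gauges like these are declared and updated with nim-metrics, the metrics library used across the Nimbus codebase; the help strings and values here are invented for illustration and may not match the real definitions:

```nim
import metrics

# Illustrative declarations only; the real gauges are defined in
# nimbus-eth2 and their help strings may differ.
declareGauge eth1_latest_head,
  "Block number of the latest block reported by the EL"
declareGauge eth1_synced_head,
  "Block number up to which the eth1 deposit chain has synced"

# Made-up values; in the BN these are updated as blocks arrive.
eth1_latest_head.set(19_000_000)
eth1_synced_head.set(18_999_968)
```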
Multi-EL is documented officially without saying it's unstable/beta: https://nimbus.guide/eth1.html#running-multiple-execution-clients
@etan-status, @tersec, @jakubgs, maybe we can try running multi-EL again?
BTW, we are running multi-EL on nimbus.mainnet: https://github.com/status-im/infra-nimbus/blob/42cfc2958a39cc6ec316f1f46d45189cf9ff995f/ansible/group_vars/nimbus.mainnet.yml#L81
As far as I know, multi-EL setups work fine on Holesky, but I haven't researched it that closely. The question is more for the Nimbus devs.
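For anyone trying it, a sketch of what a multi-EL setup looks like per the linked guide: the beacon node accepts the `--el` flag multiple times, one per execution client. The URLs and the JWT secret path below are placeholders, and the exact flag spelling should be checked against the Nimbus version in use.

```sh
# Hypothetical example: one beacon node connected to two execution
# clients, as described at https://nimbus.guide/eth1.html
./run-mainnet-beacon-node.sh \
  --el=http://127.0.0.1:8551 \
  --el=http://192.168.1.2:8551 \
  --jwt-secret=/path/to/jwt.hex
```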