lodestar icon indicating copy to clipboard operation
lodestar copied to clipboard

Lodestar sends same UNAVAILABLE block to EL at-infinitum

Open dapplion opened this issue 2 years ago • 4 comments

Describe the bug

In gc-merge-devnet-1 incident, Lodestar spent hours hammering the same block over and over to Nethermind. @g11tech is this expected behaviour?

  • More context https://github.com/gnosischain/consensus-deployment-ansible/issues/9
  • Logs from ssh [email protected] merge-devnet-1-lodestar-nethermind-1

Also metrics like lodestar_execution_engine_http_client_request_errors_total are not collected, review

Longer chunk of logs: lodestar_logs_47316_block.txt

Jul-03 06:31:46.129[CHAIN]           error: Block error slot=47311, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:31:58.417[CHAIN]           error: Block error slot=47311, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:32:04.396[CHAIN]           error: Block error slot=47311, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=ELERROR, errorMessage=JSON RPC error: Authentication e
rror, engine_newPayloadV1
Jul-03 06:32:25.161[CHAIN]           error: Error pushing notifyForkchoiceUpdate() headBlockHash=0x9d150824683912b2dcffee609d70892ca0e254eb5d0a95fbb3af760b81a6615c, finalizedBlockHash=0xc8d43bc8019581a21
77a99cbbee1e6764d4d5551a9f0b8051dca756d9916e334 Timeout request
Jul-03 06:32:26.511[CHAIN]           error: Block error slot=47313, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:32:38.737[CHAIN]           error: Block error slot=47313, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:32:51.170[CHAIN]           error: Block error slot=47313, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:33:04.369[CHAIN]           error: Error pushing notifyForkchoiceUpdate() headBlockHash=0x1d65020475ccd4e6ce36034300c12d6e499f82019bbf1fb462a495d4df58f0b1, finalizedBlockHash=0xc8d43bc8019581a21
77a99cbbee1e6764d4d5551a9f0b8051dca756d9916e334 Timeout request
Jul-03 06:33:05.428[CHAIN]           error: Block error slot=47314, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:33:19.460[CHAIN]           error: Block error slot=47315, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=ELERROR, errorMessage=JSON RPC error: Authentication e
rror, engine_newPayloadV1
Jul-03 06:33:41.260[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:33:53.806[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:34:06.049[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:34:18.298[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:34:30.721[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:34:42.963[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:34:47.195[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:34:53.023[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=ELERROR, errorMessage=JSON RPC error: Authentication e
rror, engine_newPayloadV1
Jul-03 06:34:58.874[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:35:01.544[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:35:08.373[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:35:13.809[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:35:25.997[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=ELERROR, errorMessage=JSON RPC error: Authentication e
rror, engine_newPayloadV1
Jul-03 06:35:35.114[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=ELERROR, errorMessage=JSON RPC error: Authentication e
rror, engine_newPayloadV1
Jul-03 06:35:37.610[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:35:49.983[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Timeout request
Jul-03 06:35:51.604[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:35:59.460[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=ELERROR, errorMessage=JSON RPC error: Authentication e
rror, engine_newPayloadV1
Jul-03 06:36:01.934[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:36:04.614[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:36:13.623[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=ELERROR, errorMessage=JSON RPC error: Authentication e
rror, engine_newPayloadV1
Jul-03 06:36:16.322[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:36:19.097[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:36:24.209[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:36:27.143[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:36:30.091[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.
Jul-03 06:36:32.123[CHAIN]           error: Block error slot=47316, errCode=BLOCK_ERROR_EXECUTION_ERROR code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=INVALID, errorMessage=Block 47941 (0x8fcc12fc15a72e45f
e3a2b909bf867542a27fafd9394be540f95be3c7e136c3e) is known to be a part of an invalid chain.

Expected behavior

Steps to Reproduce

Screenshots

Desktop (please complete the following information):

  • OS:
  • Version:
  • Branch:
  • Commit hash:

dapplion avatar Jul 06 '22 21:07 dapplion

@dapplion yes this is the current expected behavior, if EL is unavailable or not responding or throwing error on a block, process block/process segment fails and it gets retried., and will continue to be retried especially if it's a part of canonical chain.

I think we introduced some retry logic which has some wait between retries and hence should pile less, but that PR by @dadepo has never got merged. But it will only reduce pileup.

g11tech avatar Jul 07 '22 03:07 g11tech

~~Metrics i will review and add 👍~~ I think that PR (retry PR) also had metrics regarding the requests.

g11tech avatar Jul 07 '22 03:07 g11tech

but that PR by @dadepo has never got merged.

Which PR?

dapplion avatar Jul 07 '22 13:07 dapplion

but that PR by @dadepo has never got merged.

Which PR?

this one: https://github.com/ChainSafe/lodestar/pull/3854, i think @dadepo was waiting for your final :heavy_check_mark: . will rebase and push

g11tech avatar Jul 07 '22 13:07 g11tech

Closing issues for specific incidents on old versions of Lodestar, please re-open if it happens with latest stable

dapplion avatar May 29 '23 07:05 dapplion