
Error: not able to find Headers static file for block

Open argakiig opened this issue 10 months ago • 7 comments

Describe the bug

The node was syncing, then it crashed due to a long-lived read transaction, and then it restarted.

This is a PVC backed by an NFS volume via Google Filestore.

Steps to reproduce

  1. Sync a node using the Docker image on Kubernetes, with an NFS-backed PVC for data
  2. Watch the logs for long-lived read transaction warnings, followed by an apparent restart (see the command sketch after this list)
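
A minimal sketch of step 2, assuming a hypothetical pod name reth-0 and namespace eth; the pattern matches the libmdbx warning visible in the node logs below:

kubectl logs -f -n eth reth-0 | rg 'Long-lived read transaction has been timed out'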

Node logs

2024-04-02T13:15:50.097537Z DEBUG net::tx: sending `GetPooledTransactions` request to peer's session failed, buffering hashes peer_id="0x20d3…a495" failed_to_request_hashes={0xb1f33474e2549b96b7e66de637b8c506519ef033789a437ea30180f0a6e23349} conn_eth_version=Eth68 client_version=Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.6
2024-04-02T13:15:50.100572Z DEBUG net::tx: failed sending request to peer's session, buffering hashes peer_id="0x20d3…a495" failed_to_request_hashes=RequestTxHashes { hashes: {0xa5da245418158f0be389a2a484d6795743e954e3f9c74c2027b9ca39f64ccc01, 0xb1f33474e2549b96b7e66de637b8c506519ef033789a437ea30180f0a6e23349, 0xbbc592e731b4751629315bd50358ef16e887828077fdbfd880a917b55fd853c3} } conn_eth_version=Eth68
2024-04-02T13:15:50.127650Z DEBUG net::tx: sending `GetPooledTransactions` request to peer's session failed, buffering hashes peer_id="0x6620…e524" failed_to_request_hashes={0xfb97a7a47072a2576bf151b0ce17c5000f7d018eb863833a8a170cc47bf54127, 0x9cd803d229eb283406caa4f3922408fe1500ba81a9d6c038d5b46e086aca1604, 0x8ab61139a9dc29c1d9446126a3f60610a06b51b615908d6e736cc07e83a40063, 0x585bc731fbab56cb2f878717f227fe8a5c96a12d6da5cb92c1204453be8700ef, 0xa22d9f7dbc27a73df6cdfc7fc01d789d2f97c42be21679f2bd4f63be447c9a90, 0x3637d9923a2ba76053215d51804aa8862ca14b7ddffe8e87c46f029b2f666157, 0x355763a6aaad7694771334b0221187ee0b234dcc490835d964e6a7a4a66d40c9, 0xbe1899359218a1a49886e20ccd88c8a01fa860c42d7f7c6aa639f93609130fb6, 0xff168973202947ae37e94885ac85b76f24adb1924313d76056f462e3650b9b17, 0x794620d7e3afb156bb1e6b76b21e636a56ac490c0fa48f0faa0dfeed1e2bcbe4, 0x59e740879d4e5c27ea9ac90c2e3a25dbaae43f9c22b0b2c40b18c6045626f344, 0xd17185ddd59c91ee296db5cf6352eb8ccc1d0cb0aaf5380517e9ce38e803ca41, 0x7b567872521e407e520b07177769315122df633e016fd6479d218e42ef1c74fc} conn_eth_version=Eth68 client_version=Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2
2024-04-02T13:15:50.129965Z DEBUG net::tx: failed sending request to peer's session, buffering hashes peer_id="0x6620…e524" failed_to_request_hashes=RequestTxHashes { hashes: {0x7b567872521e407e520b07177769315122df633e016fd6479d218e42ef1c74fc, 0x3ee7c1a34eca5a9f796d43d656d601ba4e46b4624f9a1c774ad4703a38b8815c, 0x355763a6aaad7694771334b0221187ee0b234dcc490835d964e6a7a4a66d40c9, 0x3637d9923a2ba76053215d51804aa8862ca14b7ddffe8e87c46f029b2f666157, 0xfb97a7a47072a2576bf151b0ce17c5000f7d018eb863833a8a170cc47bf54127, 0x59e740879d4e5c27ea9ac90c2e3a25dbaae43f9c22b0b2c40b18c6045626f344, 0x8ab61139a9dc29c1d9446126a3f60610a06b51b615908d6e736cc07e83a40063, 0xce013dec228fa243ebb8d9e55f0a7aa55d3f5e81e9abcb11a0a2eea94b560acd, 0x9cd803d229eb283406caa4f3922408fe1500ba81a9d6c038d5b46e086aca1604, 0x63ef2e59e4fbe27c330ae1407754cc2abfa3b907316458e54fc2d4a01accef02, 0x59ea08878b6af4f88bd30eb99427fe472d1ae9a85a1a264d39085d6905cd17e0, 0x794620d7e3afb156bb1e6b76b21e636a56ac490c0fa48f0faa0dfeed1e2bcbe4, 0x585bc731fbab56cb2f878717f227fe8a5c96a12d6da5cb92c1204453be8700ef, 0xe9ed83069f4a4dfa75d4654d43c9ea2a983a1f4c2f2d2b6dd4b8787561ca02ef, 0xff168973202947ae37e94885ac85b76f24adb1924313d76056f462e3650b9b17, 0x6197497a23b4bded562a7da10851877fcc5684fcebb086013b650dd07d34d5d2, 0xa22d9f7dbc27a73df6cdfc7fc01d789d2f97c42be21679f2bd4f63be447c9a90, 0xbe1899359218a1a49886e20ccd88c8a01fa860c42d7f7c6aa639f93609130fb6, 0xffa03e5c59cc27e08c572f6186517570057d11886f2d4d38a17ca4f067dc7f81, 0xd17185ddd59c91ee296db5cf6352eb8ccc1d0cb0aaf5380517e9ce38e803ca41} } conn_eth_version=Eth68
2024-04-02T13:15:50.152319Z DEBUG net::tx: sending `GetPooledTransactions` request to peer's session failed, buffering hashes peer_id="0x7e6d…3051" failed_to_request_hashes={0xcb64348c2d92c4903d4b5181c4d172e45d9d7690da629a85895df54b4117a5c2} conn_eth_version=Eth68 client_version=Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.7
2024-04-02T13:15:50.152590Z DEBUG net::tx: sending `GetPooledTransactions` request to peer's session failed, buffering hashes peer_id="0xe229…3d4d" failed_to_request_hashes={0x59c3019c670a6a3d0a731ba113f9770a5b9286ecb524d6aceb339faf8a8b0abb, 0x611717f6cafcfbe8f45d09550e329608da3f98d38ab8e0086af813dce84c51d1, 0xa7cf8b795085c5433b61a5dd2f5cc5b915ebae4fae885db383c3163e4331e2c9, 0x308d9d7487205a40b311284e2210d40c9371251fbb861a942269421f7f120a6c, 0x4ac0478d7683d217ce943a3308e7d9a244d1ce33b864c4f1dad65296010c6546} conn_eth_version=Eth68 client_version=Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2
2024-04-02T13:15:50.154367Z DEBUG net::tx: failed sending request to peer's session, buffering hashes peer_id="0xe229…3d4d" failed_to_request_hashes=RequestTxHashes { hashes: {0x00393d94d268405681d9068203790121d00a5bac408d80e1f5180b2414318fc3, 0x6945b41296423afe1a82ac1933c42a5aad6796432d845aca9d6354dfdcc0569b, 0x66392986d72ffe3ebfb9079b7c9ea4e999d19a7de818400fde00300c45fbbaac, 0x7ff34560287743c6b2051fb6787ab26cd70116241d68dd590077315dd6792902, 0xdbc4ca73d39caddedaf5e7bc5f04e05d49319f20a5c06462af9b083ac956e329, 0x1db6e829d6bc8ab874f1bde170a30d7f08a31a788ca0c5ee6781ec967cc6fb82, 0x00d50e18c9ab8aaef0c07439b0914afe67e14b82a9a389487aecb87baadfe6f9, 0xc982d1cde7d88a8f6582003570367b71c5154f72c636a5dcd9c79c7f6a7387c3, 0x308d9d7487205a40b311284e2210d40c9371251fbb861a942269421f7f120a6c, 0xd1a4ce9400cedc5d863f35ee93b157e1814a483df560ecbcbfea49b4552e082a, 0xa7cf8b795085c5433b61a5dd2f5cc5b915ebae4fae885db383c3163e4331e2c9, 0x150674df2e4f10abe209b6137d0022472912eaa5b28350cec238a4a446d274bd, 0xbcee3fb3b7cf30e81adaaf6bd00c59f619020a298a2390009d33efa71ecf4598, 0x59c3019c670a6a3d0a731ba113f9770a5b9286ecb524d6aceb339faf8a8b0abb, 0x611717f6cafcfbe8f45d09550e329608da3f98d38ab8e0086af813dce84c51d1, 0x5baff86a84dd57554fa738b6ce9fe8ab3bb3d34ca9cd620e41d21b2ebedbc817, 0x6277784a783ac88d313d8a263167b428d2b6937191b67043cf11e1f5a7d057bb, 0x4ac0478d7683d217ce943a3308e7d9a244d1ce33b864c4f1dad65296010c6546} } conn_eth_version=Eth68
2024-04-02T13:15:50.560940Z TRACE rpc::eth: Serving eth_call request=TransactionRequest { from: Some(0x0000000000000000000000000000000000000000), to: Some(0x8315177ab297ba92a06054ce80a67ed4dbd7ed3a), gas_price: None, max_fee_per_gas: None, max_priority_fee_per_gas: None, max_fee_per_blob_gas: None, gas: None, value: None, input: TransactionInput { input: Some(0xeca067ad), data: None }, nonce: None, chain_id: None, access_list: None, transaction_type: None, blob_versioned_hashes: None, sidecar: None, other: OtherFields { inner: {} } } block_number=Some(Number(Number(19568322))) state_overrides=None block_overrides=None
2024-04-02T13:15:53.975584Z  WARN libmdbx: Long-lived read transaction has been timed out open_duration=300.000088517s
2024-04-02T13:15:57.967510Z  WARN libmdbx: Long-lived read transaction has been timed out open_duration=300.000246831s
2024-04-02T14:58:36.375077Z  INFO reth::cli: Starting reth version="0.2.0-beta.4 (c04dbe6e9)"
2024-04-02T14:58:36.375723Z  INFO reth::cli: Opening database path="/data/db"
2024-04-02T14:58:47.051432Z  INFO reth::cli: Configuration loaded path="/data/config.toml"
2024-04-02T14:58:47.165333Z  INFO reth::cli: Database opened
2024-04-02T14:58:47.165370Z  INFO reth::cli: Starting metrics endpoint addr=0.0.0.0:9001
2024-04-02T14:58:47.165484Z DEBUG reth::cli: Initializing genesis chain=mainnet genesis=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3
2024-04-02T14:58:47.171081Z DEBUG reth_node_core::init: Genesis already written, skipping.
2024-04-02T14:58:47.171140Z  INFO reth::cli: 
Pre-merge hard forks (block based):
- Frontier                         @0
- Homestead                        @1150000
- Dao                              @1920000
- Tangerine                        @2463000
- SpuriousDragon                   @2675000
- Byzantium                        @4370000
- Constantinople                   @7280000
- Petersburg                       @7280000
- Istanbul                         @9069000
- MuirGlacier                      @9200000
- Berlin                           @12244000
- London                           @12965000
- ArrowGlacier                     @13773000
- GrayGlacier                      @15050000
Merge hard forks:
- Paris                            @58750000000000000000000 (network is known to be merged)
Post-merge hard forks (timestamp based):
- Shanghai                         @1681338455
- Cancun                           @1710338135
2024-04-02T14:58:47.171162Z DEBUG reth::cli: Spawning stages metrics listener task
2024-04-02T14:58:47.182386Z  WARN provider::static_file: Could not find block or tx number on a range request segment=Headers number=19568228
2024-04-02T14:58:47.182470Z ERROR reth::cli: shutting down due to error
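
A minimal sketch of how to inspect the static files the error refers to, assuming the /data datadir from the logs above and that segments live under <datadir>/static_files (the directory name is an assumption and may vary by version):

ls -l /data/static_files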

Platform(s)

Linux (x86)

What version/commit are you on?

0.2.0-beta.4 (c04dbe6e9)

What database version are you on?

Current database version: 2 Local database version: 2
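
For reference, a sketch of how that output is obtained, assuming the /data datadir from the logs above:

reth db version --datadir /data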

What type of node are you running?

Archive (default)

What prune config do you use, if any?

N/A

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

  • [X] I agree to follow the Code of Conduct

argakiig avatar Apr 08 '24 17:04 argakiig

Here are logs for a "WARN Could not find block ...". In that zip file are reth.log and reth.log.starting-reth-version.could-not-find-block, which was created with:

kendall@ethel 24-04-09T18:38:53.440Z:~/eth2-data/reth/holesky
$ rg 'Starting reth version|Could not find block' reth.log > reth.log.starting-reth-version.could-not-find-block
kendall@ethel 24-04-09T18:39:20.716Z:~/eth2-data/reth/holesky

When you look at reth.log.starting-reth-version.could-not-find-block, the file has 3350 "Starting reth" lines from "2024-04-08T17:26:10.127770Z" to "2024-04-09T00:42:05.917330Z". After "2024-04-09T00:42:05.917330Z" I stopped reth when I discovered that my node was rebooting because of the "Failed to insert block" issue that's being tracked in issue 7430. Eventually I dropped the merkle stage and restarted reth at "2024-04-09T03:47:22.649347Z".
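
For context, a sketch of the stage drop described above; the exact invocation is an assumption (not shown in this thread), using reth's stage drop subcommand with the chain and datadir from the shell prompt:

reth stage drop merkle --chain holesky --datadir ~/eth2-data/reth/holesky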

Then today I thought I'd add a panic when an InsertBlockError occurs. I did that and accidentally started reth.service, which has the unmodified version "0.2.0-beta.5 (55f75cdcc)". I stopped it and then started my modified version "0.2.0-beta.5-dev (cd1c2bf99)", and then we see the "Could not find block".

So I'm guessing the problem related to this issue occurred sometime between "2024-04-09T17:21:33.262346Z" (i.e. when I stopped reth.service) and "2024-04-09T17:24:14.711716Z", when the "Could not find block" happened.

2024-04-08T17:26:10.127770Z  INFO reth::cli: Starting reth version="0.2.0-beta.5 (54f75cdcc)"
...
2024-04-09T00:42:05.917330Z  INFO reth::cli: Starting reth version="0.2.0-beta.5 (54f75cdcc)"
2024-04-09T03:47:22.649347Z  INFO reth::cli: Starting reth version="0.2.0-beta.5 (54f75cdcc)"
2024-04-09T17:21:33.262346Z  INFO reth::cli: Starting reth version="0.2.0-beta.5 (54f75cdcc)"
2024-04-09T17:24:14.670824Z  INFO reth::cli: Starting reth version="0.2.0-beta.5-dev (cd1c2bf99)"
2024-04-09T17:24:14.711716Z  WARN provider::static_file: Could not find block or tx number on a range request segment=Headers number=1311784

Hopefully these logs and maybe my observations are helpful.

winksaville avatar Apr 09 '24 19:04 winksaville

> Here are logs for a "WARN Could not find block ...". […] Hopefully these logs and maybe my observations are helpful.

Looking through the logs, it seems like yours presented differently than mine, and considerably sooner. Then again, it seems you are also using holesky?

argakiig avatar Apr 09 '24 20:04 argakiig

> Looking through the logs, it seems like yours presented differently than mine, and considerably sooner. Then again, it seems you are also using holesky?

Yes, I'm on holesky.

winksaville avatar Apr 09 '24 22:04 winksaville

This issue is stale because it has been open for 21 days with no activity.

github-actions[bot] avatar May 01 '24 01:05 github-actions[bot]

This issue is still present in beta.6. My node's logs

The RPC endpoint became unresponsive, so I restarted reth via systemd, and it was broken when it started back up.

What is the best way to go about fixing my node? Can I rewind or will I require a resync?

vincenttaglia avatar May 10 '24 02:05 vincenttaglia

@vincenttaglia Looking at your logs, it could be that the prune checkpoint didn't save properly when the process was killed. It would be good to enable trace log level for the rpc target to try and get more info on what work the node is doing when the RPC is unresponsive. Other interesting targets would be consensus and pruner.

RUST_LOG=rpc,consensus,pruner=trace <your command to run the node>

emhane avatar May 17 '24 13:05 emhane

Also the targets provider::static_file and static_file; a combined invocation is sketched below.
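
Putting the two suggestions together, a sketch of a combined filter, assuming standard tracing EnvFilter syntax; reth node stands in for your own run command:

RUST_LOG=rpc=trace,consensus=trace,pruner=trace,provider::static_file=trace,static_file=trace reth node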

emhane avatar May 17 '24 13:05 emhane

> @vincenttaglia Looking at your logs, it could be that the prune checkpoint didn't save properly when the process was killed. It would be good to enable trace log level for the rpc target to try and get more info on what work the node is doing when the RPC is unresponsive. Other interesting targets would be consensus and pruner.
>
> RUST_LOG=rpc,consensus,pruner=trace <your command to run the node>

It happened again yesterday. I have trace logs on, but I'm not sure my logs caught anything from the period when it started to go unresponsive. Nevertheless, here are my logs with rpc::eth lines removed to fit in a gist. I can also upload the full logs if that would help debug.

I am using reth for indexing, so there are a lot of RPC calls, but I would expect the node to drop some of those queries before dropping the sync procedure.

https://gist.github.com/vincenttaglia/4e6d678563ebef07543d5e0bb2d6219d

vincenttaglia avatar May 23 '24 15:05 vincenttaglia

Same error... My server shut down unexpectedly, and when restarting, reth cannot start, failing with "ERROR shutting down due to error" Error: not able to find Headers static file for block number 19979823.

matic0209 avatar May 30 '24 03:05 matic0209

> @vincenttaglia Looking at your logs, it could be that the prune checkpoint didn't save properly when the process was killed. It would be good to enable trace log level for the rpc target to try and get more info on what work the node is doing when the RPC is unresponsive. Other interesting targets would be consensus and pruner.
>
> RUST_LOG=rpc,consensus,pruner=trace <your command to run the node>

How do I fix my node, please?

matic0209 avatar May 30 '24 03:05 matic0209

The last canonicalized head is logged as block 19917981:

May 21 11:27:52 reth start-reth.sh[1615]: 2024-05-21T11:27:52.334848Z DEBUG consensus::engine: Canonicalized new head hash=0x2312348964cf8cc1a2f7cfbbab7ff8210c2dedcb0852cbd2d8a47e6504f6831a number=19917981

so I'm wondering why, on startup, the node looks for block 19922274:

May 23 15:26:50 reth start-reth.sh[59660]: Error: not able to find Headers static file for block number 19922274

possibly related to https://github.com/paradigmxyz/reth/issues/8203

emhane avatar May 31 '24 17:05 emhane

Could you try and repro again with trace enabled for the targets provider::static_file and static_file as well, or just enabled for all targets, so we can get all the available context for the bug, @vincenttaglia.

emhane avatar May 31 '24 17:05 emhane

This issue is stale because it has been open for 21 days with no activity.

github-actions[bot] avatar Jun 22 '24 01:06 github-actions[bot]

This should theoretically not happen anymore. Feel free to reopen if it still does in the latest version.

joshieDo avatar Jul 12 '24 08:07 joshieDo