Syncing Base archive nodes - MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD potentially too low
Describe the bug
Over the last few days we've had issues with Base archive nodes being unable to get back to head from a snapshot. I've read a few similar threads. We're now running the new i7ie AWS instances, which are fast enough to get to head, but I noticed one thing in my travels that may be impacting Base's sync, and I thought it worthwhile to get an expert's opinion.
During my investigation I found that the MerkleExecute stage has a constant: MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD = 5000 https://github.com/paradigmxyz/reth/blob/main/crates/stages/stages/src/stages/merkle.rs#L43C11-L43C47
From what I can tell, this causes Reth to rebuild the Merkle data when it's syncing more than 5000 blocks from head. On Base, this rebuild takes 2-3 hours on an AWS NVMe SSD (MerkleExecute stage_progress=0.07% stage_eta=2h 44m 13s, at ~6000 blocks from head), and on Base, 5000 blocks pass every ~2 hours (compared to ~16 hours for Ethereum). This causes nodes to get "stuck" just beyond the 5000 block threshold; if they get back within it, MerkleExecute finishes very quickly (I just watched an i7ie take 13 minutes for 4000 blocks).
I don't know enough about the Merkle data to know for sure if it makes sense to double (or make configurable) that threshold for Base, but one of you will :) Let me know if I've read the situation incorrectly, but I thought it might be useful for others that hit this issue.
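To put numbers on the block-time gap described above, here's a quick sanity check of how long each chain takes to produce 5000 blocks. This is just arithmetic, not reth code, and it assumes the well-known nominal block times (~2s for Base, ~12s for Ethereum):

```rust
// How long a window of N blocks takes to be produced, given the chain's
// block time. Used to compare the 5000-block clean threshold against the
// 2-3h MerkleExecute runtime reported in this issue.
fn window_duration_secs(blocks: u64, block_time_secs: u64) -> u64 {
    blocks * block_time_secs
}

fn main() {
    let base_hours = window_duration_secs(5_000, 2) as f64 / 3600.0;
    let eth_hours = window_duration_secs(5_000, 12) as f64 / 3600.0;
    // Base produces 5000 blocks in roughly 2.8h. If the clean rebuild also
    // takes 2-3h, the node finishes about where it started relative to head,
    // so it never escapes the rebuild path. On Ethereum the same window is
    // ~16.7h, which leaves plenty of slack.
    println!("Base: {base_hours:.1}h, Ethereum: {eth_hours:.1}h");
}
```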
Steps to reproduce
Sync a Base archive node from snapshot on an i3en aws instance
Node logs
Platform(s)
No response
Container Type
Kubernetes
What version/commit are you on?
op-reth:v1.2.0
What database version are you on?
Unsure
Which chain / network are you on?
Base Mainnet
What type of node are you running?
Archive (default)
What prune config do you use, if any?
No response
If you've built Reth from source, provide the full command you used
No response
Code of Conduct
- [x] I agree to follow the Code of Conduct
Potentially relates to https://github.com/paradigmxyz/reth/issues/11306 and https://github.com/paradigmxyz/reth/issues/14515
@jamesstanleystewart could you specify which snapshot you've used?
@jamesstanleystewart In addition to which snapshot, do you have timings per stage?
Would also be helpful to have all the flags, memory, cpu you are using on op-reth & op-node.
We've been running Reth for Base for about 6 months now. I think we most likely used a PublicNode snapshot originally, but we now use our own internal ones.
As for timings, when the node struggles to get in sync, MerkleExecute takes 90%+ of the ~3 hour sync time. Once it crosses the 5000 block threshold, MerkleExecute takes something like 50% of the total (much shorter - 15 mins or less) sync time.
To be clear, moving to the new AWS i7ie family has provided us enough extra disk performance to get into sync. I assume this is because it can do the Merkle rebuild faster and cross the 5000 block chasm.
CPU/Memory: previously they ran on i3en.2xlarge or i3en.3xlarge machines (we tried both); now they run on i7ie.2xlarge successfully.
op-reth:
--datadir=/data
--ipcpath=/data/reth.ipc
<http and ws api args>
--max-inbound-peers=50
--max-outbound-peers=50
--rollup.disable-tx-pool-gossip
--rollup.sequencer-http=https://mainnet-sequencer.base.org
--chain=base
--authrpc.addr=127.0.0.1
--authrpc.port=8551
--authrpc.jwtsecret=/tmp/op/jwt-secret.txt
--rpc.gascap=500000000
--rpc.max-response-size=500
--metrics=0.0.0.0:9090
op-node
OP_NODE_L2_ENGINE_KIND: reth
OP_NODE_RPC_ADDR: 0.0.0.0
OP_NODE_RPC_PORT: 9545
OP_NODE_P2P_BOOTNODES: <some bootnodes>
OP_NODE_P2P_LISTEN_IP: 0.0.0.0
OP_NODE_P2P_LISTEN_TCP_PORT: 30303
OP_NODE_P2P_LISTEN_UDP_PORT: 30303
OP_NODE_P2P_ADVERTISE_TCP: 30303
OP_NODE_P2P_ADVERTISE_UDP: 30303
OP_NODE_P2P_PEERSTORE_PATH: /p2p/peers
OP_NODE_P2P_DISCOVERY_PATH: /p2p/discovery
OP_NODE_P2P_DISABLE: false
OP_NODE_P2P_NO_DISCOVERY: false
OP_NODE_METRICS_ENABLED: true
OP_NODE_METRICS_ADDR: 0.0.0.0
OP_NODE_METRICS_PORT: 9091
OP_NODE_L2_ENGINE_AUTH: /tmp/op/jwt-secret.txt
OP_NODE_VERIFIER_L1_CONFS: 0
OP_NODE_LOG_FORMAT: json
OP_NODE_LOG_LEVEL: info
OP_NODE_PPROF_ENABLED: false
OP_NODE_PPROF_PORT: 6666
OP_NODE_PPROF_ADDR: 0.0.0.0
OP_NODE_L2_ENGINE_RPC: http://localhost:8551
OP_NODE_NETWORK: base-mainnet
OP_NODE_L1_ETH_RPC: <our l1 node>
OP_NODE_L1_TRUST_RPC: false
OP_NODE_L1_BEACON: <our l1 node>
OP_NODE_L1_BEACON_IGNORE: false
OP_NODE_ROLLUP_LOAD_PROTOCOL_VERSIONS: true
OP_NODE_SYNCMODE: execution-layer
here's a chart showing the block number and sync improvement when we moved to i7ie
I'm wondering whether your peer availability is what's limiting your throughput/speed. It would be helpful if you could provide per-stage timings; I see that Execution, for example, also takes some time, but yes, the Merkle stage is the longest. That makes the node fall behind forever.
FYI:
I tried the following on commit a38c991c363d241894867a89324b8670be2f6a44:
pub const MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD: u64 = 100_000;
Self::Execution { clean_threshold: MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD }
With MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD -> 100K, the node still does range-based syncing and performs horribly on all stages.
I'm using i7ie.12xlarge in EKS with a local disk mounted, and this is the only daemon running on the node (I also have some nodes running on io2 with very high IOPS). Only the local-disk option has worked; EBS (gp3/io2) has not.
Memory: 128 GiB CPU: 22000m
The resources are definitely underutilized:
I wonder: if the stages processed less at a time, would they end up being faster overall?
Hey everyone!
Just wanted to confirm that we are running into the same problem as the author of the issue. Currently MerkleExecute takes 2-3 hours to finish with the default threshold of 5000. We are currently using i4i and i7ie instance types from AWS.
At first we were trying to get a Base archive node in sync on EBS (gp3/io2), but it would never catch up (as described in other issues on this repo), so we switched to instances with attached storage (i4i/i7ie).
While the stage is slow, the node is able to catch up 12-24 hours after a snapshot has been applied. Based on observation, I think the default value for op-reth probably needs to be increased slightly so that the initial catch-up from a snapshot works as expected. The current time MerkleExecute takes is very close to the average time it takes for 5000 blocks to be produced on the Base blockchain.
Let me know if we can provide more info!
op-node version: v1.12.2
op-reth version: v1.3.7 (just updated, before we were on v1.3.4)
We are also seeing the same behavior which causes us issues getting nodes synced to head. I tried to increase the threshold for Merkle from 5000 to 30000 and it seems to solve the issue and the node is getting closer to be in sync.
I changed the Merkle threshold at the rightmost annotation line (blue dashed)
What is the reason for this threshold? Are there any implications on resource usage when I change this? Are there any implications on the data on the node or are all blocks, states, etc still solid when I change this? Any other reasons why I shouldn't increase this threshold?
This 30k trick worked for me too. I'm also running a local node with only 64GB of memory. To get over this hump, I also had to add a swapfile and turn off the OOM killer.
Can we expect a fix in the next release?
The way the merkle stage works right now is:
- We check the range of blocks being executed and compare it with the threshold. If it is lower than the threshold, we do an "incremental" calculation, which just updates the trie based on the state updates in that range of blocks. The incremental calculation uses lots of memory when run over a large range of blocks.
- If the range of blocks is larger than the threshold, we redo the entire root calculation, because the full root calculation has a mode that can stop and resume after committing data, allowing it to make progress over lots of data without running into memory limits.
This is the current reason for the threshold. While it's somewhat surprising to me that the algorithm is performant enough when running partly on swap, it makes sense that it's faster than clean root calculation @godsflaw
I think the path forward here is for us to implement a way for the "incremental" calculation to be actually incremental, making progress while committing to the DB and avoiding OOMs.
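The incremental-vs-clean decision described above boils down to a range check against the threshold. A hedged sketch (illustrative only, not reth's actual implementation; names are invented):

```rust
// Sketch of the merkle stage's mode selection as described in this thread:
// small ranges get an in-memory incremental trie update, larger ranges
// trigger a full clean rebuild that can checkpoint and resume.
const CLEAN_THRESHOLD: u64 = 5_000; // reth's current default

#[derive(Debug, PartialEq)]
enum MerkleMode {
    /// Apply state updates directly to the trie; fast, but memory-hungry
    /// when the block range is large.
    Incremental,
    /// Recompute the root from scratch; slow, but commits intermediate
    /// progress so it never holds the whole computation in memory.
    CleanRebuild,
}

fn choose_mode(from_block: u64, to_block: u64, threshold: u64) -> MerkleMode {
    if to_block.saturating_sub(from_block) > threshold {
        MerkleMode::CleanRebuild
    } else {
        MerkleMode::Incremental
    }
}

fn main() {
    // 6000 blocks behind head: falls onto the slow clean-rebuild path,
    // which on Base takes about as long as 5000 new blocks take to arrive.
    assert_eq!(choose_mode(0, 6_000, CLEAN_THRESHOLD), MerkleMode::CleanRebuild);
    // 4000 blocks behind: the fast incremental path.
    assert_eq!(choose_mode(0, 4_000, CLEAN_THRESHOLD), MerkleMode::Incremental);
}
```

This also makes the "stuck just beyond the threshold" behavior visible: raising the threshold (e.g. to 30000) simply widens the band in which the fast path applies, at the cost of more memory during the incremental update.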
We're experiencing the same issue, and applying the 30k patch didn't help in our case. We're constantly lagging 7k blocks behind, with the MerkleExecute stage taking ~3-4h per run.
Is there any way to easily determine if the node is actually using the incremental calculation, or still applying the clean root calc?
I just experienced the same issue on an Epyc 7773X with 4x SN850X 8TB NVMe drives. The hardware should be more than enough to handle any blockchain node.
When I was using the default config from the Base RPC snapshot on docs.base.org, the sync would be stuck ~5000 blocks behind live, with MerkleExecute taking 3 hours, which is also roughly how long 5000 blocks take to be produced.
I fixed the issue by changing the following settings in reth.toml
[stages.merkle]
clean_threshold = 30000
After that, the ~5000 block MerkleExecute stage that previously took 3 hours finished in 8 minutes, and the node was able to catch up to live quickly after that.
I looked into the code, and I actually don't think changing MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD would fix it: that constant is only used in the debug merkle command, specifically in MerkleStage::default_execution(), which is not what's used during node sync.
Instead, I think we should target the default setting passed to MerkleStage::new_execution() in crates/stages/stages/src/sets.rs. The default setting for this comes from https://github.com/paradigmxyz/reth/blob/4a6b2837e6c1662d2b21a06638b5e1228d986c6e/crates/config/src/config.rs#L322.
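For reference, a minimal Rust sketch of that config plumbing, assuming the behavior described in this thread: the stage reads clean_threshold from the [stages.merkle] section of reth.toml, and the built-in default is 5000. The struct shape here is illustrative; the real definition lives in crates/config/src/config.rs:

```rust
// Illustrative stand-in for reth's merkle stage config. The only point
// being made: the value the node actually syncs with comes from config,
// not from the MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD constant.
#[derive(Debug, Clone, Copy)]
struct MerkleConfig {
    /// Block ranges larger than this trigger a full clean root rebuild.
    clean_threshold: u64,
}

impl Default for MerkleConfig {
    fn default() -> Self {
        Self { clean_threshold: 5_000 }
    }
}

fn main() {
    // Equivalent of overriding it in reth.toml:
    //   [stages.merkle]
    //   clean_threshold = 30000
    let patched = MerkleConfig { clean_threshold: 30_000 };
    assert_eq!(MerkleConfig::default().clean_threshold, 5_000);
    assert!(patched.clean_threshold > MerkleConfig::default().clean_threshold);
}
```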
I fixed the issue by changing the following settings in reth.toml
I also noticed this after unsuccessfully setting MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD in the code, and increasing the clean_threshold to 30k finally let our nodes sync as well.
In the process I however noticed another item that I would consider a bug:
When restarting a node that was doing the clean root calc with the modified setting, it would continue where it left off, but it would then do the incremental sync not over the range between snapshot and target (as used across all other stages), but over the range between the latest root-calc checkpoint and target, i.e. a potentially much larger block range than the specified 30k blocks (at least that's my interpretation, without crunching through the details in the code). In one instance, this resulted in the node ending up in an OOM loop.
I think it would be better to consider the current MerkleExecute range on restarts and re-evaluate which method should be applied, rather than only looking at the range to be covered by the stages.
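The suggestion above could look something like this sketch (all names hypothetical, not reth's actual API): on restart, re-derive the remaining merkle range from the stage's own checkpoint and re-pick the method from that, instead of from the pipeline-wide range:

```rust
// Remaining work for the merkle stage, measured from its own checkpoint.
// After an interrupted clean rebuild this can be far larger than the
// 30k-block pipeline range, which is what caused the reported OOM loop
// when the incremental path was chosen anyway.
fn remaining_merkle_range(stage_checkpoint: u64, target: u64) -> u64 {
    target.saturating_sub(stage_checkpoint)
}

/// Re-evaluate incremental vs clean rebuild using the *remaining* range.
fn should_rebuild_on_restart(stage_checkpoint: u64, target: u64, threshold: u64) -> bool {
    remaining_merkle_range(stage_checkpoint, target) > threshold
}

fn main() {
    // Interrupted clean rebuild: 200k blocks still uncommitted, so the
    // safe choice is to resume the rebuild, not go incremental.
    assert!(should_rebuild_on_restart(0, 200_000, 30_000));
    // Nearly caught up after resuming: incremental is safe again.
    assert!(!should_rebuild_on_restart(195_000, 200_000, 30_000));
}
```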
I am also in an OOM loop. I tried both the MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD and clean_threshold fixes.
Note that I never reached the end of the stages. It crashed with an OOM at MerkleExecute while syncing from the official snapshot from the Base website.
Last log:
{"timestamp":"2025-05-14T05:29:13.099333Z","level":"INFO","fields":{"message":"Status","connected_peers":33,"stage":"MerkleExecute","checkpoint":30078351,"target":"30190674"}}
i3en.2xlarge (64GB RAM, local NVMe), reth v1.3.12, Base blockchain
@phil-k1 Did you manage to fix the OOM loop?
Chiming in to comment that I was affected by this issue in Base as well. Fixed by:
[stages.merkle]
clean_threshold = 30000
The fix was to set MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD / clean_threshold back to the default, let the current iteration of MerkleExecute complete, then shut down the node before the next MerkleExecute iteration started and set clean_threshold = 30000 again.
BTW we're working on this here: https://github.com/paradigmxyz/reth/pull/16178
Which should allow people to make progress on merkle stage on base with no oom loops and a much higher clean threshold
when's this releasing? can test on our replicas
@exp0nge likely early next week, testing would be greatly appreciated! We've only tested on our own boxes which are fairly fast, so it would be great for people who are experiencing this now to check that main now fixes the issue.
I'm trying this on:
- i7ie.12xlarge
- Memory Limit + Requests: 128Gi
- CPU Requests: 800m
- Local disk
- Build commit: 67e3c11
- Args:
op-reth node -vvv --log.stdout.format log-fmt --ws --ws.port=8546 --ws.addr=0.0.0.0 --ws.origins=* --ws.api=debug,eth,net,txpool --http --http.port=8545 --http.addr=0.0.0.0 --http.corsdomain=* --authrpc.addr=0.0.0.0 --authrpc.jwtsecret=/mnt/secrets/jwt --authrpc.port=8551 --verbosity --rollup.disable-tx-pool-gossip --port=30303 --rollup.sequencer-http=https://mainnet-sequencer.base.org --datadir=/mnt/data/datadir/reth --chain=base --http.api=eth,net,debug,net,txpool --metrics=0.0.0.0:6060 --rollup.discovery.v4
I get logs around (trailing ~2 hrs):
ts=2025-06-20T17:14:56.0438006Z level=info target=reth::cli message=Status connected_peers=9 stage=MerkleExecute checkpoint=31808643 target=31824255
There was 1 OOMKilled event, so I will try more memory, but what is the expected target?
@exp0nge hmm, there should be no OOM event on 128G of ram, this should take max 12G, recommended 16G. Can you show any grafana dashboards, specifically the jemalloc memory section, the Sync progress (stage progress in %) and Sync progress (stage progress as highest block number reached) sections? As well as any debug logs from around that time, especially sync::stages::merkle::exec
Could you also share the reth.toml?
@exp0nge thanks, I think that memory usage chart may be misleading due to our usage of mmap, ie, I'm not sure that when it reaches 128G, that it would always cause an OOM. But thank you for the info. Just to confirm, outside of the OOM, is the node making progress? How does it compare to before?
This replica was not making progress fast enough; it always ended up lingering ~2 hours behind. However, another replica I have did make progress even before this PR, so I just reset the "bad" one to see if the same issue persists using the PR you provided.
@exp0nge sounds good, keep me up to date!
I have seen the OOM issue on earlier versions of reth when the SSD is too slow; something builds up and doesn't clear like it should. I don't think your i7i instance will cut it, so check the actual read/write speeds. I can recommend dedicated instances at Hetzner or OVH instead; you get proper NVMe drives there.