Syncing Base archive nodes - MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD potentially too low
Describe the bug
Over the last few days we've had issues with Base archive nodes being unable to get back to head from a snapshot. I've read a few similar threads. We're now running the new i7ie AWS instances, which are fast enough to get to head, but I noticed one thing in my travels that may be impacting Base's sync, and I thought it worthwhile to get an expert's opinion.
During my investigation I found that the MerkleExecute stage has a constant: MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD = 5000 https://github.com/paradigmxyz/reth/blob/main/crates/stages/stages/src/stages/merkle.rs#L43C11-L43C47
From what I can tell, this causes Reth to rebuild the Merkle data when it's syncing more than 5000 blocks from head. On Base, this rebuild takes 2-3 hours on an AWS NVMe SSD (MerkleExecute stage_progress=0.07% stage_eta=2h 44m 13s, at ~6000 blocks from head), and on Base, 5000 blocks pass every ~2 hours (compared to ~16 hours for Ethereum). This causes nodes to get "stuck" just beyond the 5000 block threshold; if they get back within it, MerkleExecute finishes very quickly (I just watched an i7ie take 13 minutes for 4000 blocks).
I don't know enough about the Merkle data to know for sure if it makes sense to double (or make configurable) that threshold for Base, but one of you will :) Let me know if I've read the situation incorrectly, but I thought it might be useful for others that hit this issue.
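To put numbers on the block-time gap described above, here's a quick sanity check of how long each chain takes to produce 5000 blocks. This is just arithmetic, not reth code, and it assumes the well-known nominal block times (~2s for Base, ~12s for Ethereum):

```rust
// How long a window of N blocks takes to be produced, given the chain's
// block time. Used to compare the 5000-block clean threshold against the
// 2-3h MerkleExecute runtime reported in this issue.
fn window_duration_secs(blocks: u64, block_time_secs: u64) -> u64 {
    blocks * block_time_secs
}

fn main() {
    let base_hours = window_duration_secs(5_000, 2) as f64 / 3600.0;
    let eth_hours = window_duration_secs(5_000, 12) as f64 / 3600.0;
    // Base produces 5000 blocks in roughly 2.8h. If the clean rebuild also
    // takes 2-3h, the node finishes about where it started relative to head,
    // so it never escapes the rebuild path. On Ethereum the same window is
    // ~16.7h, which leaves plenty of slack.
    println!("Base: {base_hours:.1}h, Ethereum: {eth_hours:.1}h");
}
```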
Steps to reproduce
Sync a Base archive node from snapshot on an i3en aws instance
Node logs
Platform(s)
No response
Container Type
Kubernetes
What version/commit are you on?
op-reth:v1.2.0
What database version are you on?
Unsure
Which chain / network are you on?
Base Mainnet
What type of node are you running?
Archive (default)
What prune config do you use, if any?
No response
If you've built Reth from source, provide the full command you used
No response
Code of Conduct
- [x] I agree to follow the Code of Conduct
Potentially relates to https://github.com/paradigmxyz/reth/issues/11306 and https://github.com/paradigmxyz/reth/issues/14515
@jamesstanleystewart could you specify which snapshot you've used?
@jamesstanleystewart In addition to which snapshot, do you have timings per stage?
Would also be helpful to have all the flags, memory, cpu you are using on op-reth & op-node.
We've been running Reth for Base for about 6 months now. I think we most likely used a PublicNode snapshot originally, but we now use our own internal ones.
As for timings, when the node struggles to get in sync, MerkleExecute takes 90%+ of the ~3 hour sync time. Once it crosses the 5000 block threshold, MerkleExecute takes something like 50% of the total (much shorter - 15 mins or less) sync time.
To be clear, moving to the new AWS i7ie family has provided us enough extra disk performance to get into sync. I assume this is because it can do the Merkle rebuild faster and cross the 5000 block chasm.
CPU/Memory: previously they ran on i3en.2xlarge or i3en.3xlarge machines (we tried both); now they run on i7ie.2xlarge successfully.
op-reth:
--datadir=/data
--ipcpath=/data/reth.ipc
<http and ws api args>
--max-inbound-peers=50
--max-outbound-peers=50
--rollup.disable-tx-pool-gossip
--rollup.sequencer-http=https://mainnet-sequencer.base.org
--chain=base
--authrpc.addr=127.0.0.1
--authrpc.port=8551
--authrpc.jwtsecret=/tmp/op/jwt-secret.txt
--rpc.gascap=500000000
--rpc.max-response-size=500
--metrics=0.0.0.0:9090
op-node
OP_NODE_L2_ENGINE_KIND: reth
OP_NODE_RPC_ADDR: 0.0.0.0
OP_NODE_RPC_PORT: 9545
OP_NODE_P2P_BOOTNODES: <some bootnodes>
OP_NODE_P2P_LISTEN_IP: 0.0.0.0
OP_NODE_P2P_LISTEN_TCP_PORT: 30303
OP_NODE_P2P_LISTEN_UDP_PORT: 30303
OP_NODE_P2P_ADVERTISE_TCP: 30303
OP_NODE_P2P_ADVERTISE_UDP: 30303
OP_NODE_P2P_PEERSTORE_PATH: /p2p/peers
OP_NODE_P2P_DISCOVERY_PATH: /p2p/discovery
OP_NODE_P2P_DISABLE: false
OP_NODE_P2P_NO_DISCOVERY: false
OP_NODE_METRICS_ENABLED: true
OP_NODE_METRICS_ADDR: 0.0.0.0
OP_NODE_METRICS_PORT: 9091
OP_NODE_L2_ENGINE_AUTH: /tmp/op/jwt-secret.txt
OP_NODE_VERIFIER_L1_CONFS: 0
OP_NODE_LOG_FORMAT: json
OP_NODE_LOG_LEVEL: info
OP_NODE_PPROF_ENABLED: false
OP_NODE_PPROF_PORT: 6666
OP_NODE_PPROF_ADDR: 0.0.0.0
OP_NODE_L2_ENGINE_RPC: http://localhost:8551
OP_NODE_NETWORK: base-mainnet
OP_NODE_L1_ETH_RPC: <our l1 node>
OP_NODE_L1_TRUST_RPC: false
OP_NODE_L1_BEACON: <our l1 node>
OP_NODE_L1_BEACON_IGNORE: false
OP_NODE_ROLLUP_LOAD_PROTOCOL_VERSIONS: true
OP_NODE_SYNCMODE: execution-layer
here's a chart showing the block number and sync improvement when we moved to i7ie
I'm wondering whether your peer availability is what's limiting your throughput/speed. It would be helpful if you could provide per-stage timings; I see that Execution, for example, also takes some time, but yes, the Merkle stage is the longest. That makes the node fall behind forever.
FYI:
I tried the following on commit a38c991c363d241894867a89324b8670be2f6a44:
pub const MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD: u64 = 100_000;
Self::Execution { clean_threshold: MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD }
With MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD -> 100K, the node still does range-based syncing and performs horribly on all stages.
I'm using i7ie.12xlarge in EKS with a local disk mounted, and this is the only daemon running on the node (I also have some nodes running on io2 with very high IOPS). Only the local-disk option has worked; EBS (gp3/io2) has not.
Memory: 128 GiB CPU: 22000m
The resources are definitely underutilized:
I wonder: if the stages processed less at a time, would they end up being faster overall?
Hey everyone!
Just wanted to confirm that we are running into the same problem as the author of the issue. Currently MerkleExecute takes 2-3 hours to finish with the default threshold of 5000. We are currently using i4i and i7ie instance types from AWS.
At first we were trying to get a Base archive node in sync on EBS (gp3/io2), but it would never catch up (as described in other issues on this repo), so we switched to instances with attached storage (i4i/i7ie).
While the stage is slow, the node is able to catch up 12-24 hours after a snapshot has been applied. Based on observation, I think the default value for op-reth probably needs to be increased slightly so that the initial catch-up from a snapshot works as expected. The current time MerkleExecute takes is very close to the average time it takes for 5000 blocks to be produced on the Base blockchain.
Let me know if we can provide more info!
op-node version: v1.12.2
op-reth version: v1.3.7 (just updated, before we were on v1.3.4)
We are also seeing the same behavior which causes us issues getting nodes synced to head. I tried to increase the threshold for Merkle from 5000 to 30000 and it seems to solve the issue and the node is getting closer to be in sync.
I changed the Merkle threshold at the rightmost annotation line (blue dashed)
What is the reason for this threshold? Are there any implications on resource usage when I change this? Are there any implications on the data on the node or are all blocks, states, etc still solid when I change this? Any other reasons why I shouldn't increase this threshold?
This 30k trick worked for me too. I'm also running a local node with only 64GB of memory. To get over this hump, I also had to add a swapfile and turn off the OOM killer.
Can we expect a fix in the next release?
The way the merkle stage works right now is:
- We check the range of blocks being executed and compare it with the threshold. If it is lower than the threshold, we do an "incremental" calculation, which just updates the trie based on the state updates in that range of blocks. The incremental calculation uses lots of memory when run over a large range of blocks.
- If the range of blocks is larger than the threshold, we redo the entire root calculation, because the full root calculation has a mode that can stop and resume after committing data, allowing it to make progress over lots of data without running into memory limits.
This is the current reason for the threshold. While it's somewhat surprising to me that the algorithm is performant enough when running partly on swap, it makes sense that it's faster than clean root calculation @godsflaw
I think the path forward here is for us to implement a way for the "incremental" calculation to be actually incremental, making progress while committing to the DB and avoiding OOMs.
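The incremental-vs-clean decision described above boils down to a range check against the threshold. A hedged sketch (illustrative only, not reth's actual implementation; names are invented):

```rust
// Sketch of the merkle stage's mode selection as described in this thread:
// small ranges get an in-memory incremental trie update, larger ranges
// trigger a full clean rebuild that can checkpoint and resume.
const CLEAN_THRESHOLD: u64 = 5_000; // reth's current default

#[derive(Debug, PartialEq)]
enum MerkleMode {
    /// Apply state updates directly to the trie; fast, but memory-hungry
    /// when the block range is large.
    Incremental,
    /// Recompute the root from scratch; slow, but commits intermediate
    /// progress so it never holds the whole computation in memory.
    CleanRebuild,
}

fn choose_mode(from_block: u64, to_block: u64, threshold: u64) -> MerkleMode {
    if to_block.saturating_sub(from_block) > threshold {
        MerkleMode::CleanRebuild
    } else {
        MerkleMode::Incremental
    }
}

fn main() {
    // 6000 blocks behind head: falls onto the slow clean-rebuild path,
    // which on Base takes about as long as 5000 new blocks take to arrive.
    assert_eq!(choose_mode(0, 6_000, CLEAN_THRESHOLD), MerkleMode::CleanRebuild);
    // 4000 blocks behind: the fast incremental path.
    assert_eq!(choose_mode(0, 4_000, CLEAN_THRESHOLD), MerkleMode::Incremental);
}
```

This also makes the "stuck just beyond the threshold" behavior visible: raising the threshold (e.g. to 30000) simply widens the band in which the fast path applies, at the cost of more memory during the incremental update.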
We're experiencing the same issue, and applying the 30k patch didn't help in our case. We're constantly lagging 7k blocks behind, with the MerkleExecute stage taking ~3-4h per run.
Is there any way to easily determine if the node is actually using the incremental calculation, or still applying the clean root calc?
I just experienced the same issue on an Epyc 7773X with 4x SN850X 8TB NVMe drives. The hardware should be more than enough to handle any blockchain node.
When I was using the default config from the Base RPC snapshot on docs.base.org, the sync would be stuck ~5000 blocks behind live, with MerkleExecute taking 3 hours, which is also roughly how long 5000 blocks take to be produced.
I fixed the issue by changing the following settings in reth.toml
[stages.merkle]
clean_threshold = 30000
After that, the ~5000 block MerkleExecute stage that previously took 3 hours finished in 8 minutes, and the node was able to catch up to live quickly after that.
I looked into the code, and I actually don't think changing MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD would fix it: that constant is only used in the debug merkle command, specifically in MerkleStage::default_execution(), which is not what's used during node sync.
Instead, I think we should target the default setting passed to MerkleStage::new_execution() in crates/stages/stages/src/sets.rs. The default setting for this comes from https://github.com/paradigmxyz/reth/blob/4a6b2837e6c1662d2b21a06638b5e1228d986c6e/crates/config/src/config.rs#L322.
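For reference, a minimal Rust sketch of that config plumbing, assuming the behavior described in this thread: the stage reads clean_threshold from the [stages.merkle] section of reth.toml, and the built-in default is 5000. The struct shape here is illustrative; the real definition lives in crates/config/src/config.rs:

```rust
// Illustrative stand-in for reth's merkle stage config. The only point
// being made: the value the node actually syncs with comes from config,
// not from the MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD constant.
#[derive(Debug, Clone, Copy)]
struct MerkleConfig {
    /// Block ranges larger than this trigger a full clean root rebuild.
    clean_threshold: u64,
}

impl Default for MerkleConfig {
    fn default() -> Self {
        Self { clean_threshold: 5_000 }
    }
}

fn main() {
    // Equivalent of overriding it in reth.toml:
    //   [stages.merkle]
    //   clean_threshold = 30000
    let patched = MerkleConfig { clean_threshold: 30_000 };
    assert_eq!(MerkleConfig::default().clean_threshold, 5_000);
    assert!(patched.clean_threshold > MerkleConfig::default().clean_threshold);
}
```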
I fixed the issue by changing the following settings in reth.toml
I also noticed this after unsuccessfully setting MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD in the code, and increasing the clean_threshold to 30k finally let our nodes sync as well.
In the process I however noticed another item that I would consider a bug:
When restarting a node that was doing the clean root calc with the modified setting, it would continue where it left off, but it would then do the incremental sync not over the range between snapshot and target (as used across all other stages), but over the range between the latest root-calc checkpoint and target, i.e. a potentially much larger block range than the specified 30k blocks (at least that's my interpretation, without crunching through the details in the code). In one instance, this resulted in the node ending up in an OOM loop.
I think it would be better to consider the current MerkleExecute range on restarts and re-evaluate which method should be applied, rather than only looking at the range to be covered by the stages.
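The suggestion above could look something like this sketch (all names hypothetical, not reth's actual API): on restart, re-derive the remaining merkle range from the stage's own checkpoint and re-pick the method from that, instead of from the pipeline-wide range:

```rust
// Remaining work for the merkle stage, measured from its own checkpoint.
// After an interrupted clean rebuild this can be far larger than the
// 30k-block pipeline range, which is what caused the reported OOM loop
// when the incremental path was chosen anyway.
fn remaining_merkle_range(stage_checkpoint: u64, target: u64) -> u64 {
    target.saturating_sub(stage_checkpoint)
}

/// Re-evaluate incremental vs clean rebuild using the *remaining* range.
fn should_rebuild_on_restart(stage_checkpoint: u64, target: u64, threshold: u64) -> bool {
    remaining_merkle_range(stage_checkpoint, target) > threshold
}

fn main() {
    // Interrupted clean rebuild: 200k blocks still uncommitted, so the
    // safe choice is to resume the rebuild, not go incremental.
    assert!(should_rebuild_on_restart(0, 200_000, 30_000));
    // Nearly caught up after resuming: incremental is safe again.
    assert!(!should_rebuild_on_restart(195_000, 200_000, 30_000));
}
```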
I am also in an OOM loop. I tried both the MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD and clean_threshold fixes.
Note that I never reached the end of the stages. It crashed with an OOM at MerkleExecute while syncing from the official snapshot from the Base website.
Last log:
{"timestamp":"2025-05-14T05:29:13.099333Z","level":"INFO","fields":{"message":"Status","connected_peers":33,"stage":"MerkleExecute","checkpoint":30078351,"target":"30190674"}}
i3en.2xlarge (64GB RAM, local NVMe), reth v1.3.12, Base blockchain
@phil-k1 Did you manage to fix the OOM loop?
Chiming in to comment that I was affected by this issue in Base as well. Fixed by:
[stages.merkle]
clean_threshold = 30000
The fix was to set MERKLE_STAGE_DEFAULT_CLEAN_THRESHOLD / clean_threshold back to the default, let the current iteration of MerkleExecute complete, then shut down the node before the next MerkleExecute iteration started and set clean_threshold = 30000 again.
BTW we're working on this here: https://github.com/paradigmxyz/reth/pull/16178
Which should allow people to make progress on merkle stage on base with no oom loops and a much higher clean threshold
when's this releasing? can test on our replicas
@exp0nge likely early next week, testing would be greatly appreciated! We've only tested on our own boxes which are fairly fast, so it would be great for people who are experiencing this now to check that main now fixes the issue.
I'm trying this on:
- i7ie.12xlarge
- Memory Limit + Requests: 128Gi
- CPU Requests: 800m
- Local disk
- Build commit: 67e3c11
- Args:
op-reth node -vvv --log.stdout.format log-fmt --ws --ws.port=8546 --ws.addr=0.0.0.0 --ws.origins=* --ws.api=debug,eth,net,txpool --http --http.port=8545 --http.addr=0.0.0.0 --http.corsdomain=* --authrpc.addr=0.0.0.0 --authrpc.jwtsecret=/mnt/secrets/jwt --authrpc.port=8551 --verbosity --rollup.disable-tx-pool-gossip --port=30303 --rollup.sequencer-http=https://mainnet-sequencer.base.org --datadir=/mnt/data/datadir/reth --chain=base --http.api=eth,net,debug,net,txpool --metrics=0.0.0.0:6060 --rollup.discovery.v4
I get logs around (trailing ~2 hrs):
ts=2025-06-20T17:14:56.0438006Z level=info target=reth::cli message=Status connected_peers=9 stage=MerkleExecute checkpoint=31808643 target=31824255
There was 1 OOMKilled event, so I will try more memory, but what is the expected target?
@exp0nge hmm, there should be no OOM event on 128G of ram, this should take max 12G, recommended 16G. Can you show any grafana dashboards, specifically the jemalloc memory section, the Sync progress (stage progress in %) and Sync progress (stage progress as highest block number reached) sections? As well as any debug logs from around that time, especially sync::stages::merkle::exec
Could you also share the reth.toml?
@exp0nge thanks, I think that memory usage chart may be misleading due to our usage of mmap, ie, I'm not sure that when it reaches 128G, that it would always cause an OOM. But thank you for the info. Just to confirm, outside of the OOM, is the node making progress? How does it compare to before?
This replica was not making progress fast enough; it always ended up lingering ~2 hours behind. However, another replica I have did make progress even before this PR, so I just reset the "bad" one to see if the same issue persists using the PR you provided.
@exp0nge sounds good, keep me up to date!
I have seen the OOM issue on earlier versions of reth when the SSD is too slow; something builds up and doesn't clear like it should. I don't think your i7i instance will cut it, so check the actual read/write speeds. I can recommend dedicated instances at Hetzner or OVH instead; you get proper NVMe drives there.