Nethermind 1.34.1 + Teku 25.9.3, syncing from scratch: stuck a few days behind, toggling between FastSync and SnapSync
Description
I am unable to complete a full ETH1 resync with the combination from the title. Nethermind does not appear to make any progress after completing State Ranges (or at least it reaches a high percentage and then stops appearing in the logs). It receives the most recent ForkChoice and "inserts" the new block, but then logs "Changing sync" with an older block. The pivot never advances.
Steps to Reproduce
I use a Rocket Pool Smart Node that hosts Nethermind and Teku in Docker (nethermind/nethermind:1.34.1). I initiated a resync of both. Teku synced quickly using a checkpoint and seems to be fine afterwards. Nethermind gets stuck in the problematic state.
Actual behavior
CPU stays 95%+ idle and almost no I/O is happening, yet Nethermind does not seem to catch up and keeps reporting "Changing sync" for every block. If I restart it in this state, it starts the State Ranges phase all over again.
Expected behavior
Nethermind should catch up.
Screenshots
See logs below.
Desktop (please complete the following information):
- Operating System: Ubuntu
- Version: 24.04
- Installation Method: Docker
- Consensus Client: teku:25.9.3
Additional context
The only suspicious bit in the logs is that none of the peers are "Active"; all are "Sleeping".
Logs
```
Received New Block: 23571216 (0x75e8ea...c9e634) | limit 45,000,000 👆 | Extra Data: beaverbuild.org
Syncing... Inserting block 23571216 (0x75e8ea...c9e634).
Snap Remaining storage: ( 97.41 %) [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Changing sync SnapSync to FastSync at pivot: 23532483 | header: 23571183 | header: 23571183 | target: 23571216 | peer: 23571215 | state: 0
Received ForkChoice: 23571216 (0x75e8ea...c9e634), Safe: 23571160 (0xe2966f...110380), Finalized: 23571128 (0x04ce2f...9d1994)
Changing sync FastSync to SnapSync at pivot: 23532483 | header: 23571184 | header: 23571184 | target: 23571216 | peer: 23571216 | state: 0
Peers: 120 | with best block: 120 | eth66 (1 %), eth67 (3 %), eth68 (68 %), eth69 (28 %) | Active: None | Sleeping: All
Snap Remaining storage: ( 97.41 %) [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Snap Remaining storage: ( 97.41 %) [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Received New Block: 23571217 (0xf8f628...dc0811) | limit 45,000,000 | Extra Data: BuilderNet (Flashbots)
Syncing... Inserting block 23571217 (0xf8f628...dc0811).
Received ForkChoice: 23571217 (0xf8f628...dc0811), Safe: 23571160 (0xe2966f...110380), Finalized: 23571128 (0x04ce2f...9d1994)
```
This log basically repeats every time a new block is received, and there's nothing else in the log at all.
Added the full log (it has color codes, so use cat or less -R): Nethermind.txt
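If your viewer chokes on the color codes, they can be stripped first with a standard sed one-liner (assuming GNU sed; the grep afterwards is just one way to locate the line numbers I reference below):

```sh
# Strip ANSI escape sequences (color codes) so the log is plain text
sed 's/\x1b\[[0-9;]*m//g' Nethermind.txt > Nethermind.plain.txt

# Then, for example, find where the State Ranges phase is mentioned
grep -n "State Ranges" Nethermind.plain.txt | tail
```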
The State Ranges phase finishes around line 100096. Progress appears to stop around line 100587.
Pivot 23532483 (whatever that is) does not appear to progress. This happened both times I attempted to sync from scratch; the log contains both attempts. The above line references are for attempt 2.
Hey @lostmsu, thanks for reaching out, and sorry you are facing that issue!
Would you mind sharing CL logs as well?
Sure: Nethermind2.zip
Thanks for the response! We are looking at it!
Will get back to you as soon as we have something
Hello! Any idea how to resolve this? hehe - how can I help contribute?
Yo! @lostmsu @stdevMac
I DID MAKE LOTS OF ASSUMPTIONS HERE THO - and this is for an L2 Base node that was stuck.
My "geth" setup is Nethermind run natively, and all beacon and L2 nodes are Docker containers in my case. I'm tired and may have misunderstood your issue, so sorry if this is off track, but I noticed that 6 days ago Nethermind updated a performance tuning doc, so over the last, idk, 2 hours I've been banging my head against this trying to solve it - partly because I've managed to corrupt my previous work, or delete it out of frustration, like 5 times.
My node had already accumulated like 700 GB and was just stuck at the end, flipping between sync states - idk if this will fix your problem. I didn't want to delete or damage my sync data for the umpteenth time.
For anyone still hitting this issue with OP-based L2s on Nethermind 1.35.2:
This metadata reset workaround worked for me. Stop your nodes first - if you forget, you will for sure corrupt the database:

- Stop containers: docker compose down
- Back up metadata: cp -r /data/nethermind_db/base-mainnet/metadata /data/metadata-backup
- Delete metadata: rm -rf /data/nethermind_db/base-mainnet/metadata
- Restart: docker compose up -d
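The same steps as a single script, for convenience (a sketch under my assumptions: the /data paths and the base-mainnet network name are from my setup and will differ on yours):

```sh
#!/usr/bin/env sh
# Metadata reset for a stuck Nethermind sync. Paths assume my Base
# mainnet layout under /data; adjust for your own node.
set -e

docker compose down    # stop containers FIRST, or you risk corrupting the DB

# Keep a copy of the metadata in case this goes wrong
cp -r /data/nethermind_db/base-mainnet/metadata /data/metadata-backup

# Deleting metadata forces Nethermind to rebuild its sync state;
# the downloaded chain data itself stays intact
rm -rf /data/nethermind_db/base-mainnet/metadata

docker compose up -d   # restart and let it re-evaluate sync progress
```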
In my case this seemed to force Nethermind to rebuild its sync state from the existing chain data (it keeps your 700 GB+ download). I also recommend adding these performance flags (see the sketch after the list for one way to pass them):
- --Sync.SnapSyncAccountRangePartitionCount=8
- --Sync.TuneDbMode=DisableCompaction
- --Network.MaxOutgoingConnectPerSec=50
- --Pruning.CacheMb=2000
- --Db.StateDbWriteBufferSize=100000000
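A hypothetical sketch of passing these when running the image directly - the image tag, mount path, and --config network are placeholders for your setup; with compose, you'd append the same flags to the service's command instead:

```sh
# Placeholder invocation: adjust the image tag, data mount, and
# --config network (base-mainnet here) for your own node.
docker run -d --name nethermind \
  -v /data:/nethermind/data \
  nethermind/nethermind:1.35.2 \
  --config base-mainnet \
  --data-dir /nethermind/data \
  --Sync.SnapSyncAccountRangePartitionCount=8 \
  --Sync.TuneDbMode=DisableCompaction \
  --Network.MaxOutgoingConnectPerSec=50 \
  --Pruning.CacheMb=2000 \
  --Db.StateDbWriteBufferSize=100000000
```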
Node went from oscillating at old pivot (38319681) to syncing properly with current pivot (38500xxx+) in minutes.
hope this helps
Update
Overnight it continued to reindex everything I had, and I realized I hadn't made any progress. Fearing the same issue, I kind of went scorched earth on my node and restarted fresh with the new performance flags. So far, SUPER GOOD - but I do think you will ultimately need to purge your old data OR let it run for, like, weeks to see if it fixes itself. haah