
Nethermind 1.34.1 + Teku 25.9.3 from scratch stuck a few days behind toggling between FastSync and SnapSync

Open lostmsu opened this issue 2 months ago • 6 comments

Description: I am unable to complete a full ETH1 resync with the combination from the title. Nethermind does not appear to make any progress after completing State Ranges (or at least it reaches a high percentage and State Ranges stops appearing in the logs). It receives the most recent ForkChoice and "inserts" the new block, but then logs "Changing sync" with an older block. The pivot never advances.

Steps to Reproduce: I use a RocketPool Smart Node, which hosts Nethermind and Teku in Docker (nethermind/nethermind:1.34.1). I initiated a resync of both. Teku synced quickly via checkpoint sync and seems to be fine afterwards; Nethermind gets stuck in the problematic state.

Actual behavior: The CPU stays 95%+ idle and almost no I/O is happening, yet Nethermind does not seem to catch up and continues to report "Changing sync" for each block. If I restart it in this state, it goes back to doing State Ranges all over again.

Expected behavior: Nethermind should catch up.

Screenshots: See logs below.

Desktop (please complete the following information):

  • Operating System: Ubuntu
  • Version: 24.04
  • Installation Method: Docker
  • Consensus Client: teku:25.9.3

Additional context: The only suspicious bit in the logs is that none of the peers are "Active"; all are "Sleeping".

Logs

Received New Block: 23571216 (0x75e8ea...c9e634) | limit 45,000,000 👆 | Extra Data: beaverbuild.org
Syncing... Inserting block 23571216 (0x75e8ea...c9e634).
Snap Remaining storage: ( 97.41 %) [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Changing sync SnapSync to FastSync at pivot: 23532483 | header: 23571183 | header: 23571183 | target: 23571216 | peer: 23571215 | state: 0
Received ForkChoice: 23571216 (0x75e8ea...c9e634), Safe: 23571160 (0xe2966f...110380), Finalized: 23571128 (0x04ce2f...9d1994)
Changing sync FastSync to SnapSync at pivot: 23532483 | header: 23571184 | header: 23571184 | target: 23571216 | peer: 23571216 | state: 0
Peers: 120 | with best block: 120 | eth66 (1 %), eth67 (3 %), eth68 (68 %), eth69 (28 %) | Active: None | Sleeping: All
Snap Remaining storage: ( 97.41 %) [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Snap Remaining storage: ( 97.41 %) [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Received New Block: 23571217 (0xf8f628...dc0811) | limit 45,000,000 | Extra Data: BuilderNet (Flashbots)
Syncing... Inserting block 23571217 (0xf8f628...dc0811).
Received ForkChoice: 23571217 (0xf8f628...dc0811), Safe: 23571160 (0xe2966f...110380), Finalized: 23571128 (0x04ce2f...9d1994)

This log basically repeats every time a new block is received, and there is nothing else in the log at all.

lostmsu avatar Oct 13 '25 20:10 lostmsu

Added full log (it has color codes, so use cat or less -R): Nethermind.txt

State Ranges finishes around line 100096; progress appears to stop around line 100587.

Pivot 23532483 (whatever that is) does not appear to progress. This happened both times I attempted to sync from scratch; the log contains both attempts, and the line references above are for attempt 2.

lostmsu avatar Oct 13 '25 21:10 lostmsu

Hey @lostmsu, thanks for reaching out, and sorry you are facing that issue!

Would you mind sharing CL logs as well?

stdevMac avatar Oct 16 '25 11:10 stdevMac

Sure: Nethermind2.zip

lostmsu avatar Oct 17 '25 17:10 lostmsu

Thanks for the response! We are looking at it!

Will get back to you as soon as we have something

stdevMac avatar Oct 20 '25 11:10 stdevMac

Hello! Any idea how to resolve this? How can I help contribute?

kolonelpanik avatar Nov 22 '25 03:11 kolonelpanik

Yo! @lostmsu @stdevMac

I did make a lot of assumptions here, though, and this is for an L2 Base node that was stuck.

In my case the execution client is Nethermind, run natively, while all the beacon and L2 nodes are Docker containers. I may have misunderstood your issue, so apologies if this is off track, but I noticed that Nethermind updated a performance tuning doc 6 days ago, and I've spent the last couple of hours banging my head against this problem, having corrupted or (out of frustration) deleted my previous sync data about 5 times.

My node had already accumulated about 700 GB and was stuck at the end, flipping between sync states. I don't know if this will fix your problem, but I didn't want to delete or damage my sync data for the umpteenth time.

For anyone still hitting this issue on OP-stack-based L2s with Nethermind 1.35.2:

This metadata reset workaround worked for me (a consolidated sketch follows the list). Be sure to stop your nodes first; if you forget, you will almost certainly corrupt the database.

  1. Stop containers: docker compose down
  2. Backup metadata: cp -r /data/nethermind_db/base-mainnet/metadata /data/metadata-backup
  3. Delete metadata: rm -rf /data/nethermind_db/base-mainnet/metadata
  4. Restart: docker compose up -d
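Here is the same sequence as a small shell sketch, assuming the commenter's base-mainnet layout under /data/nethermind_db and a docker compose setup; adjust the paths to your own data directory (a RocketPool install, for example, uses different paths).

```bash
# Metadata reset sketch -- stop the containers first or you risk corrupting the DB.
docker compose down                                                      # 1. stop containers
cp -r /data/nethermind_db/base-mainnet/metadata /data/metadata-backup    # 2. back up metadata
rm -rf /data/nethermind_db/base-mainnet/metadata                         # 3. delete metadata
docker compose up -d                                                     # 4. restart; Nethermind rebuilds its sync state
```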

In my case this seemed to force Nethermind to rebuild its sync state from the existing chain data (it keeps your 700 GB+ download). I also recommend adding these performance flags:

--Sync.SnapSyncAccountRangePartitionCount=8 --Sync.TuneDbMode=DisableCompaction --Network.MaxOutgoingConnectPerSec=50 --Pruning.CacheMb=2000 --Db.StateDbWriteBufferSize=100000000
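For illustration, a minimal, hypothetical docker run sketch with those flags appended. The flag names are copied from the line above; the image tag, volume mount, --data-dir, and --config values are assumptions for a Base mainnet node, so adapt them to your setup. With docker compose, append the same flags to the Nethermind service's command instead.

```bash
# Hypothetical example -- mount path, data dir, config name, and image tag are placeholders.
docker run -d --name nethermind \
  -v /data:/nethermind/data \
  nethermind/nethermind:1.35.2 \
  --config base-mainnet \
  --data-dir /nethermind/data \
  --Sync.SnapSyncAccountRangePartitionCount=8 \
  --Sync.TuneDbMode=DisableCompaction \
  --Network.MaxOutgoingConnectPerSec=50 \
  --Pruning.CacheMb=2000 \
  --Db.StateDbWriteBufferSize=100000000
```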

Node went from oscillating at old pivot (38319681) to syncing properly with current pivot (38500xxx+) in minutes.

hope this helps

Update

Overnight it continued to reindex everything I had, and I realized I hadn't made any progress. Fearing the same issue, I went scorched earth on my node and restarted fresh with the new performance flags. So far, so good, but I do think you will ultimately need to purge your old data or let it run for weeks to see if it fixes itself.

kolonelpanik avatar Nov 22 '25 05:11 kolonelpanik