
Extremely slow startup after restarting node during syncing from scratch

Open poszu opened this issue 8 months ago • 1 comments

Description

The node starts very slowly when it is restarted while syncing from genesis. It hangs while warming up the in-memory ATX cache because it loads every ATX since genesis into memory.

💡 The cache warmup code loads all ATXs starting from the last applied epoch. When syncing from genesis, that is epoch 0, because we don't sync and apply layers until ATX sync is completed.

💡 To fix it, the code could sync layers in epoch X immediately once all ATXs for epochs 0 - X are synced.

Steps to reproduce

  1. start a fresh node to start syncing from genesis
  2. wait until it syncs a few epochs (e.g. 15)
  3. stop it
  4. start it again

Actual Behavior

Node startup hangs on the ATX cache warmup.

Expected Behavior

Node should warm up quickly and continue syncing.

Environment

irrelevant

Additional Resources

none

poszu, Mar 24 '25 09:03

The problem doesn't only manifest when restarting a node that is syncing from genesis. The underlying problem is that layers aren't applied until the node considers itself "ATX synced". As a result, the warmup loads all ATXs published in epochs since the epoch of the last applied layer (epoch 0 when syncing from genesis).

I believe the code that needs to be changed is this section here: https://github.com/spacemeshos/go-spacemesh/blob/a0d02ff81d7c06c9c2804ecac4937d2c1444a571/syncer/state_syncer.go#L34-L41

Instead of waiting until all ATXs are synced and only then processing all layers, the code could wait until ATXs are synced for at least one more epoch than the epoch of s.getLastSyncedLayer(), and then process the layers of that epoch.

Possibly other places in the code need to be adjusted as well.

fasmat, Mar 24 '25 10:03