E3: Pruning takes way too much time on a weak machine
So
if err := e.executionPipeline.RunPrune(e.db, tx, initialCycle); err != nil {
return err
}
takes 24 seconds to execute, which keeps us permanently 3 blocks behind. It could be due to state pruning, in which case:
pruneTimeout := 250 * time.Millisecond
if s.CurrentSyncCycle.IsInitialCycle {
pruneTimeout = 12 * time.Hour
}
if _, err = tx.(*temporal.Tx).AggTx().(*libstate.AggregatorRoTx).PruneSmallBatches(ctx, pruneTimeout, tx); err != nil { // prune part of retired data, before commit
return err
}
There is some unexpected behaviour here.
There are many things we can do:
- replace the 250ms timeout with a heuristic like: 4sec and tx.DirtySpace() < 64mb
- do not fsync prune, fsync only exec. E2 did it. (it's ok to fsync every 2nd rwtx in mdbx)
- E3 has +1 fsync compared to E2: rawdb.WriteLastNewBlockSeen (maybe it is impacting chain-tip also)
- re-visit prune: make it do fewer random deletes on chain-tip
- delete prefixes instead of individual keys
- reduce chaindata size (InvertedIndexes schema)
- non-NVMe drives may benefit from a bigger pageSize (fsync will be faster because
- etc...
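The first idea in the list (a 4sec budget combined with a dirty-space cap) could look roughly like the sketch below. This is not Erigon code: `shouldKeepPruning`, `pruneWithBudget`, and the two callbacks are hypothetical names, and `dirtySpace` only stands in for whatever `tx.DirtySpace()` reports in the real transaction.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed limits, taken from the suggestion above: 4 seconds and 64 MB of dirty space.
const (
	maxPruneTime  = 4 * time.Second
	maxDirtyBytes = uint64(64 << 20)
)

// shouldKeepPruning reports whether another small prune batch may run:
// both the time budget and the dirty-space budget must still have room.
func shouldKeepPruning(elapsed time.Duration, dirtyBytes uint64) bool {
	return elapsed < maxPruneTime && dirtyBytes < maxDirtyBytes
}

// pruneWithBudget runs pruneBatch until it reports done, returns an error,
// or one of the budgets is exhausted. dirtySpace is a hypothetical stand-in
// for tx.DirtySpace(). It returns the number of batches executed.
func pruneWithBudget(pruneBatch func() (bool, error), dirtySpace func() uint64) (int, error) {
	start := time.Now()
	batches := 0
	for shouldKeepPruning(time.Since(start), dirtySpace()) {
		done, err := pruneBatch()
		if err != nil {
			return batches, err
		}
		batches++
		if done {
			break
		}
	}
	return batches, nil
}

func main() {
	var dirty uint64
	remaining := 100 // pretend there are 100 batches of retired data
	batches, err := pruneWithBudget(
		func() (bool, error) {
			dirty += 1 << 20 // pretend each batch dirties 1 MB
			remaining--
			return remaining == 0, nil
		},
		func() uint64 { return dirty },
	)
	// The dirty-space cap is hit after 64 batches, well before the time limit.
	fmt.Println(batches, err) // prints "64 <nil>"
}
```

With this shape a fast disk prunes more per cycle while a slow one stops as soon as the transaction gets heavy, instead of both being cut off at a fixed 250ms.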
I will take a look
On a slow machine I sometimes see:
Prune TxLookup=1.162s Prune Execution=83ms - so I will work on a more deterministic TxLookup prune
Step1: made PruneTxLookup on chain-tip deterministic by time: https://github.com/erigontech/erigon/pull/12535
Step2: made PruneTxLookup of non-chain-tip more aggressive: https://github.com/erigontech/erigon/pull/12540
This should be good enough for a validator on a slow machine now.
Step3: measured how many pages get updated on chain-tip by PruneTxLookup: dirty_before=0B dirty_after=13416KB pruned_blks=8 pruned_txs=1249, i.e. about 10KB per key. This is expected: 1 random leaf-page + branchNodes + merge of almost-empty pages. Will work on reducing this.
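For reference, the per-key figure follows directly from the measured numbers. The snippet below only redoes that arithmetic; it assumes one TxLookup key per pruned transaction, and `dirtyKBPer` is a hypothetical helper name.

```go
package main

import "fmt"

// dirtyKBPer returns the average dirty space in KB per unit (key or block).
func dirtyKBPer(dirtyKB, units float64) float64 {
	return dirtyKB / units
}

func main() {
	const dirtyAfterKB = 13416 // dirty_after from the measurement
	const prunedTxs = 1249     // pruned_txs, assumed to be one key each
	const prunedBlks = 8       // pruned_blks

	fmt.Printf("%.1f KB per key\n", dirtyKBPer(dirtyAfterKB, prunedTxs))    // 10.7 KB per key
	fmt.Printf("%.0f KB per block\n", dirtyKBPer(dirtyAfterKB, prunedBlks)) // 1677 KB per block
}
```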