
E3: Pruning takes way too much time on a weak machine

Giulio2002 opened this issue 1 year ago · 1 comment

So

if err := e.executionPipeline.RunPrune(e.db, tx, initialCycle); err != nil {
	return err
}

takes 24 seconds to execute, which causes us to always be 3 blocks behind. It could be due to state pruning, in which case:

pruneTimeout := 250 * time.Millisecond
if s.CurrentSyncCycle.IsInitialCycle {
	pruneTimeout = 12 * time.Hour
}
// prune part of retired data, before commit
if _, err = tx.(*temporal.Tx).AggTx().(*libstate.AggregatorRoTx).PruneSmallBatches(ctx, pruneTimeout, tx); err != nil {
	return err
}

There is some unexpected behaviour here.

Giulio2002 avatar Oct 25 '24 22:10 Giulio2002

There are many things we can do:

  • replace the 250ms limit with a heuristic like: up to 4s, as long as tx.DirtySpace() < 64MB
  • don't fsync prune; fsync only exec. E2 did this (it's OK to fsync every 2nd rwtx in MDBX)
  • E3 has +1 fsync compared to E2: rawdb.WriteLastNewBlockSeen (maybe it impacts chain-tip too)
  • re-visit prune: make it do fewer random deletes at chain-tip
  • delete prefixes instead of individual keys
  • reduce chaindata size (InvertedIndexes schema)
  • a non-NVMe drive may benefit from a bigger pageSize (fsync will be faster)
  • etc...

I will take a look

AskAlexSharov avatar Oct 26 '24 03:10 AskAlexSharov

On a slow machine I sometimes see: Prune TxLookup=1.162s Prune Execution=83ms - so I will work on a more deterministic TxLookup prune

AskAlexSharov avatar Oct 29 '24 02:10 AskAlexSharov

Step 1: made PruneTxLookup on chain-tip deterministic by time: https://github.com/erigontech/erigon/pull/12535
Step 2: made PruneTxLookup off chain-tip more aggressive: https://github.com/erigontech/erigon/pull/12540
This should be good enough for a validator on a slow machine now.

Step 3: measured "how many pages get updated at chain-tip by PruneTxLookup": dirty_before=0B dirty_after=13416KB pruned_blks=8 pruned_txs=1249 - about 10KB per key. It's expected: 1 random leaf page + branch nodes + merge of almost-empty pages. Will work on reducing this.
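The measurement above can be reproduced with a small wrapper that samples dirty space around the prune call, then divides by the number of pruned keys (13416KB / 1249 txs ≈ 10.7KB per key). A minimal sketch, assuming only a `DirtySpace()` byte counter like the `tx.DirtySpace()` mentioned earlier; `measurePrune` and `fakeTx` are hypothetical:

```go
package main

import "fmt"

// dirtySpacer is the minimal surface needed: anything exposing a
// DirtySpace() byte counter.
type dirtySpacer interface{ DirtySpace() uint64 }

// measurePrune runs prune() and returns how many bytes of pages it dirtied,
// mirroring the dirty_before/dirty_after numbers in the log line above.
func measurePrune(tx dirtySpacer, prune func() error) (uint64, error) {
	before := tx.DirtySpace()
	if err := prune(); err != nil {
		return 0, err
	}
	return tx.DirtySpace() - before, nil
}

// fakeTx replays the measured run: pruning dirties 13416KB total.
type fakeTx struct{ dirty uint64 }

func (t *fakeTx) DirtySpace() uint64 { return t.dirty }

func main() {
	tx := &fakeTx{}
	dirtied, _ := measurePrune(tx, func() error {
		tx.dirty += 13416 << 10 // the dirty_after value from the log
		return nil
	})
	const prunedTxs = 1249 // pruned_txs from the log
	fmt.Printf("%.1fKB per key\n", float64(dirtied>>10)/prunedTxs) // prints 10.7KB per key
}
```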

AskAlexSharov avatar Oct 30 '24 08:10 AskAlexSharov