
[bloatnet] experiment: change StepSize

Open taratorio opened this issue 4 months ago • 50 comments

  • tool to allow us to change our snapshot files from one step size to another - can be achieved just by renaming the files (as long as step sizes are multiples of 2) and deleting chaindata (see the rename sketch after this list)
  • can be fully automated, i.e. on startup we can have a flag for step size -> detect what our current step size is in existing files -> apply the necessary renames -> delete chaindata -> continue
  • this should make it easy to experiment with currentStepSize/2, currentStepSize/4, currentStepSize/8, currentStepSize/16, etc. and gather metrics to assess performance differences
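
A minimal sketch of the rename mapping, assuming state snapshot file names embed step ranges like v1-accounts.0-64.kv; the function and the naming pattern here are illustrative, not the actual tool:

package main

import (
    "fmt"
    "regexp"
    "strconv"
)

// matches "prefix.from-to.ext", e.g. "v1-accounts.0-64.kv" (assumed layout)
var stepRangeRe = regexp.MustCompile(`^(.+)\.(\d+)-(\d+)\.(\w+)$`)

// rescaleName maps a file produced with oldStep to the name it would have
// under newStep. Only exact divisors are supported, otherwise the range
// boundaries would no longer fall on step boundaries.
func rescaleName(name string, oldStep, newStep uint64) (string, error) {
    if newStep == 0 || oldStep%newStep != 0 {
        return "", fmt.Errorf("%d is not an exact divisor of %d", newStep, oldStep)
    }
    m := stepRangeRe.FindStringSubmatch(name)
    if m == nil {
        return "", fmt.Errorf("unexpected file name: %s", name)
    }
    factor := oldStep / newStep
    from, _ := strconv.ParseUint(m[2], 10, 64)
    to, _ := strconv.ParseUint(m[3], 10, 64)
    return fmt.Sprintf("%s.%d-%d.%s", m[1], from*factor, to*factor, m[4]), nil
}

func main() {
    // halving the step size doubles every step index in the name
    fmt.Println(rescaleName("v1-accounts.0-64.kv", 1_562_500, 781_250)) // v1-accounts.0-128.kv <nil>
}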

taratorio avatar Aug 21 '25 16:08 taratorio

taratorio added Imp2 as Importance

VBulikov avatar Aug 21 '25 16:08 VBulikov

some status update here:

  • did the initial script - that was the easy part.
  • besides renaming, it requires some small adjustments to constants in config3.go.

weird stuff that needs to be done:

  • "delete chaindata" seems to be broken somehow in recent main: https://github.com/erigontech/erigon/issues/17383
    • unrelated to this one, but it made me waste some time. Need to track it down anyway because rebasing steps requires cleaning up chaindata
  • running against old code, execution seems to restart from 0, which means there may be additional hardcoded assumptions in the code about step geometry that I'm going to debug.

wmitsuda avatar Oct 08 '25 12:10 wmitsuda

there is actually a limitation to this approach with the current default step size: it is defined as 1_562_500, which can be divided by 4, but not by 8, etc...

we can use the rename approach to, let's say, divide the step by 20, so 1 becomes 20 in the filenames, but our "pyramid-merging" algorithm may go crazy.

I actually want to divide by 20 for the bloatnet experiment since the gas limit there went 30M -> 500M. Let me see if we can ignore eventual background merging errors, but we may want to revisit the "pyramid-merging" algorithm or make a "not-just-rename-but-total-rewrite" step rebasing implementation.
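
For reference, a quick check of which divisors keep the default 1_562_500 step size integral (just the arithmetic from the paragraph above):

package main

import "fmt"

func main() {
    const defaultStepSize = 1_562_500
    for _, d := range []uint64{2, 4, 8, 16, 20} {
        fmt.Printf("/%-2d -> %7d txns per step, exact=%v\n", d, defaultStepSize/d, defaultStepSize%d == 0)
    }
    // /2 -> 781250 exact, /4 -> 390625 exact, /8 and /16 -> not exact, /20 -> 78125 exact
}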

wmitsuda avatar Oct 09 '25 20:10 wmitsuda

I wrote my conclusions for the first round of tests: https://hackmd.io/@wmitsuda/BkFmxXL6ge

TL;DR: I didn't notice an improvement in execution time; aggressive step size reduction was effective at limiting chaindata growth, though.

Happy to hear about improvements in the methodology if we want to do another round of tests, or we can consider this done for bloatnet purposes.

cc: @AskAlexSharov

wmitsuda avatar Oct 10 '25 06:10 wmitsuda

I wrote my conclusions for the first round of tests: https://hackmd.io/@wmitsuda/BkFmxXL6ge

TL;DR: I didn't notice an improvement in execution time; aggressive step size reduction was effective at limiting chaindata growth, though.

Happy to hear about improvements in the methodology if we want to do another round of tests, or we can consider this done for bloatnet purposes.

cc: @AskAlexSharov

I see in the experiments you only ran for ~1700 blocks? For me that was not enough to see the bloat and the drastic degradation. The bloat used to happen after letting the node run for ~1 day, which is a lot more than 1700 blocks. Also, it really depends which of the bloated blocks you executed. I suggest executing everything from 22600001 (the fork point or earlier, it doesn't have to be exact) to 22877864 (slot https://dora.perf-devnet-2.ethpandaops.io/slot/80112), which is where the bloating stopped (as per the screenshot below) - if you go on Dora you will see that all blocks before it are pretty much at 500MGas. Also curious: what --loop.sync.block.limit did you use in your experiment?

[Image: Dora screenshot of block gas usage on perf-devnet-2 around slot 80112]

taratorio avatar Oct 10 '25 06:10 taratorio

oh, nice.

worst case of chaindata size gets visible after pruning a couple of steps from the db (after a couple of new files are produced)

my head expected a 2x reduction with stepSize/2 :-) plz add this to the hackmd (in any case):

./build/bin/mdbx_stat -efa /erigon-data/chaindata/   | awk '
    BEGIN { pagesize = 4096 }
    /^  Pagesize:/ { pagesize = $2 }
    /^Status of/ { table = $3 }
    /Branch pages:/ { branch = $3 }
    /Leaf pages:/ { leaf = $3 }
    /Overflow pages:/ { overflow = $3 }
    /Entries:/ {
      total_pages = branch + leaf + overflow
      size_gb = (total_pages * pagesize) / (1024^3)
      printf "%-30s %.3fG\n", table, size_gb
    }
  ' | grep -v '0.000G'

AskAlexSharov avatar Oct 10 '25 07:10 AskAlexSharov

Also, about https://github.com/erigontech/erigon/issues/16765#issuecomment-3388534767: I remember the chaindata DB would then go up to 500GB when you left it for ~1 day or more. We did some improvements since then around ErrLoopExhausted and the EL downloader adhering to sync.loop.block.limit; however, I suspect those would not be enough and the step size reduction would help.

taratorio avatar Oct 10 '25 07:10 taratorio

500GB - it's clearly when you get into a dead-loop: DB > RAM -> mdbx disables ReadAhead -> prune gets slower -> the amount of steps in the db grows with time. But I would advise skipping this corner case for now and assuming chaindata fits in RAM (because we can likely reduce the db size: smaller step, optimized schema, compression of commitment history in the db, writing non-reorgable data outside of the db, etc... and if db > RAM then rm -rf datadir/chaindata).

AskAlexSharov avatar Oct 10 '25 08:10 AskAlexSharov

500GB - it's clearly when you get into a dead-loop: DB > RAM -> mdbx disables ReadAhead -> prune gets slower -> the amount of steps in the db grows with time. But I would advise skipping this corner case for now and assuming chaindata fits in RAM (because we can likely reduce the db size: smaller step, optimized schema, compression of commitment history in the db, writing non-reorgable data outside of the db, etc... and if db > RAM then rm -rf datadir/chaindata).

yes, agree. I'm just saying that maybe step 2 of the experiment can be to see what happens when we leave it running for longer on the crazy blocks - would we still reach 500GB? If yes, why? What causes that?

also, about case 3 in the experiment - reducing the step size to 78125 txns: that conflicts with the unwind/reorg depth of 512 blocks that we support at the moment in the worst case scenario (see the arithmetic sketch after this list):

  • a 60MGas block can have, in the worst case, 60MGas / 21,000 gas = ~3000 txns (plain ETH transfers)
  • 78125 / 3000 = ~26 blocks that we will be able to unwind in the worst case
  • because we can't unwind past the frozen files at the moment, we need to make sure we keep enough txns in the DB
  • so for the worst case that means we will need to keep 3000 * 512 = 1,536,000 txns in the DB to handle such unwinds (which is pretty much the step size now) - 1,536,000 txns / 78125 step size = ~20 steps that we will have to keep in the DB for unwinding in case 3.
  • the worst cases look even worse when we go to 85MGas or 100MGas blocks
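
The same back-of-the-envelope arithmetic as a small program; 21,000 gas per plain transfer and the 512-block reorg depth are the assumptions taken from the list above (the thread rounds 60M/21k up to 3000 txns/block, which gives the ~26-block / ~20-step figures; exact integer division gives slightly smaller numbers):

package main

import "fmt"

func main() {
    const (
        gasPerTransfer = 21_000 // cheapest txn: plain ETH transfer
        maxReorgDepth  = 512    // unwind depth supported at the moment
        stepSize       = 78_125 // case 3: 1_562_500 / 20
    )
    for _, gasLimit := range []uint64{60_000_000, 85_000_000, 100_000_000} {
        txnsPerBlock := gasLimit / gasPerTransfer             // worst case
        blocksUnwindable := uint64(stepSize) / txnsPerBlock   // with only 1 step left in the DB
        txnsToKeep := txnsPerBlock * maxReorgDepth            // needed for a full-depth unwind
        stepsToKeep := (txnsToKeep + stepSize - 1) / stepSize // ceil
        fmt.Printf("%dMGas: %d txns/block -> unwind %d blocks per step, keep %d txns (~%d steps)\n",
            gasLimit/1_000_000, txnsPerBlock, blocksUnwindable, txnsToKeep, stepsToKeep)
    }
}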

taratorio avatar Oct 10 '25 08:10 taratorio

@wmitsuda @taratorio maybe this one is related: https://github.com/erigontech/erigon/issues/16918

AskAlexSharov avatar Oct 10 '25 12:10 AskAlexSharov

@wmitsuda @taratorio maybe this one is related: #16918

yes, I think it is related

stepSize <> maxReorgDepth <> stepsToKeepInDB (dbSize) are 3 variables that are related (it's like yet another trilemma)

taratorio avatar Oct 10 '25 12:10 taratorio

Always can deep-reorg by: rm chaindata, erigon snapshots rm-state-files --latest, restart erigon. Maybe it will even be faster :-)

Actually, here are a couple of tricks if we want to support deep re-orgs:

  • can write non-reorgable data out of mdbx:
    • Example 1: the Transactions table is AutoIncrement-based and a re-org doesn't update it (no deletes of recent non-canonical blocks). So we can write it to an append-only file.
    • Example 2: if we store non-canonical versions of some data, it's likely also non-reorgable. Receipts of non-canonical blocks can be stored under a blockNum+blockHash key - then we don't need to delete them on re-org (see the sketch after this list).
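
A minimal sketch of Example 2, assuming a key layout of blockNum (big-endian) + blockHash; the helper and key scheme are hypothetical, just to illustrate why nothing needs deleting on re-org:

package main

import (
    "encoding/binary"
    "fmt"
)

// receiptsKey: blockNum first (big-endian keeps numeric ordering), then the
// block hash. Every fork variant of a block gets its own key, so a re-org
// only changes which hash the canonical index points at - no deletes needed.
func receiptsKey(blockNum uint64, blockHash [32]byte) []byte {
    k := make([]byte, 8+32)
    binary.BigEndian.PutUint64(k[:8], blockNum)
    copy(k[8:], blockHash[:])
    return k
}

func main() {
    var canonical, orphan [32]byte
    canonical[0], orphan[0] = 0x01, 0x02
    // both versions of block 22_877_864 coexist under different keys
    fmt.Printf("%x\n%x\n", receiptsKey(22_877_864, canonical), receiptsKey(22_877_864, orphan))
}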

AskAlexSharov avatar Oct 10 '25 12:10 AskAlexSharov

The point is to not have to do all these manual operations like rm chaindata, erigon snapshots rm-state-files --latest, restart erigon. If a bad event on mainnet occurs and there is a chain split that causes a long reorg that we can't handle, then all Erigon nodes will crash. In that case we make the blockchain less secure and do not contribute to it in a good way when it's most needed. I think the problem we have here is that we don't have a clear and reasonable number for how many blocks each client on the blockchain must support in case of bad events on mainnet. Maybe we should start there and clarify expectations with other client teams (e.g. https://github.com/erigontech/erigon/issues/17070).

taratorio avatar Oct 10 '25 13:10 taratorio

BTW, I'm rerunning this experiment, each case running for 12 hours (just in case), starting today over the weekend, plus collecting the additional data Alex asked for.

I see in the experiments you only ran for ~1700 blocks? For me that was not enough to see the bloat and the drastic degradation. The bloat used to happen after letting the node run for ~1 day, which is a lot more than 1700 blocks. Also, it really depends which of the bloated blocks you executed. I suggest executing everything from 22600001 (the fork point or earlier, it doesn't have to be exact) to 22877864 (slot https://dora.perf-devnet-2.ethpandaops.io/slot/80112), which is where the bloating stopped (as per the screenshot below) - if you go on Dora you will see that all blocks before it are pretty much at 500MGas. Also curious: what --loop.sync.block.limit did you use in your experiment?

I ran them over a backup of the shadowfork already positioned in the bloatnet block range. Those 1700 blocks were executed in a 30 min timeframe and they are already among the slow blocks; that's why I imagined 30 min would be representative.

I also ran without --loop.sync.block.limit, just to see what happened + I was not sure what effect (positive or negative) it would have, so I just went the regular way.

That backup, which I built by running bloatnet on my machine, already had a 300GB chaindata, which I got by running it for several days.

also, about case 3 in the experiment - reducing the step size to 78125 txns: that conflicts with the unwind/reorg depth of 512 blocks that we support at the moment in the worst case scenario:

  • a 60MGas block can have, in the worst case, 60MGas / 21,000 gas = ~3000 txns (plain ETH transfers)
  • 78125 / 3000 = ~26 blocks that we will be able to unwind in the worst case
  • because we can't unwind past the frozen files at the moment, we need to make sure we keep enough txns in the DB
  • so for the worst case that means we will need to keep 3000 * 512 = 1,536,000 txns in the DB to handle such unwinds (which is pretty much the step size now) - 1,536,000 txns / 78125 step size = ~20 steps that we will have to keep in the DB for unwinding in case 3.
  • the worst cases look even worse when we go to 85MGas or 100MGas blocks

yeah, I imagined aggressive step size reduction is limited by how far it would allow unwinds; my point in doing stepSize/20 was just to measure the effects for this experiment, not to evaluate whether it is doable in practice. Let's just collect numbers for now.

wmitsuda avatar Oct 10 '25 21:10 wmitsuda

--loop.sync.block.limit has a default value. A smaller value means more frequent flushes to the DB and more frequent prune. More frequent prune can reduce db size (pruned pages become available for re-use); more frequent flush can increase db size (because updates of InvertedIndex are random, and random updates leave pages with more free space - the % of such free space is configured by the MDBX_opt_merge_threshold_16dot16_percent param).

But pruned pages can be re-used only after the rwtx is committed (actually maybe even after a couple of rwtx.Commit calls - because mdbx has 3 meta-pages).

AskAlexSharov avatar Oct 11 '25 05:10 AskAlexSharov

BTW, I'm rerunning this experiment, each case running for 12 hours (just in case), starting today over the weekend, plus collecting the additional data Alex asked for.

1 more scenario I need to finish for the 2nd round; I should have it finished by tomorrow.

wmitsuda avatar Oct 13 '25 14:10 wmitsuda

results of round 2: https://hackmd.io/@wmitsuda/B1aTwnwalg

TL;DR: stepSize/4 didn't make much difference. stepSize/20 got a bigger chaindata and performance went horribly wrong. I wonder if I should redo stepSize/20 before the next round of tests.

For round 3 of tests I'm thinking about doing the same as round 2 (a 12-hour run for each scenario) + --loop.sync.block.limit = 16.

wmitsuda avatar Oct 14 '25 22:10 wmitsuda

@wmitsuda CommitmentVals 125.140G - what does integration print_stages report for stepsInDb? (I guess the code needs to use rawdbhelpers.IdxStepsCountV3(applyTx), because integration will use the wrong step size)

AskAlexSharov avatar Oct 15 '25 02:10 AskAlexSharov

CommitmentVals 66.288G - it's not a History table, it's a Domain table! It's also ~3x bigger than StorageVals 18.608G - which is very useful info...

AskAlexSharov avatar Oct 15 '25 02:10 AskAlexSharov

Your experiment with StepSize doesn't go well - because Prune doesn't prune anything for any of the step sizes (because chain data > RAM).

So, first we need to make sure chain data < 50% of RAM; then we can look at the size of 1 step of data.

Possible options:

  • can manually force pruning with the erigon snapshots retire command (and then run mdbx_stat to see how much data is left in the db).
  • can run with a smaller --loop.sync.block.limit (it will give a chance for prune to run more often).
  • but prune must be more aggressive. Now:
quickPruneTimeout := 250 * time.Millisecond
if s.ForwardProgress > cfg.syncCfg.MaxReorgDepth && !cfg.syncCfg.AlwaysGenerateChangesets {
    // (chunkLen is 8Kb) * (1_000 chunks) = 8mb
    // Some blocks on bor-mainnet have 400 chunks of diff = 3mb
    var pruneDiffsLimitOnChainTip = 1_000
    pruneTimeout := quickPruneTimeout
    if s.CurrentSyncCycle.IsInitialCycle {
        pruneDiffsLimitOnChainTip = math.MaxInt
        pruneTimeout = time.Hour
    }
    // ... (rest of the prune call elided in this excerpt)
}

in this code I don't understand:

  • what is the value of s.CurrentSyncCycle.IsInitialCycle if we run with --loop.sync.block.limit=16?
  • quickPruneTimeout := 250 * time.Millisecond - let's add an env variable to increase it
  • pruneDiffsLimitOnChainTip = 1_000 - same. So let's be more aggressive with pruning on bloatnet even if the chain tip may suffer a bit - it will give us a cleaner picture of what the chaindata size can be (a sketch of the env override follows this list).
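
A minimal sketch of the env-variable idea; the names ERIGON_QUICK_PRUNE_TIMEOUT / ERIGON_PRUNE_DIFFS_LIMIT and the helpers are made up for this sketch, not existing flags:

package main

import (
    "fmt"
    "os"
    "strconv"
    "time"
)

// envDuration / envInt: read an override from the environment, fall back to
// the compiled-in default on absence or parse error.
func envDuration(name string, def time.Duration) time.Duration {
    if v := os.Getenv(name); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            return d
        }
    }
    return def
}

func envInt(name string, def int) int {
    if v := os.Getenv(name); v != "" {
        if n, err := strconv.Atoi(v); err == nil {
            return n
        }
    }
    return def
}

func main() {
    // in the prune code quoted above these would replace the hardcoded values:
    quickPruneTimeout := envDuration("ERIGON_QUICK_PRUNE_TIMEOUT", 250*time.Millisecond)
    pruneDiffsLimitOnChainTip := envInt("ERIGON_PRUNE_DIFFS_LIMIT", 1_000)
    fmt.Println(quickPruneTimeout, pruneDiffsLimitOnChainTip)
}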

AskAlexSharov avatar Oct 15 '25 02:10 AskAlexSharov

@wmitsuda CommitmentVals 125.140G - what does integration print_stages report for stepsInDb? (I guess the code needs to use rawdbhelpers.IdxStepsCountV3(applyTx), because integration will use the wrong step size)

I don't have the scenario 3 chaindata anymore, because I ran it before scenario 2 and didn't back it up, unfortunately

wmitsuda avatar Oct 15 '25 02:10 wmitsuda

@wmitsuda also, here is a modified command which will also show the Garbage Collection table (FreeList, free space in the db):

./build/bin/mdbx_stat -efa ~/data/chiado33_full/chaindata/   | awk '
    BEGIN { pagesize = 4096 }
    /^  Pagesize:/ { pagesize = $2 }
    /^Status of/ { table = $3 }
    /^Garbage Collection/ { table = "GarbageCollection" }
    /Branch pages:/ { branch = $3 }
    /Leaf pages:/ { leaf = $3 }
    /Overflow pages:/ { overflow = $3 }
    /Entries:/ {
      total_pages = branch + leaf + overflow
      size_gb = (total_pages * pagesize) / (1024^3)
      printf "%-30s %.3fG\n", table, size_gb
    }
  '

AskAlexSharov avatar Oct 15 '25 02:10 AskAlexSharov

so, if nobody has an objection, I'm going to do round 3 of this experiment in the next 3 days by:

  • repeating the scenarios of round 2, a 12-hour run every night
  • using --loop.sync.block.limit 16
  • adding the new mdbx_stat command Alex asked for to the collected data.

wmitsuda avatar Oct 15 '25 23:10 wmitsuda

  • what is the value of s.CurrentSyncCycle.IsInitialCycle if we run with --loop.sync.block.limit=16?
  • quickPruneTimeout := 250 * time.Millisecond - let's add an env variable to increase it
  • pruneDiffsLimitOnChainTip = 1_000 - same. So let's be more aggressive with pruning on bloatnet even if the chain tip may suffer a bit - it will give us a cleaner picture of what the chaindata size can be.

I propose we only do --loop.sync.block.limit for now, to collect only that difference. In the meantime I'll think about those modifications, and in the next round we can incrementally modify them and compare, so we don't change many params at once and end up not knowing which knob affected what.

wmitsuda avatar Oct 15 '25 23:10 wmitsuda

and:

  • rawdbhelpers.IdxStepsCountV3()

AskAlexSharov avatar Oct 15 '25 23:10 AskAlexSharov

  • what is the value of s.CurrentSyncCycle.IsInitialCycle if we run with --loop.sync.block.limit=16?
  • quickPruneTimeout := 250 * time.Millisecond - let's add an env variable to increase it
  • pruneDiffsLimitOnChainTip = 1_000 - same. So let's be more aggressive with pruning on bloatnet even if the chain tip may suffer a bit - it will give us a cleaner picture of what the chaindata size can be.

I propose we only do --loop.sync.block.limit for now, to collect only that difference. In the meantime I'll think about those modifications, and in the next round we can incrementally modify them and compare, so we don't change many params at once and end up not knowing which knob affected what.

oke.

AskAlexSharov avatar Oct 15 '25 23:10 AskAlexSharov

and:

  • rawdbhelpers.IdxStepsCountV3()

add to collected data?

wmitsuda avatar Oct 15 '25 23:10 wmitsuda

rawdbhelpers.IdxStepsCountV3 - I think you can see it in monitoring: monitoring.erigon.io. But I don't see your previous runs in monitoring.erigon.io - maybe you forgot to set the --metrics flags?

AskAlexSharov avatar Oct 15 '25 23:10 AskAlexSharov

no, I'm not setting anything, running on my own machine

wmitsuda avatar Oct 15 '25 23:10 wmitsuda

FYI, actually, for round 3 I'm going to first run scenario 1 (no step size changes) twice: once using the current commit from Milen (~2 months ago), then moving to the current performance branch commit which I updated yesterday (it contains all of Mark's recent stuff), in order to:

1 - use more recent code
2 - detect regressions
3 - establish a new baseline

If there are no differences, I'll run scenarios 2 and 3 using the recent performance branch.

wmitsuda avatar Oct 16 '25 15:10 wmitsuda