[bloatnet] experiment: change StepSize
- tool to allow us to change our snapshot files from one step size to another - can be achieved just by renaming the files (as long as step sizes are multiples of 2) and deleting chaindata
- can be fully automated, i.e. on startup we can have a flag for step size -> detect what our current step size is in the existing files -> apply the necessary renames -> delete chaindata -> continue
- this should make it easy to experiment with currentStepSize/2, currentStepSize/4, currentStepSize/8, currentStepSize/16, etc. and gather metrics to assess performance differences (see the rename sketch below)
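For illustration, here is a minimal Go sketch of the rename step, assuming state-file names embed step ranges as <prefix>.<fromStep>-<toStep>.<ext> (e.g. v1-accounts.0-64.kv); the exact Erigon naming, the chaindata deletion and the config3.go constant adjustments mentioned below are out of scope here:

// Hypothetical sketch: when the step size shrinks by an integer factor,
// each old step index maps to factor * oldIndex in the new numbering,
// so the step ranges embedded in file names just get multiplied.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strconv"
)

var stepRange = regexp.MustCompile(`^(.*\.)(\d+)-(\d+)(\..*)$`)

// rescaleName multiplies the step indices in a file name by factor
// (newStepSize = oldStepSize / factor).
func rescaleName(name string, factor uint64) (string, bool) {
	m := stepRange.FindStringSubmatch(name)
	if m == nil {
		return "", false
	}
	from, _ := strconv.ParseUint(m[2], 10, 64)
	to, _ := strconv.ParseUint(m[3], 10, 64)
	return fmt.Sprintf("%s%d-%d%s", m[1], from*factor, to*factor, m[4]), true
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: rescale <snapshots-dir>")
		return
	}
	dir, factor := os.Args[1], uint64(2) // factor 2 == currentStepSize/2
	files, _ := filepath.Glob(filepath.Join(dir, "*"))
	for _, f := range files {
		if newName, ok := rescaleName(filepath.Base(f), factor); ok {
			fmt.Printf("%s -> %s\n", filepath.Base(f), newName) // dry run; call os.Rename to apply
		}
	}
}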
some status update here:
- did the initial script, that was the easy part.
- besides renaming, it requires some small adjustments to constants in config3.go.
weird stuff that needs to be done:
- "delete chaindata" seems to be broken somehow in recent main: https://github.com/erigontech/erigon/issues/17383
- unrelated to this one, but it made me waste some time; need to track it down anyway because rebasing steps requires cleaning up chaindata
- running against old code, execution seems to restart from 0, which suggests there are additional hardcoded assumptions about step geometry in the code that I'm going to debug.
there is actually a limitation to this approach with the current default step size because it is defined as 1_562_500 (= 2^2 * 5^8); it can be divided evenly by 4, but not by 8, 16, etc...
we can use the rename approach to, let's say, divide the step by 20, so 1 becomes 20 in the filenames, but our "pyramid-merging" algorithm may go crazy.
I actually want to divide by 20 for the bloatnet experiment since the gas limit there went 30M -> 500M. Let me see if we can ignore eventual background-merging errors, but we may want to revisit the "pyramid-merging" algorithm or make a "not-just-rename-but-total-rewrite" step rebasing implementation.
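A tiny check of which divisors actually work for the current default, assuming we only rebase by integer divisors of the step size (1_562_500 = 2^2 * 5^8, so /2, /4 and /20 divide evenly while /8 and /16 do not):

package main

import "fmt"

func main() {
	const currentStepSize = uint64(1_562_500)
	for _, d := range []uint64{2, 4, 8, 16, 20} {
		fmt.Printf("divide by %2d: divisible=%v, new step size=%d, remainder=%d\n",
			d, currentStepSize%d == 0, currentStepSize/d, currentStepSize%d)
	}
}

Note that /20 yields 78,125, which is the step size discussed as "case 3" further down.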
I wrote my conclusions for the first round of tests: https://hackmd.io/@wmitsuda/BkFmxXL6ge
TLDR; I didn't notice improvement on execution time; aggressive step size reduction was effective in limiting chaindata growth though.
happy to hear about improvements in the methodology if we want to do another round of tests, or we consider this done for bloatnet purposes.
cc: @AskAlexSharov
I see in the experiments you only ran for ~1700 blocks? For me that was not enough to see the bloat and drastic degradation. The bloat used to happen after letting the node run for ~1 day, which is a lot more than 1700 blocks. Also it really, really depends which of the bloated blocks you executed. I suggest executing everything from 22600001 (the fork point or earlier, doesn't have to be exact) to 22877864 (slot https://dora.perf-devnet-2.ethpandaops.io/slot/80112), which is where the bloating stopped (as per the screenshot below) - if you go on Dora you will see that all blocks before it are pretty much at 500MGas. Also curious what --loop.sync.block.limit did you use in your experiment?
oh, nice.
worst case of chaindata size gets visible after pruning a couple of steps from the db (after a couple of new files are produced)
my head expected a 2x reduction with stepSize/2 :-) plz add to hackmd (for each case):
./build/bin/mdbx_stat -efa /erigon-data/chaindata/ | awk '
BEGIN { pagesize = 4096 }
/^ Pagesize:/ { pagesize = $2 }
/^Status of/ { table = $3 }
/Branch pages:/ { branch = $3 }
/Leaf pages:/ { leaf = $3 }
/Overflow pages:/ { overflow = $3 }
/Entries:/ {
total_pages = branch + leaf + overflow
size_gb = (total_pages * pagesize) / (1024^3)
printf "%-30s %.3fG\n", table, size_gb
}
' | grep -v '0.000G'
Also about https://github.com/erigontech/erigon/issues/16765#issuecomment-3388534767, I remember the chaindata DB would then go up to 500GB when you left it for ~1 day or more. We did some improvements since then around ErrLoopExhausted and the EL downloader to adhere to sync.loop.block.limit; however, I suspect those would not be enough and the step size reduction would help.
500GB - that's clearly when you hit a dead-loop: DB > RAM -> mdbx disables ReadAhead -> prune gets slower -> the number of steps in the db grows over time
But I would advise skipping this corner case for now and assuming chaindata fits in RAM (because we can likely reduce db size: smaller step, optimized schema, compression of commitment history in db, writing non-reorgable data outside of the db, etc... and if db > RAM then rm -rf datadir/chaindata)
yes, agree, I'm just saying that maybe step 2 of the experiment can be to see what happens when we leave it running for longer on the crazy blocks - would we still reach 500GB? if yes, why? what causes that?
also about case 3 in the experiment - reducing the step size to 78125 txns - that conflicts with the unwind reorg depth of 512 blocks that we support at the moment, in the worst-case scenario:
- a 60MGas block can have in the worst case 60MGas / 21,000 gas ≈ 3,000 txns (ETH transfers)
- 78,125 / 3,000 ≈ 26 blocks that we will be able to unwind in the worst case
- because we can't unwind past the frozen files at the moment, we need to make sure we keep enough txns in the DB
- so in the worst case that means we need to keep 3,000 * 512 = 1,536,000 txns in the DB to handle such unwinds (which is pretty much the current step size) - 1,536,000 txns / 78,125 step size ≈ 20 steps that we would have to keep in the DB for unwinding in case 3
- the worst cases look even worse when we go to 85MGas or 100MGas blocks (rough numbers worked out below)
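To make the arithmetic above reproducible, here is a small sketch that repeats the same worst-case estimate for a few gas limits; note the thread rounds txns/block up to 3,000, while this uses the exact division, so the numbers differ slightly:

package main

import "fmt"

func main() {
	const (
		txGas         = uint64(21_000) // plain ETH transfer
		maxReorgDepth = uint64(512)    // unwind depth supported today
		stepSize      = uint64(78_125) // "case 3": currentStepSize/20
	)
	for _, gasLimit := range []uint64{60_000_000, 85_000_000, 100_000_000} {
		txsPerBlock := gasLimit / txGas
		unwindBudgetBlocks := stepSize / txsPerBlock          // blocks covered by 1 step
		txnsToKeep := txsPerBlock * maxReorgDepth             // txns needed for a 512-block unwind
		stepsToKeep := (txnsToKeep + stepSize - 1) / stepSize // round up to whole steps
		fmt.Printf("%3dMGas: ~%d txns/block, 1 step covers ~%d blocks, keep ~%d txns (~%d steps) in DB\n",
			gasLimit/1_000_000, txsPerBlock, unwindBudgetBlocks, txnsToKeep, stepsToKeep)
	}
}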
@wmitsuda @taratorio maybe this one is related: https://github.com/erigontech/erigon/issues/16918
yes, I think it is related
stepSize <> maxReorgDepth <> stepsToKeepInDB (dbSize) are 3 related variables (it's like yet another trilemma)
We can always deep-reorg by: rm chaindata, erigon snapshots rm-state-files --latest, restart erigon. Maybe it will even be faster :-)
Actually here are a couple of tricks if we want to support deep re-orgs:
- we can write non-reorgable data outside of mdbx:
- Example 1: the Transactions table is AutoIncrement-based and a re-org doesn't update it (no deletes of recent non-canonical blocks), so we can write it to an append-only file.
- Example 2: if we store non-canonical versions of some data, it's likely also non-reorgable. Receipts of non-canonical blocks can be stored under a blockNum+blockHash key - then we don't need to delete them on re-org (sketch below).
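A minimal sketch of what Example 2 could look like - keying receipts by blockNum+blockHash so a re-org never deletes anything and non-canonical entries simply stop being referenced; the key layout here is an illustration, not Erigon's actual schema:

package main

import (
	"encoding/binary"
	"fmt"
)

// receiptKey = 8-byte big-endian blockNum || 32-byte blockHash.
// Big-endian blockNum keeps keys sorted by height, so pruning old
// heights stays a simple range delete.
func receiptKey(blockNum uint64, blockHash [32]byte) []byte {
	k := make([]byte, 8+32)
	binary.BigEndian.PutUint64(k[:8], blockNum)
	copy(k[8:], blockHash[:])
	return k
}

func main() {
	var hashA, hashB [32]byte
	hashA[0], hashB[0] = 0xaa, 0xbb // two competing blocks at the same height
	fmt.Printf("canonical:     %x\n", receiptKey(1_000_000, hashA))
	fmt.Printf("non-canonical: %x\n", receiptKey(1_000_000, hashB))
	// On a re-org from hashA to hashB nothing is deleted; readers just
	// resolve the canonical hash for the height and look up that key.
}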
The point is to not have to do all these manual operations like rm chaindata, erigon snapshots rm-state-files --latest, restart erigon. If a bad event occurs on mainnet and there is a chain split that causes a longer reorg than we can handle, then all Erigon nodes will crash. In that case we make the blockchain less secure and don't contribute to it in a good way when it's most needed. I think the problem we have here is that we don't have a clear and reasonable number for how many blocks each client on the blockchain must support in case of bad events on mainnet. Maybe we should start there and clarify expectations with other client teams (e.g. https://github.com/erigontech/erigon/issues/17070)
> I see in the experiments you only ran for ~1700 blocks? [...] Also curious what --loop.sync.block.limit did you use in your experiment?
I ran them over a backup of the shadowfork already positioned in the bloatnet block range. Those 1700 blocks were executed in a 30 min timeframe and they are already in the slow blocks, that's why I imagined 30 min would be representative.
I also ran without --loop.sync.block.limit just to see what happened + I was not sure what effect (positive or negative) that would have, so I went just the regular way.
That backup I built by running bloatnet on my machine already had a 300GB chaindata, which I got by running it for several days.
> also about case 3 in the experiment - reducing the step size to 78125 txns - that conflicts with the unwind reorg depth of 512 blocks that we support at the moment [...]
yeah, I imagined aggressive step size reduction is limited by how far it would allow unwinds; my point in doing stepSize/20 was just to measure the effects for this experiment, not to evaluate whether it is doable in practice - let's just collect numbers for now.
--loop.sync.block.limit has a default value. A smaller value means more frequent flushes to DB and more frequent prune.
more frequent prune - can reduce db size (pruned pages become available for re-use)
more frequent flush - can increase db size (because updates of InvertedIndex are random; random updates leave pages with more free space; the % of such free space is configured by the MDBX_opt_merge_threshold_16dot16_percent param)
But pruned pages can be re-used only once their rwtx is committed (actually maybe even after a couple of rwtx.Commit calls - because mdbx has 3 meta-pages).
BTW, I'm rerunning this experiment, each case running for 12 hours (just in case), starting today over the weekend + collecting the additional data Alex asked for.
1 more scenario left to finish for the 2nd round; I should have it finished by tomorrow.
results of round 2: https://hackmd.io/@wmitsuda/B1aTwnwalg
TLDR; stepSize/4 didn't make much difference. stepSize/20 got a bigger chaindata and performance went horribly wrong. I wonder if I should redo stepSize/20 before the next round of tests.
For round 3 of tests I'm thinking about doing the same as round 2 (a 12-hour run for each scenario) + --loop.sync.block.limit = 16.
@wmitsuda CommitmentVals 125.140G - what does integration print_stages show for stepsInDb? (I guess we need to use rawdbhelpers.IdxStepsCountV3(applyTx) in code because integration will use the wrong step size)
CommitmentVals 66.288G - it's not a History table! it's a Domain table!
It's also ~3X bigger than StorageVals 18.608G - which is very useful info...
Your experiment with StepSize doesn't go well because Prune doesn't prune anything for any step size (because chaindata > RAM).
So, first we need to make sure chaindata < 50% of RAM; then we can look at the data size of 1 step.
Possible options:
- can manually force pruning: erigon snapshots retire command (and then run mdbx_stat to see how much data is left in db)
- can run with smaller --loop.sync.block.limit (it will give a chance for prune to run more often)
- but prune must be more aggressive. Now:
quickPruneTimeout := 250 * time.Millisecond
if s.ForwardProgress > cfg.syncCfg.MaxReorgDepth && !cfg.syncCfg.AlwaysGenerateChangesets {
    // (chunkLen is 8Kb) * (1_000 chunks) = 8mb
    // Some blocks on bor-mainnet have 400 chunks of diff = 3mb
    var pruneDiffsLimitOnChainTip = 1_000
    pruneTimeout := quickPruneTimeout
    if s.CurrentSyncCycle.IsInitialCycle {
        pruneDiffsLimitOnChainTip = math.MaxInt
        pruneTimeout = time.Hour
    }
in this code I don't understand:
- what is the value of s.CurrentSyncCycle.IsInitialCycle if we run with --loop.sync.block.limit=16
- quickPruneTimeout := 250 * time.Millisecond - let's add an env variable to increase it (sketch below)
- pruneDiffsLimitOnChainTip = 1_000 - same
so, let's be more aggressive with pruning on bloatnet even if ChainTip may suffer a bit - it will give us a cleaner picture of what chaindata size can be.
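A minimal sketch of the env-variable override suggested above; the variable name ERIGON_QUICK_PRUNE_TIMEOUT is made up for illustration, it is not an existing Erigon setting:

package main

import (
	"fmt"
	"os"
	"time"
)

// quickPruneTimeout returns the default 250ms unless a (hypothetical)
// ERIGON_QUICK_PRUNE_TIMEOUT env variable overrides it, e.g. "5s" to
// let chain-tip pruning run more aggressively during the experiment.
func quickPruneTimeout() time.Duration {
	const def = 250 * time.Millisecond
	if v := os.Getenv("ERIGON_QUICK_PRUNE_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func main() {
	fmt.Println("quick prune timeout:", quickPruneTimeout())
}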
> @wmitsuda CommitmentVals 125.140G - what does integration print_stages show for stepsInDb? [...]
I don’t have scenario 3 chaindata anymore because I ran it before scenario 2 and didn’t backup, unfortunately
@wmitsuda also here is a modified command which will also show the Garbage Collection table (FreeList, free space in db):
./build/bin/mdbx_stat -efa ~/data/chiado33_full/chaindata/ | awk '
BEGIN { pagesize = 4096 }
/^ Pagesize:/ { pagesize = $2 }
/^Status of/ { table = $3 }
/^Garbage Collection/ { table = "GarbageCollection" }
/Branch pages:/ { branch = $3 }
/Leaf pages:/ { leaf = $3 }
/Overflow pages:/ { overflow = $3 }
/Entries:/ {
total_pages = branch + leaf + overflow
size_gb = (total_pages * pagesize) / (1024^3)
printf "%-30s %.3fG\n", table, size_gb
}
'
so, if nobody has an objection, I'm going to do round 3 of this experiment in the next 3 days by:
- Repeating the scenarios of round 2, one 12-hour run every night
- using --loop.sync.block.limit 16
- adding the new mdbx_stat command Alex asked for to the collected data
> so, let's be more aggressive with pruning on bloatnet even if ChainTip may suffer a bit - it will give us a cleaner picture of what chaindata size can be [...]
I propose we only do --loop.sync.block.limit now, to collect only that difference. In the meantime I'll think about those modifications, and next round we can modify them incrementally and compare, so we don't change many params at once and then not know which knob affected what.
and:
- rawdbhelpers.IdxStepsCountV3() - add it to the collected data?
> I propose we only do --loop.sync.block.limit now, to collect only that difference. [...]
oke.
> and: rawdbhelpers.IdxStepsCountV3() - add it to the collected data?
rawdbhelpers.IdxStepsCountV3 - I think you can see it in monitoring: monitoring.erigon.io
but I don't see your previous runs in monitoring.erigon.io - maybe you forgot to set the --metrics flags
no, I'm not setting anything, running on my own machine
FYI, actually, for round 3 I'm going to first run scenario 1 (no step size changes) twice: once using the current commit from Milen (~2 months ago), then once using the current performance branch commit which I updated yesterday (it contains all of Mark's recent stuff), in order to:
1 - use more recent code
2 - detect regressions
3 - establish a new baseline
if there are no differences, I'll run scenarios 2 and 3 using the recent performance branch.