Massive spike in DB size on Goerli
I've identified a very large spike in disk usage by our nimbus-eth1 node running on Goerli:
It appears the version running at the time was https://github.com/status-im/nimbus-eth1/commit/7f0bc71b65f763c7b5b4a4987b4a0010828a5637:
Jun 29 00:00:05 metal-01.he-eu-hel1.nimbus.eth1 build.sh[2044102]: From https://github.com/status-im/nimbus-eth1
Jun 29 00:00:05 metal-01.he-eu-hel1.nimbus.eth1 build.sh[2044102]: + 6e9eacb3...694d1c45 getLogs-endpoint -> origin/getLogs-endpoint (forced update)
Jun 29 00:00:05 metal-01.he-eu-hel1.nimbus.eth1 build.sh[2044114]: HEAD is now at 7f0bc71b add invalidMissingAncestorReOrg test case
This caused the DB to grow up to 1.6 TB:
[email protected]:~ % sudo du -hs /data/nimbus-eth1-goerli-master/data
1.6T /data/nimbus-eth1-goerli-master/data
That's over 10x more than a fully synced Geth node on Goerli:
[email protected]:~ % sudo du -hs /docker/nimbus-goerli/node/data
131G /docker/nimbus-goerli/node/data
It doesn't seem like the sst files are the issue:
[email protected]:.../nimbus/data % ls | wc -l
25932
[email protected]:.../nimbus/data % ls *.sst | wc -l
25753
[email protected]:.../nimbus/data % ls *.sst | xargs du -hsc | tail -n1
202G total
[email protected]:.../nimbus/data % du -hs .
1.6T .
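One thing worth ruling out here is sparse files or filesystem block overhead. GNU du can report the apparent (logical) size instead of allocated blocks, so a quick cross-check looks like this (a sketch, assuming GNU coreutils on the host):

du -sh /data/nimbus-eth1-goerli-master/data
du -sh --apparent-size /data/nimbus-eth1-goerli-master/data

If the two figures diverge a lot, sparseness or allocation overhead is part of the story.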
But ncdu doesn't seem to show any massive files:
65M /data/nimbus-eth1-goerli-master/data/shared_goerli_0/nimbus/data/1094254.sst
ncdu 1.14.1 ~ Use the arrow keys to navigate, press ? for help
--- /data/nimbus-eth1-goerli-master/data/shared_goerli_0/nimbus/data ----------------------------------
460.8 MiB [##########] LOG.old.1656979293418180
300.4 MiB [###### ] LOG.old.1656720086681351
122.1 MiB [## ] LOG.old.1657324897762741
90.0 MiB [# ] LOG
70.4 MiB [# ] 1095853.log
64.3 MiB [# ] 1004708.sst
64.3 MiB [# ] 999379.sst
64.3 MiB [# ] 1038531.sst
64.3 MiB [# ] 1038543.sst
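Those LOG and LOG.old.* files at the top are RocksDB info logs, not table data (the file layout here is RocksDB's, which backs the nimbus-eth1 store). Totalling them from inside the data directory shows how much they contribute:

du -ch LOG LOG.old.* | tail -n1

The numbered *.log files, by contrast, are write-ahead logs and are recycled by the DB itself.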
The ls command essentially shows the same thing:
[email protected]:.../nimbus/data % ls -lS | head
total 1648954516
-rw-r--r-- 1 nimbus staff 461M Jul 5 00:01 LOG.old.1656979293418180
-rw-r--r-- 1 nimbus staff 301M Jul 1 23:59 LOG.old.1656720086681351
-rw-r--r-- 1 nimbus staff 123M Jul 8 23:59 LOG.old.1657324897762741
-rw-r--r-- 1 nimbus staff 91M Jul 11 15:57 LOG
-rw-r--r-- 1 nimbus staff 65M Jul 8 23:56 1004708.sst
-rw-r--r-- 1 nimbus staff 65M Jul 8 20:43 999379.sst
-rw-r--r-- 1 nimbus staff 65M Jul 10 00:58 1038531.sst
-rw-r--r-- 1 nimbus staff 65M Jul 10 00:58 1038543.sst
-rw-r--r-- 1 nimbus staff 65M Jul 10 04:07 1046388.sst
Interestingly, we can also see a big spike in open files at the time:
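One way to double-check that from the host is to count the file descriptors held by the node process (a sketch; the pgrep pattern is an assumption about how the service is named):

sudo ls /proc/"$(pgrep -f nimbus-eth1 | head -n1)"/fd | wc -l

RocksDB holds many .sst files open, so a large jump here usually tracks the number of live table files.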
Running find shows only 395 GB out of the 1.6 TB:
[email protected]:.../nimbus/data % find . | xargs du -hsc | tail
65M ./1095797.sst
61M ./1095798.sst
10M ./1095799.sst
65M ./1095942.sst
65M ./1095943.sst
37M ./1095944.sst
65M ./1095945.sst
65M ./1095946.sst
26M ./1095947.sst
395G total
This is very weird.
If I try to copy the contents to the root volume I can see it takes up more than 395 GB:
sudo rsync --progress -aur /data/nimbus-eth1-goerli-master /data_new/
[email protected]:/data_new/.../data % find | xargs du -hsc
424G .
424G total
So there's something off with how I'm counting the file sizes.
Actually, if I manually add up the sizes of all files using bc, it does come out to 1.69 TB:
[email protected]:.../data % echo "scale=2;($(find . -printf '%s+')0)/(10^9)" | bc
1694.15
So there's something wrong with how du and ncdu calculate the total size.
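A plausible explanation for the low find-based totals: with ~26k files, xargs overflows the argument-list limit and splits the file list across several du invocations, each printing its own total line, and tail -n1 only shows the last batch's total. Counting the total lines makes this visible, and GNU du can read the file list from stdin to sidestep the batching entirely (a sketch, assuming GNU findutils/coreutils):

find . | xargs du -hsc | grep -c total
find . -type f -print0 | du -ch --files0-from=- | tail -n1

The second command should line up with the bc sum above, modulo block-allocation overhead.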
My suspicion has been that the persistent storage of consensus snapshots could be causing problems. By default, there is a persistent snapshot every 1k blocks (I changed that to 4k for the upcoming pull request). I added some snapshot storage logging in TRACE mode and ran it against the goerli-shadow-5 network.
As it turns out, the impact of reducing the number of persistent snapshots to roughly a quarter (every 4k blocks instead of every 1k) is negligible for the first 2.2m blocks: the disk storage size showed little difference between the samples.
Here are the statistics for syncing ~2.2m blocks.
Caching every 1k blocks:
[..]
blockNumber=2214912 nSnaps=2236 snapsTotal=1.14m
blockNumber=2215936 nSnaps=2237 snapsTotal=1.14m
[..]
Persisting blocks fromBlock=2216449 toBlock=2216640
36458496 datadir-nimbus-goerlish/data/nimbus/
Caching every 1k blocks after replacing legacy LRU handler:
[..]
blockNumber=2234368 nSnaps=2259 snapsTotal=1.15m
blockNumber=2235392 nSnaps=2260 snapsTotal=1.15m
[..]
Persisting blocks fromBlock=2235649 toBlock=2235840
37627288 datadir-nimbus-goerlish/data/nimbus/
Caching every 4k blocks after replacing legacy LRU handler:
[..]
blockNumber=2232320 nSnaps=620 snapsTotal=0.30m
blockNumber=2236416 nSnaps=621 snapsTotal=0.30m
[..]
Persisting blocks fromBlock=2237185 toBlock=2237376
37627288 datadir-nimbus-goerlish/data/nimbus/
Legend:
- nSnaps -- number of persistently cached snapshots
- snapsTotal -- accumulated snapshot payload (without DB metadata)
- Persisting blocks -- logging from the persistBlocks() method
- datadir-nimbus-goerlish/data/nimbus/ -- storage directory; size (in kibibytes) calculated with du -sk
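For scale, the du -sk figures convert from kibibytes to gibibytes with the same bc approach as earlier:

echo "scale=1; 37627288/1024/1024" | bc
35.8

So all three runs land around 35-36 GiB for ~2.2m blocks, which is why the snapshot interval looks negligible here.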
I'm running out of space on the host. Unless you have objections @mjfh, I will purge the node data and let it re-sync.
No objections :)
Done. Let's see how it looks after resyncing.
We got re-synced (I think?) and we're back to 1.6 TB:
[email protected]:~ % df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/nvme2n1 1.9T 1.6T 243G 87% /data
I'm not actually sure if we are synced because the RPC call times out:
[email protected]:~ % /data/nimbus-eth1-goerli-master/rpc.sh eth_syncing
curl: (28) Operation timed out after 10000 milliseconds with 0 bytes received
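As a fallback when the wrapper script times out, hitting the JSON-RPC endpoint directly with a longer timeout can distinguish a slow node from a wedged one (a sketch; port 8545 is an assumption about this node's config):

curl -s --max-time 60 -X POST -H 'Content-Type: application/json' \
     --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
     http://localhost:8545

eth_syncing returns false once fully synced, or an object with currentBlock/highestBlock while syncing; another timeout would point at the node itself rather than the script.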