
Massive spike in DB size on Goerli

Open jakubgs opened this issue 1 year ago • 12 comments

I've identified a very big spike in disk usage by our nimbus-eth1 node running on Goerli:

[image: graph of the disk usage spike]

It appears the version running at the time was https://github.com/status-im/nimbus-eth1/commit/7f0bc71b65f763c7b5b4a4987b4a0010828a5637:

Jun 29 00:00:05 metal-01.he-eu-hel1.nimbus.eth1 build.sh[2044102]: From https://github.com/status-im/nimbus-eth1
Jun 29 00:00:05 metal-01.he-eu-hel1.nimbus.eth1 build.sh[2044102]:  + 6e9eacb3...694d1c45 getLogs-endpoint -> origin/getLogs-endpoint  (forced update)
Jun 29 00:00:05 metal-01.he-eu-hel1.nimbus.eth1 build.sh[2044114]: HEAD is now at 7f0bc71b add invalidMissingAncestorReOrg test case

This caused the DB to grow up to 1.6 TB:

[email protected]:~ % sudo du -hs /data/nimbus-eth1-goerli-master/data
1.6T    /data/nimbus-eth1-goerli-master/data

That's over 10x more than a fully synced Geth node on Goerli:

[email protected]:~ % sudo du -hs /docker/nimbus-goerli/node/data 
131G    /docker/nimbus-goerli/node/data

jakubgs (Jul 11 '22 15:07)

It doesn't seem like sst files are the issue:

[email protected]:.../nimbus/data % ls | wc -l
25932
[email protected]:.../nimbus/data % ls *.sst | wc -l
25753
[email protected]:.../nimbus/data % ls *.sst | xargs du -hsc | tail -n1
202G	total
[email protected]:.../nimbus/data % du -hs .  
1.6T	.

jakubgs (Jul 11 '22 15:07)

But ncdu doesn't seem to show any massive files:

65M	/data/nimbus-eth1-goerli-master/data/shared_goerli_0/nimbus/data/1094254.sst
ncdu 1.14.1 ~ Use the arrow keys to navigate, press ? for help                                                                                                                                                                     
--- /data/nimbus-eth1-goerli-master/data/shared_goerli_0/nimbus/data ----------------------------------
  460.8 MiB [##########]  LOG.old.1656979293418180                                                                                                                                                                                 
  300.4 MiB [######    ]  LOG.old.1656720086681351
  122.1 MiB [##        ]  LOG.old.1657324897762741
   90.0 MiB [#         ]  LOG                     
   70.4 MiB [#         ]  1095853.log
   64.3 MiB [#         ]  1004708.sst
   64.3 MiB [#         ]  999379.sst 
   64.3 MiB [#         ]  1038531.sst
   64.3 MiB [#         ]  1038543.sst

jakubgs (Jul 11 '22 15:07)

The ls command shows essentially the same thing:

[email protected]:.../nimbus/data % ls -lS | head 
total 1648954516
-rw-r--r-- 1 nimbus staff  461M Jul  5 00:01 LOG.old.1656979293418180
-rw-r--r-- 1 nimbus staff  301M Jul  1 23:59 LOG.old.1656720086681351
-rw-r--r-- 1 nimbus staff  123M Jul  8 23:59 LOG.old.1657324897762741
-rw-r--r-- 1 nimbus staff   91M Jul 11 15:57 LOG
-rw-r--r-- 1 nimbus staff   65M Jul  8 23:56 1004708.sst
-rw-r--r-- 1 nimbus staff   65M Jul  8 20:43 999379.sst
-rw-r--r-- 1 nimbus staff   65M Jul 10 00:58 1038531.sst
-rw-r--r-- 1 nimbus staff   65M Jul 10 00:58 1038543.sst
-rw-r--r-- 1 nimbus staff   65M Jul 10 04:07 1046388.sst

jakubgs (Jul 11 '22 15:07)

Interestingly, we can also see a big spike in open files at the time:

[image: graph of open file descriptors showing the spike]

jakubgs (Jul 11 '22 16:07)

Running find shows only 395 GB out of the 1.6 TB:

[email protected]:.../nimbus/data % find . | xargs du -hsc | tail
65M	./1095797.sst
61M	./1095798.sst
10M	./1095799.sst
65M	./1095942.sst
65M	./1095943.sst
37M	./1095944.sst
65M	./1095945.sst
65M	./1095946.sst
26M	./1095947.sst
395G	total

This is very weird.
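
One possible explanation (not verified here): with ~26k paths, xargs may split the list across several du invocations, each of which prints its own total line, so tail only reports the total of the last batch. A batching-safe sketch, assuming GNU find and du:

# Feed NUL-terminated paths to a single du process so the grand total covers every file.
find . -type f -print0 | du -ch --files0-from=- | tail -n1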

jakubgs (Jul 11 '22 16:07)

If I try to copy the contents to the root volume, I can see it takes up more than 395 GB:

sudo rsync --progress -aur /data/nimbus-eth1-goerli-master /data_new/
[email protected]:/data_new/.../data % find | xargs du -hsc
424G	.
424G	total

So there's something off with how I'm counting the file sizes.

jakubgs (Jul 12 '22 11:07)

Actually, if I manually add up the sizes of all files using bc it does add up to 1.69 TB:

[email protected]:.../data % echo "scale=2;($(find . -printf '%s+')0)/(10^9)" | bc
1694.15

So there's something wrong with how du and ncdu calculate the total size.
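
For what it's worth, GNU du can report the on-disk usage and the apparent size side by side, which might help pin down where the discrepancy comes from; a quick sketch:

# On-disk usage (allocated blocks) vs. apparent size (sum of file lengths, roughly what the bc one-liner computes).
du -sh .
du -sh --apparent-size .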

jakubgs (Jul 12 '22 15:07)

My suspicion has been that the persistent storage of consensus snapshots could be causing problems. By default, a snapshot is persisted every 1k blocks (I changed that to 4k for the upcoming pull request). I added some snapshot storage logging in TRACE mode and ran it against the goerli-shadow-5 network.

As it turns out, the impact of cutting the number of persistent snapshots to roughly a quarter is negligible for the first 2.2m blocks. The disk storage size showed little difference between the samples.

Here are the statistics for syncing ~2.2m blocks.

Caching every 1k blocks:

[..]
blockNumber=2214912 nSnaps=2236 snapsTotal=1.14m
blockNumber=2215936 nSnaps=2237 snapsTotal=1.14m
[..]
Persisting blocks fromBlock=2216449 toBlock=2216640
36458496 datadir-nimbus-goerlish/data/nimbus/

Caching every 1k blocks, after replacing the legacy LRU handler:

[..]
blockNumber=2234368 nSnaps=2259 snapsTotal=1.15m
blockNumber=2235392 nSnaps=2260 snapsTotal=1.15m
[..]
Persisting blocks fromBlock=2235649 toBlock=2235840
37627288 datadir-nimbus-goerlish/data/nimbus/

Caching every 4k blocks, after replacing the legacy LRU handler:

[..]
blockNumber=2232320 nSnaps=620 snapsTotal=0.30m
blockNumber=2236416 nSnaps=621 snapsTotal=0.30m
[..]
Persisting blocks fromBlock=2237185 toBlock=2237376
37627288 datadir-nimbus-goerlish/data/nimbus/

Legend:

  • nSnaps -- number of persistently cached snapshots
  • snapsTotal -- accumulated snapshot payload (without DB metadata)
  • Persisting blocks -- logging from persistBlocks() method
  • datadir-nimbus-goerlish/data/nimbus/ -- storage directory, size (in kibibytes) calculated with du -sk
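
A rough way to pull the same counters out of such a TRACE log after a run (the log file name nimbus-trace.log is hypothetical):

# Print the most recent snapshot counters from the TRACE log, then the datadir size in KiB.
grep -o 'nSnaps=[0-9]* snapsTotal=[^ ]*' nimbus-trace.log | tail -n1
du -sk datadir-nimbus-goerlish/data/nimbus/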

mjfh (Jul 21 '22 08:07)

I'm running out of space on the host. Unless you have objections, @mjfh, I will purge the node data to let it re-sync.

jakubgs (Aug 08 '22 09:08)

No objections :)

mjfh (Aug 08 '22 11:08)

Done. Let's see how it looks after resyncing.

jakubgs (Aug 08 '22 18:08)

We got re-synced (I think?) and we're back to 1.6 TB:

[email protected]:~ % df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme2n1    1.9T  1.6T  243G  87% /data

I'm not actually sure if we are synced because the RPC call times out:

[email protected]:~ % /data/nimbus-eth1-goerli-master/rpc.sh eth_syncing
curl: (28) Operation timed out after 10000 milliseconds with 0 bytes received
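
The timeout by itself doesn't show whether the node is stuck or merely slow to answer. A direct call with a longer timeout might; a sketch assuming the default HTTP-RPC endpoint on 127.0.0.1:8545 (rpc.sh's internals aren't shown here):

# Query eth_syncing directly, allowing up to 60 seconds for a reply.
curl -s --max-time 60 -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://127.0.0.1:8545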

jakubgs (Aug 18 '22 16:08)