resource usage during sync

Open stefantalpalaru opened this issue 5 years ago • 17 comments

SVG graph with per-process statistics provided by pidstat (missing the network activity, for now, but still interesting): SVG graph

That CPU usage over 100% must come from the multi-threaded rocksdb library.

stefantalpalaru avatar Mar 05 '19 11:03 stefantalpalaru

Interesting, the disk write pattern shows there is still room for improvement.

jangko avatar Mar 05 '19 12:03 jangko

Another run of ./build/nimbus --prune:archive --port:30304, with network traffic added (using nethogs), better colours and some variables drawn as areas instead of lines.

Full dataset, one second per pixel:

nimbus3-long.svg

Five seconds per pixel, to better see the memory leak:

nimbus3-short.svg

Except for the short RocksDB spikes every 6-7 minutes or so, most of the time is spent waiting for data from the network or maxing out a CPU core while processing that data. It all looks very serialised, which means it will benefit from parallelisation.

The average download speed is extremely low, at 7.83 kB/s. Disk I/O is a non-issue right now, on this SSD I'm using.

stefantalpalaru avatar Mar 06 '19 02:03 stefantalpalaru

Is the steadily climbing red RSS line an indicator of a memory leak? If yes, that is very bad.

jangko avatar Mar 06 '19 02:03 jangko

Yep: https://en.wikipedia.org/wiki/Resident_set_size When it drops, that's a garbage collection. I don't see a legitimate reason to hang on to so much data in RAM during execution, so my guess is that the upward trend is due to a memory leak.

What's weird is that even the stack keeps growing, albeit much slower.

stefantalpalaru avatar Mar 06 '19 02:03 stefantalpalaru

Which range of blocks did you sync? I mean the block numbers. I noticed that during blocks 600K to 700K memory consumption is very high, then it stabilises at blocks 800K to 900K. I think I will do some measurements myself to improve block sync speed.

@stefantalpalaru: can you share your script with me? How did you produce that SVG?

jangko avatar Mar 06 '19 02:03 jangko

Which range of blocks did you sync? I mean the block numbers. I noticed that during blocks 600K to 700K memory consumption is very high, then it stabilises at blocks 800K to 900K.

I started with an empty db and let it run until it crashed due to an assert in transaction rollback (vendor/nim-eth/eth/trie/db.nim:145 - "doAssert t.db.mostInnerTransaction == t and t.state == Pending").

I don't see block numbers in the output log, because those are logged at the TRACE level which is not included by default.

@stefantalpalaru: can you share your script with me? How did you produce that SVG?

Freshly published: https://github.com/status-im/process-stats

stefantalpalaru avatar Mar 06 '19 14:03 stefantalpalaru

thank you very much.

jangko avatar Mar 06 '19 15:03 jangko

The backend database contributes significantly to block syncing speed. Once the database size reaches 20GB+, it becomes slower and slower because RocksDB is doing background compaction. Writing to the database does not seem to slow down, thanks to the WAL (write-ahead log) mechanism, but reading from the database can be really slow when it competes with compaction.

At 50GB+ (900K blocks), it becomes very slow. My current solution: I create separate databases on separate physical drives. Every time I have synced around 20GB, I move the database to drive A and open it there as a read-only database.

When a database is opened as read-only on drive A, its compaction runs faster because it does not have to compete with the regular read/write operations on drive B.

Without this poor man's sharding, drive activity is always at 100%; with this simple sharding, disk activity on both drive A and drive B stays below 30%.

For comparison: with a single db, syncing 1.4M blocks takes many hours, but with several 20GB dbs it takes less than one hour.
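
To make the scheme concrete, here is a minimal sketch of that poor man's sharding in terms of the plain RocksDB C++ API, not the code Nimbus actually runs; the paths, shard names and the fall-through lookup are placeholders invented for the example. Writes go only to the active database on drive B, while the frozen ~20GB shards on drive A are opened read-only and are only consulted on reads.

  #include <memory>
  #include <string>
  #include <vector>

  #include "rocksdb/db.h"

  int main() {
    // Active database on drive B: the only one that receives new writes.
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::DB* active_raw = nullptr;
    rocksdb::Status st = rocksdb::DB::Open(options, "/driveB/nimbus-active", &active_raw);
    if (!st.ok()) return 1;
    std::unique_ptr<rocksdb::DB> active(active_raw);

    // Frozen ~20GB shards that were moved to drive A, opened read-only so
    // their files no longer compete with the activity on drive B.
    std::vector<std::unique_ptr<rocksdb::DB>> shards;
    for (const std::string& path : {std::string("/driveA/shard-0"),
                                    std::string("/driveA/shard-1")}) {
      rocksdb::DB* shard = nullptr;
      if (rocksdb::DB::OpenForReadOnly(rocksdb::Options(), path, &shard).ok())
        shards.emplace_back(shard);
    }

    // Writes always land in the active database.
    active->Put(rocksdb::WriteOptions(), "some-key", "some-value");

    // Reads fall through: the active database first, then each shard in turn.
    auto lookup = [&](const std::string& key, std::string* value) {
      if (active->Get(rocksdb::ReadOptions(), key, value).ok()) return true;
      for (auto& shard : shards)
        if (shard->Get(rocksdb::ReadOptions(), key, value).ok()) return true;
      return false;
    };

    std::string value;
    lookup("some-key", &value);  // found in the active database in this example
    return 0;
  }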

jangko avatar Mar 19 '19 02:03 jangko

Thanks for sharing this, @jangko. BTW, how does the lmdb performance compare to rocksdb?

zah avatar Mar 19 '19 08:03 zah

I stopped using it because it was slower than RocksDB when syncing below 100K blocks; I don't know how it performs once it contains more data.

jangko avatar Mar 19 '19 09:03 jangko

Would it be possible to actually use this "poor man's sharding" approach as a solution? Maybe divide the data into 10GB snapshots, where each snapshot is one shard, i.e. one RocksDB database, and then use those same snapshots to retrieve data across the network for faster sync among Nimbus clients?

Swader avatar Mar 20 '19 13:03 Swader

https://www.zeroknowledge.fm/9 - interview with one of the parity devs about how they're tuning rocksdb

arnetheduck avatar Mar 20 '19 13:03 arnetheduck

A look at allocated RAM (RSS) versus heap usage according to the GC:

nimbus4.svg

heap.svg

To get these heap stats, I added at the end of persistBlocks(), in nimbus/p2p/chain.nim:

  dumpNumberOfInstances()        # per-type instance counts, available with -d:nimTypeNames
  echo "===", getTime().toUnix() # Unix-timestamp marker separating successive dumps

(and an import times line above the function)

Nimbus compile flags: make NIMFLAGS="--opt:speed -d:nimTypeNames" nimbus

I ran Nimbus like this: rm -rf ~/.cache/nimbus/db; ./build/nimbus --prune:archive --maxpeers:250 --log-level:trace --log-file:output6.log > heap.txt

I processed "heap.txt" using this quick and dirty script: https://gist.github.com/stefantalpalaru/0b502def452591aaca289ec8fc119e8b


This looks like memory fragmentation to me, with the RSS growing from 47 to 219 MiB in 37 minutes.

The memory leak is extremely small in comparison, with the used heap minimum going from about 5 to about 10 MiB.

stefantalpalaru avatar Mar 31 '19 17:03 stefantalpalaru

Currently, our RocksDB uses the default configuration:

  • target_file_size_base=64MB
  • target_file_size_multiplier=1
  • filter_policy=null

If we change some of these settings (sketched in code after the list):

  • target_file_size_base=64MB
  • target_file_size_multiplier=4 or 8 -> reduces the number of files and of open file descriptors, giving faster file access.
  • filter_policy=10-bit bloom filter -> speeds up random reads for accounts that are not in the state trie.
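
Sketched below, with the plain RocksDB C++ API, is roughly what those proposed settings would look like; the actual change would have to go through the RocksDB wrapper Nimbus uses, and the database path here is just a placeholder.

  #include "rocksdb/db.h"
  #include "rocksdb/filter_policy.h"
  #include "rocksdb/table.h"

  int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    // Larger SST files per level: fewer files and fewer open file
    // descriptors, hence faster file access.
    options.target_file_size_base = 64ull * 1024 * 1024;  // 64MB
    options.target_file_size_multiplier = 4;              // or 8

    // A 10-bits-per-key bloom filter lets reads skip SST files that cannot
    // contain the key, speeding up random reads for accounts that are not
    // in the state trie.
    rocksdb::BlockBasedTableOptions table_options;
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/tuned-db", &db);
    if (!status.ok()) return 1;
    delete db;
    return 0;
  }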

jangko avatar Apr 15 '19 15:04 jangko

@jangko, can we use Premix's regress tool as a benchmarking utility when deciding whether to go for these RocksDB tweaks? It would be nice if we could create a database of blocks that can be distributed efficiently to multiple machines with various hardware configurations; then we could use regress to obtain statistics telling us the best possible settings.

zah avatar Apr 25 '19 11:04 zah

regress is too complicated. I observed that the bottleneck of database operations comes from building the state trie.

Here is what I have done: block 4,174,280 already contains 5,819,335 accounts (~24.9GB), and it took almost 19 hours to move those 5.8M accounts from one SSD to another SSD.

We can use the hexary trie to tweak and benchmark the database. Both the hexary trie and the database need more optimization.

jangko avatar Apr 25 '19 12:04 jangko

Apparently this is still an issue: syncing a fresh Nimbus instance on a high-performance machine results in mediocre sync performance (less than 10 blocks/s), with one thread blocked at 100% and all the other cores below 10% CPU usage.

SjonHortensius avatar Oct 13 '22 07:10 SjonHortensius