resource usage during sync

Open stefantalpalaru opened this issue 5 years ago • 17 comments

SVG graph with per-process statistics provided by pidstat (missing the network activity, for now, but still interesting): SVG graph

That CPU usage over 100% must come from the multi-threaded rocksdb library.

stefantalpalaru avatar Mar 05 '19 11:03 stefantalpalaru

Interesting, the disk write pattern shows there is still room for improvement.

jangko avatar Mar 05 '19 12:03 jangko

Another run of ./build/nimbus --prune:archive --port:30304, with network traffic added (using nethogs), better colours and some variables drawn as areas instead of lines.

Full dataset, one second per pixel:

nimbus3-long.svg

Five seconds per pixel, to better see the memory leak:

nimbus3-short.svg

Except for the short RocksDB spikes every 6-7 minutes or so, most of the time is spent waiting for data from the network or maxing out a CPU core while processing that data. It all looks very serialised, which means it will benefit from parallelisation.

The average download speed is extremely low, at 7.83 kB/s. Disk I/O is a non-issue right now, on this SSD I'm using.

stefantalpalaru avatar Mar 06 '19 02:03 stefantalpalaru

Is the steadily climbing red RSS line an indicator of a memory leak? If yes, that is very bad.

jangko avatar Mar 06 '19 02:03 jangko

Yep: https://en.wikipedia.org/wiki/Resident_set_size When it drops, that's a garbage collection. I don't see a legitimate reason to hang on to so much data in RAM during execution, so my guess is that the upward trend is due to a memory leak.

What's weird is that even the stack keeps growing, albeit much slower.

stefantalpalaru avatar Mar 06 '19 02:03 stefantalpalaru

Which range of blocks did you sync? I mean the block numbers. I noticed that during blocks 600K to 700K memory consumption is very high, then it stabilises at blocks 800K to 900K. I think I will do some measurements myself to improve block sync speed.

@stefantalpalaru: can you share your script with me? How did you produce that SVG?

jangko avatar Mar 06 '19 02:03 jangko

Which range of blocks did you sync? I mean the block numbers. I noticed that during blocks 600K to 700K memory consumption is very high, then it stabilises at blocks 800K to 900K.

I started with an empty db and let it run until it crashed due to an assert in transaction rollback (vendor/nim-eth/eth/trie/db.nim:145 - "doAssert t.db.mostInnerTransaction == t and t.state == Pending").

I don't see block numbers in the output log, because those are logged at the TRACE level which is not included by default.

@stefantalpalaru: can you share your script with me? How did you produce that SVG?

Freshly published: https://github.com/status-im/process-stats

stefantalpalaru avatar Mar 06 '19 14:03 stefantalpalaru

thank you very much.

jangko avatar Mar 06 '19 15:03 jangko

The backend database contributes significantly to block syncing speed. Once the database size reaches 20GB+, it becomes slower and slower because RocksDB is doing background compaction. Writing to the database does not seem to slow down, thanks to the WAL (write-ahead log) mechanism, but reading from the database can be really slow when it competes with compaction.

At 50GB+ (900K blocks), it becomes very slow. My current solution: I create separate databases on separate physical drives. Every time I have synced around 20GB, I move the database to drive A and open it there as a read-only database.

When a database is opened as read-only on drive A, its compaction runs faster because it does not have to compete with the regular read/write operations on drive B.

Without this poor man's sharding, drive activity is always at 100%; with this simple sharding, disk activity on both drive A and drive B stays below 30%.

For comparison: with a single db, syncing 1.4M blocks takes many hours, but with several 20GB dbs it takes less than one hour.
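
To make the scheme concrete, here is a minimal sketch of that poor man's sharding in terms of the plain RocksDB C++ API, not the code Nimbus actually runs; the paths, shard names and the fall-through lookup are placeholders invented for the example. Writes go only to the active database on drive B, while the frozen ~20GB shards on drive A are opened read-only and are only consulted on reads.

  #include <memory>
  #include <string>
  #include <vector>

  #include "rocksdb/db.h"

  int main() {
    // Active database on drive B: the only one that receives new writes.
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::DB* active_raw = nullptr;
    rocksdb::Status st = rocksdb::DB::Open(options, "/driveB/nimbus-active", &active_raw);
    if (!st.ok()) return 1;
    std::unique_ptr<rocksdb::DB> active(active_raw);

    // Frozen ~20GB shards that were moved to drive A, opened read-only so
    // their files no longer compete with the activity on drive B.
    std::vector<std::unique_ptr<rocksdb::DB>> shards;
    for (const std::string& path : {std::string("/driveA/shard-0"),
                                    std::string("/driveA/shard-1")}) {
      rocksdb::DB* shard = nullptr;
      if (rocksdb::DB::OpenForReadOnly(rocksdb::Options(), path, &shard).ok())
        shards.emplace_back(shard);
    }

    // Writes always land in the active database.
    active->Put(rocksdb::WriteOptions(), "some-key", "some-value");

    // Reads fall through: the active database first, then each shard in turn.
    auto lookup = [&](const std::string& key, std::string* value) {
      if (active->Get(rocksdb::ReadOptions(), key, value).ok()) return true;
      for (auto& shard : shards)
        if (shard->Get(rocksdb::ReadOptions(), key, value).ok()) return true;
      return false;
    };

    std::string value;
    lookup("some-key", &value);  // found in the active database in this example
    return 0;
  }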

jangko avatar Mar 19 '19 02:03 jangko

Thanks for sharing this, @jangko. BTW, how does the lmdb performance compare to rocksdb?

zah avatar Mar 19 '19 08:03 zah

I stopped using it because it was slower than RocksDB when syncing below 100K blocks; I don't know how it performs once it contains more data.

jangko avatar Mar 19 '19 09:03 jangko

Would it be possible to actually use this "poor man's sharding" approach as a solution? Maybe divide the data into 10GB snapshots, where each snapshot is one shard, i.e. one RocksDB database, and then use those same snapshots to retrieve data across the network for faster sync among Nimbus clients?

Swader avatar Mar 20 '19 13:03 Swader

https://www.zeroknowledge.fm/9 - interview with one of the parity devs about how they're tuning rocksdb

arnetheduck avatar Mar 20 '19 13:03 arnetheduck

A look at allocated RAM (RSS) versus heap usage according to the GC:

nimbus4.svg

heap.svg

To get these heap stats, I added at the end of persistBlocks(), in nimbus/p2p/chain.nim:

  dumpNumberOfInstances()        # per-type instance counts, available with -d:nimTypeNames
  echo "===", getTime().toUnix() # Unix-timestamp marker separating successive dumps

(and an import times line above the function)

Nimbus compile flags: make NIMFLAGS="--opt:speed -d:nimTypeNames" nimbus

I ran Nimbus like this: rm -rf ~/.cache/nimbus/db; ./build/nimbus --prune:archive --maxpeers:250 --log-level:trace --log-file:output6.log > heap.txt

I processed "heap.txt" using this quick and dirty script: https://gist.github.com/stefantalpalaru/0b502def452591aaca289ec8fc119e8b


This looks like memory fragmentation to me, with the RSS growing from 47 to 219 MiB in 37 minutes.

The memory leak is extremely small in comparison, with the used heap minimum going from about 5 to about 10 MiB.

stefantalpalaru avatar Mar 31 '19 17:03 stefantalpalaru

Currently, our RocksDB uses the default configuration:

  • target_file_size_base=64MB
  • target_file_size_multiplier=1
  • filter_policy=null

If we change some of these settings (sketched in code after the list):

  • target_file_size_base=64MB
  • target_file_size_multiplier=4 or 8 -> reduces the number of files and of open file descriptors, giving faster file access.
  • filter_policy=10-bit bloom filter -> speeds up random reads for accounts that are not in the state trie.
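
Sketched below, with the plain RocksDB C++ API, is roughly what those proposed settings would look like; the actual change would have to go through the RocksDB wrapper Nimbus uses, and the database path here is just a placeholder.

  #include "rocksdb/db.h"
  #include "rocksdb/filter_policy.h"
  #include "rocksdb/table.h"

  int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    // Larger SST files per level: fewer files and fewer open file
    // descriptors, hence faster file access.
    options.target_file_size_base = 64ull * 1024 * 1024;  // 64MB
    options.target_file_size_multiplier = 4;              // or 8

    // A 10-bits-per-key bloom filter lets reads skip SST files that cannot
    // contain the key, speeding up random reads for accounts that are not
    // in the state trie.
    rocksdb::BlockBasedTableOptions table_options;
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/tuned-db", &db);
    if (!status.ok()) return 1;
    delete db;
    return 0;
  }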

jangko avatar Apr 15 '19 15:04 jangko

@jangko, can we use Premix's regress tool as a benchmarking utility when deciding whether to go for these RocksDB tweaks? It would be nice if we could create a database of blocks that can be distributed efficiently to multiple machines with various hardware configurations; then we could use regress to obtain statistics telling us the best possible settings.

zah avatar Apr 25 '19 11:04 zah

regress is too complicated. I observed that the bottleneck of database operations comes from building the state trie.

Here is what I have done: block 4,174,280 already contains 5,819,335 accounts (~24.9GB), and it took almost 19 hours to move those 5.8M accounts from one SSD to another SSD.

We can use the hexary trie to tweak and benchmark the database. Both the hexary trie and the database need more optimization.

jangko avatar Apr 25 '19 12:04 jangko

Apparently this is still an issue: syncing a fresh Nimbus instance on a high-performance machine results in mediocre sync performance (less than 10 blocks/s), with one thread blocked at 100% and all the other cores below 10% CPU usage.

SjonHortensius avatar Oct 13 '22 07:10 SjonHortensius