
Throughput not increasing with increased parallelism

Open patelprateek opened this issue 11 months ago • 5 comments

RocksDB versions 6.8.1 and 8.3.1

Performing bulk-load throughput benchmarking on 16-core and 32-core machines. I am doing batch writes (similar observation with Put).

For the bulk load I turn off all compactions, so all memtables get flushed to L0. I use N threads in parallel to read data from remote storage (GCS and S3 buckets; network bandwidth is 100 GBps), and the threads write in parallel to the same DB instance. Beyond a certain point I do not see a higher disk write rate or CPU utilization, and I am not really sure where the contention is coming from or what the saturation point is.

For example, I use network-attached SSD (pd-ssd on GCP), which should easily support a 1.2 GB/s write rate, but even when increasing the threads (that read and add data to the skiplist memtable), I am not able to achieve more than 400 MB/s and about 50% CPU utilization in a workload where I ingest 200+ GB of data (for smaller workloads of 100 GB or less, I get at most about 300 MB/s write rate and 40% CPU utilization).

Can someone advise me on how to maximize either CPU or disk write rate? Why does increasing parallelism not help? In fact, past a certain point, going from 16 threads to 32 threads actually gives slightly worse performance. I checked the logs and I am not seeing any stalls.

Any ideas on where the contention comes from? When writing to multiple DB instances I do see increased CPU utilization and a better write rate, but the ingestion process (merge and flush of memtables, plus compaction) takes longer overall. Can someone explain this behavior: why do multiple DB instances increase CPU and write rate but take longer overall for ingestion and compaction, while increasing threads does not improve CPU or disk write rate?
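One way to reason about the plateau (a back-of-envelope sketch, not a RocksDB measurement; the 10% serial fraction below is an assumed illustrative number): if some fraction of each write is serialized, e.g. a single writer group and memtable per DB, Amdahl's law caps the speedup no matter how many reader/writer threads are added, while separate DB instances each get their own serial section and so scale CPU further.

```python
# Amdahl's-law sketch: if a fraction `s` of each write is serialized
# (single writer group / memtable per DB instance), adding threads
# stops helping well before the hardware saturates.
# s = 0.10 is an illustrative assumption, not a measured value.
def max_speedup(s: float, threads: int) -> float:
    return 1.0 / (s + (1.0 - s) / threads)

for n in (1, 8, 16, 32):
    print(f"{n:2d} threads -> max speedup {max_speedup(0.10, n):.2f}x")
# e.g. 16 threads -> 6.40x, 32 threads -> 7.80x: doubling threads gains little
```

This would be consistent with the observation that 16→32 threads is flat (or slightly worse, once contention overhead is added), while N independent DB instances keep scaling.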

patelprateek avatar Aug 03 '23 08:08 patelprateek

I also use LZ4 compression, by the way, for all levels (I thought LZ4 can easily crunch through a few GB/s and should not be a bottleneck).

patelprateek avatar Aug 03 '23 08:08 patelprateek

So that I understand, you are only asking about performance when compaction is disabled, so all that happens is memtable flushes to L0? I don't have much experience in that setup.

For cases where leveled compaction is enabled all of the time, I don't use compression for L0, L1, or L2 because it doesn't save space (those are small levels); it costs a lot of CPU and the benefit is only a small reduction in the write rate to storage. But my advice might not apply when compaction is disabled.

Regardless, a goal is to figure out where the bottleneck is. Things that might help include:

  • compaction stats (db_bench --benchmarks=overwrite,stats is how to get them with db_bench)
  • stack traces -- see poormansprofiler.org

mdcallag avatar Sep 06 '23 21:09 mdcallag

@mdcallag : yes, I am doing a bulk load of ~250 GB or more of data, so I turn off all compaction and do a final compaction once all ~250 GB is in L0. I was trying to optimize the job so that it is write-throughput bound, i.e. the max throughput my network-attached SSDs can provide (~1.2 GB/s). Unfortunately I am not able to go beyond 400-500 MB/s. I tried increasing the read threads so that I can read and push data faster, but that didn't help; it saturated around 16 threads (i.e. going from 16 to 32 didn't help).
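For scale, the gap works out as follows (a back-of-envelope calculation using only the figures quoted in this thread; nothing here is measured):

```python
# Time to land 250 GB in L0 at the observed vs. rated write rate.
# 450 MB/s is the midpoint of the 400-500 MB/s observed above;
# 1200 MB/s is the pd-ssd rated throughput quoted above.
total_mb = 250 * 1024
for label, rate_mb_s in [("observed ~450 MB/s", 450),
                         ("disk limit 1200 MB/s", 1200)]:
    print(f"{label}: ~{total_mb / rate_mb_s:.0f} s")
# observed ~450 MB/s: ~569 s; disk limit 1200 MB/s: ~213 s
```

So closing the gap would cut the L0-load phase by more than half, which is why finding the contention point matters here.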

So I was trying to figure out what the point of contention is: compression, checksums, etc.

Overall I do find that enabling compression helps my workload quite a bit; otherwise, both flushing to L0 and then compacting to L1 take almost 1.5x longer (I am talking about the case where I am bulk loading 250+ GB of data).

patelprateek avatar Sep 07 '23 00:09 patelprateek

I don't know whether the perf numbers/problems you describe occur during load (memtable flushes only) or during final compaction.

I don't understand the throughput numbers you quote for lz4 and the GCP disk -- lz4 here is compressing a block (maybe 8 KB) at a time, and the throughput from that will be less than compressing large objects. Hopefully the GCP disk throughput is limited by MB transferred per second and not by IOPs, but I don't know.
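The block-at-a-time point can be illustrated with a stand-in (zlib from the Python stdlib, since lz4 is not in the stdlib; the 8 KB block size and the repetitive payload are assumptions for illustration): compressing small blocks independently adds per-call overhead and loses cross-block redundancy compared with compressing one large buffer, so quoted large-buffer GB/s figures overstate what per-block compression achieves.

```python
import zlib

# Repetitive 8 MiB payload (illustrative; real SST data differs).
data = (b"rocksdb-bulkload-" * 64)[:1024] * 8192   # 8 MiB
block = 8 * 1024                                   # 8 KiB "page" size

# Compress each 8 KiB block independently (roughly what
# block-based table compression does).
per_block = sum(len(zlib.compress(data[i:i + block]))
                for i in range(0, len(data), block))

# Compress the whole buffer at once.
whole = len(zlib.compress(data))

print(f"per-block: {per_block} bytes, whole-buffer: {whole} bytes")
# per-block total comes out larger: each block repeats header
# overhead and cannot reference matches in earlier blocks
```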

WRT cores, are you giving numbers for HW threads or real cores -- is hyperthreading enabled?

It is reasonable that compression will help when the workload is just doing memtable flush. But as soon as compaction is enabled there will be a price to pay during L0->L1 and L1->L2 compactions which can increase write stalls. But write stalls might not be an issue for your use case.

However, even if it helps during memtable flush in this case, it is likely to hurt during the big first compaction once the load is done. Whether the benefit exceeds the cost is hard to know. I guess you will find out.

For debugging, getting stats or stack traces will help a lot.

For perf -- just guessing, because I don't know what the bottleneck is:

  • increasing the number of threads devoted to flushing memtables might help
  • disabling fsync to the RocksDB WAL might help, at the risk of losing commits on a server panic

mdcallag avatar Sep 12 '23 13:09 mdcallag

On the GCP disk, yes, I don't think we will be saturating IOPS before the write throughput (I am monitoring both; we are using network-attached SSDs for the indexing pipelines, and local SSDs on the serving side, about which I have created another thread regarding latency optimizations and stats).

For the bulk load: I turn off compactions, turn off all fsyncs, and disable the WAL.

I am trying to measure how soon I can move all the data to L0 (initial writes to memtables, then writing them to L0 using 32 concurrent background flush jobs, 64 MB memtables, 64-128 MB L0 files, typically merging 1 or 2 memtables per flush).
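For anyone trying to reproduce this setup, the settings described above roughly map to a RocksDB OPTIONS-file fragment like the following (a sketch based only on this thread's description; the key names are standard OPTIONS-file keys, the values are the ones quoted above, and note that disabling the WAL is done per write via `WriteOptions.disableWAL` rather than in the OPTIONS file):

```ini
[DBOptions]
  # 32 concurrent background flush jobs, as described above
  max_background_jobs=32

[CFOptions "default"]
  # bulk load: no auto compaction, everything lands in L0
  disable_auto_compactions=true
  # 64 MB memtables
  write_buffer_size=67108864
  # LZ4 on all levels, as mentioned earlier in the thread
  compression=kLZ4Compression
```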

I measure the manual compaction I do after the initial ingestion separately. I think I understand that part relatively well with respect to the number of parallel compactions; we resolved it earlier on another thread about enabling parallelization (intra-level compaction).

patelprateek avatar Sep 12 '23 19:09 patelprateek