kv: tune single-threaded raft snapshot throughput
Raft snapshots are typically throttled by the kv.snapshot_rebalance.max_rate
setting at 32MB/s. This prevents these snapshots from being disruptive to the workload, which is a risk until snapshots are integrated into admission control (https://github.com/cockroachdb/cockroach/issues/14768).
However, there are times when users want to increase this rate to recover or rebalance more quickly, so they increase the rate limit setting. We have found that once this rate limit is increased, snapshots will still only stream at a maximum of around 120MB/s, well below the hardware limits for well-provisioned systems (10-25 Gb/s networks, ~1 GB/s disk IO).
We should find and eliminate the bottlenecks in snapshot transmission so that users who want to can saturate their hardware.
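For reference, raising the limit is a one-statement cluster setting change. A minimal sketch via the SQL shell; the 512 MiB value and the omitted connection flags are illustrative only:

cockroach sql -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '512MiB';"
cockroach sql -e "SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;"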
Jira issue: CRDB-37766
Epic: CRDB-37617
cc @cockroachdb/replication
I am excited to look into this one
We have found that once this rate limit is increased, snapshots will still only stream at a maximum of around 120MB/s
Just wanted to confirm: where do we measure the snapshot stream throughput? Is it in the CRDB UI console?
That's correct, the Replication dashboard page has the snapshot received bytes rate metric. The metric is range.snapshots.rcvd-bytes.
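If you want the raw counter outside the console, a hedged option is to scrape the node's Prometheus endpoint; the underscore form of the metric name is an assumption about how the exporter sanitizes it, so verify against your build:

curl -ks https://localhost:8080/_status/vars | grep range_snapshots_rcvd_bytes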
I did some testing in a Roblox environment and I cannot reproduce this bottleneck with v23.2.2. May I get more details on the testing setup that produced the 120MB/s limit?
My setup is a 21-node multi-region cluster with 10 million rows of the bank workload, and I configured kv.snapshot_rebalance.max_rate to 1024 MiB.
cockroach workload init bank --rows 10000000 --ranges 3500 --payload-bytes 10000 --data-loader IMPORT
cockroach workload run bank --rows 10000000 --batch-size 100 --payload-bytes 10000 --concurrency 64 --tolerate-errors
Then I paused 3 nodes in the Chicago region, and it seems like the snapshot recv rate nearly saturates the limit.
"snapshots will still only stream at a maximum of around 120MB/s" refers to single-node throughput, while what I was measuring is cluster-wide. Please ignore my previous comment.
I have confirmed the snapshot throughput limit and have some questions about it.
Details: when I looked at the snapshot sent and received metrics per node, I noticed they are about the same (although the two do not have a 1:1 mapping). I wondered if the bottleneck is in the send process. This is the code where snapshots are sent, and below are some matching logs from our system that show the ~120MB/s limit:
I240429 20:46:34.052031 137927657 13@kv/kvserver/store_snapshot.go:1869 ⋮ [T1,Vsystem,n60,s60,r1279894/1:‹/Table/318/1/95{292542-339525}›] 5636 streamed snapshot fb91a7a3 at applied index 55218 to (n44,s44):7LEARNER with 450 MiB in 3.83s @ 117 MiB/s: kvs=46998 rangeKVs=0 sharedSSTs=0, rate-limit: 1.0 GiB/s, queued: 0.00s
I240429 20:46:43.427214 137931148 13@kv/kvserver/store_snapshot.go:1869 ⋮ [T1,Vsystem,n60,s60,r1278868/1:‹/Table/318/1/7{29645…-30000…}›] 5637 streamed snapshot e9274d39 at applied index 55724 to (n18,s18):11LEARNER with 340 MiB in 2.85s @ 119 MiB/s: kvs=35506 rangeKVs=0 sharedSSTs=0, rate-limit: 1.0 GiB/s, queued: 0.00s
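These lines can also be pulled straight out of the node logs. A minimal sketch, assuming a default local log directory (adjust the path to your --log-dir):

grep -R "streamed snapshot" cockroach-data/logs/ | tail -n 20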
Diving into the code a bit more, I found there are a number of timers that time different phases of the snapshot sending process: https://github.com/cockroachdb/cockroach/blob/4e178debf6f4fb438ab2464b33f6ee932d2fb356/pkg/kv/kvserver/store_snapshot.go#L740C1-L752C37
Is there any way I can visualize the timer results that might give a good indication of the bottleneck? cc @kvoli
Is there any way I can visualize the timer results that might give a good indication of the bottleneck?
You may be able to visualize the timer results by manually pushing a range that is under-replicated via the replicate queue. You can do this via the advanced debug page or directly via the endpoint at #/debug/enqueue_range?rangeID=XXX. This will produce a trace table which should show those timing events.
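A hedged sketch for finding a candidate rangeID to plug into that endpoint, assuming crdb_internal.ranges is queryable on your version and that 3 is the target replication factor (connection flags omitted):

cockroach sql -e "SELECT range_id, array_length(replicas, 1) AS num_replicas FROM crdb_internal.ranges WHERE array_length(replicas, 1) < 3 LIMIT 10;"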
The /debug/enqueue_range endpoint is wonderful; I was able to capture a couple of traces:
tag_groups:<name:"snapshot_timing_tag" tags:<key:"rateLimit" value:"3ms" > tags:<key:"totalTime" value:"4.1s" > tags:<key:"iter" value:"4.1s" > tags:<key:"send" value:"577ms" > >
tag_groups:<name:"snapshot_timing_tag" tags:<key:"rateLimit" value:"973µs" > tags:<key:"totalTime" value:"2.6s" > tags:<key:"iter" value:"2.6s" > tags:<key:"send" value:"211ms" > > start_time:<seconds:1714591174 nanos:262669556 > duration:<seconds:4 nanos:859442097 >
From the trace it appears that most of the time is spent in IterateReplicaKeySpans. Is it a good direction to dive deep into that iterator to see if there are performance bottlenecks?
Also, there are a couple of components to the snapshot transfer time: waiting in the queue, iterating key spans, rate-limiting, and sending bytes over the wire. I feel like that might not be an apples-to-apples comparison with the hardware limits (I/O, network).
nit: the second trace looks a bit off with 973µs; I am trying to see how to properly format it.
cc @nvanbenschoten @kvoli
From the trace it appears that most of the time is spent in IterateReplicaKeySpans. Is it a good direction to dive deep into that iterator to see if there are performance bottlenecks?
That is a good direction to look into. Feel free to add more timing tags in other places as well, if they prove useful.
nit: the second trace looks a bit off with 973µs; I am trying to see how to properly format it.
That does seem off; is that the configured snapshot rate limit setting?
I upgraded to v24.1 with my custom change to add more timers and iterator stats printouts (https://github.com/cockroachdb/cockroach/pull/124067). The setup is the bank dataset with no traffic; I tuned the replication factor from 3 -> 5, and all snapshots are delegated.
num_replicas = 3 -> 5,
num_voters = 3 -> 5,
constraints = '{+region=us-central: 1 -> 2, +region=us-east-1: 1 -> 2, +region=us-east-2: 1}',
voter_constraints = '{+region=us-central: 1 -> 2, +region=us-east-1: 1 -> 2, +region=us-east-2: 1}'
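For completeness, a hedged sketch of the statement behind the change above; the database name and the exact constraint values are assumptions pieced together from the bank workload and the regions listed:

cockroach sql -e "ALTER DATABASE bank CONFIGURE ZONE USING num_replicas = 5, num_voters = 5, constraints = '{+region=us-central: 2, +region=us-east-1: 2, +region=us-east-2: 1}', voter_constraints = '{+region=us-central: 2, +region=us-east-1: 2, +region=us-east-2: 1}';"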
I got mixed signals.
I observed a gRPC bottleneck for the first time, where a 256 MiB range took ~3 seconds over the wire.
LEARNER with 256 MiB in 3.77s @ 68 MiB/s: kvs=26876 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 1.0 GiB/s, queued: 0.30s
sender tags:
tags:<key:"iter" value:"3.8s" > tags:<key:"send" value:"3s" >
receiver tags:
<key:"sst" value:"865ms" > tags:<key:"sstFinalize" value:"2ms" > tags:<key:"checksum" value:"31ms" > tags:<key:"recv" value:"2.9s" > > start_time:<seconds:1715757398 nanos:209378031 > duration:<seconds:3 nanos:795977518 >
I observed the sender bottleneck on a different node where the LSM tree is slow; it took ~2.3s for the sender/receiver to process the data.
LEARNER with 374 MiB in 3.33s @ 112 MiB/s: kvs=39187 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 1.0 GiB/s, queued: 0.00s
sender tags:
tags:<key:"iter" value:"3.3s" > tags:<key:"send" value:"987ms"
receiver tags:
tags:<key:"totalTime" value:"3.3s" > tags:<key:"sst" value:"2.2s" > tags:<key:"sstFinalize" value:"4ms" > tags:<key:"checksum" value:"64ms" > tags:<key:"recv" value:"1.1s" >
I printed out the iterator stats on the sender side; below is the largest in terms of bytes:
iterator stats: ‹(interface (dir, seek, step): (fwd, 1, 39085), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 1, 39085), (rev, 0, 0)),›
‹(internal-stats: (block-bytes: (total 374MB, cached 32KB, read-time 1.751983268s)), (points: (count 39K, key-bytes 764KB, value-bytes 373MB, tombstoned 0)))
and the new tags show that most of the time is spent iterating point keys:
// Single-threaded scan over the replica's engine keys.
for ok := true; ok && err == nil; ok, err = iter.NextEngineKey() {
	kvs++
	if b == nil {
		// Lazily allocate the next write batch to stream.
		b = kvSS.newWriteBatch()
	}
	key, err := iter.UnsafeEngineKey()
	if err != nil {
		return err
	}
	v, err := iter.UnsafeValue()
	if err != nil {
		return err
	}
	// Copy the key/value into the batch, then flush the batch over the
	// stream once it has grown large enough.
	if err = b.PutEngineKey(key, v); err != nil {
		return err
	}
	if err = maybeFlushBatch(); err != nil {
		return err
	}
}
Just out of curiosity, I ran a network speed test on the host that has the gRPC bottleneck, vs. the snapshot in the CRDB Docker container at 256 MiB / 3s ≈ 85 MiB/s:
root@node31:~# speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Roblox...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Windstream (Chicago, IL) [31.77 km]: 2.298 ms
Testing download speed................................................................................
Download: 4250.59 Mbit/s
Testing upload speed......................................................................................................
Upload: 3305.07 Mbit/s
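For comparison, 3305 Mbit/s of upload bandwidth is roughly 410 MB/s, so the NIC itself sits well above the ~85 MiB/s this transfer achieved; the raw network does not look like the limiting factor here.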
I also ran a disk performance test on the host that has the LSM bottleneck, vs. the read in the CRDB Docker container at 374 MB / 1.75s ≈ 213 MB/s:
root@node65:~# sudo hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 17562 MB in 1.98 seconds = 8861.72 MB/sec
Timing buffered disk reads: 1446 MB in 3.00 seconds = 481.81 MB/sec
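For comparison, the ~213 MB/s effective block-read rate from the iterator stats is less than half of the ~480 MB/s hdparm reports for buffered sequential reads, so a single snapshot iteration does not appear to be saturating this disk either.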
For gRPC throughput I plan to experiment with the following (MaxSendMsgSize and MaxRecvMsgSize are already set to their maximums in cockroachdb):
- InitialWindowSize: set COCKROACH_RPC_INITIAL_WINDOW_SIZE to 64 MB, or even remove the maximum in the code and set it to 1 GB
- set the OS-level socket buffer defaults on two of the machines to observe the impact on throughput (a sketch with concrete values follows the commands below):
sysctl -w net.core.rmem_default
sysctl -w net.core.wmem_default
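A hedged sketch of the OS-level experiment with concrete numbers; the 16 MiB / 32 MiB values are assumptions to test with, not recommendations, and the *_max limits usually need to be raised alongside the defaults:

# assumed test values for the socket buffer experiment
sudo sysctl -w net.core.rmem_default=16777216
sudo sysctl -w net.core.wmem_default=16777216
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.wmem_max=33554432
# window-size experiment from the first bullet (64 MiB in bytes; verify the env var in pkg/rpc)
export COCKROACH_RPC_INITIAL_WINDOW_SIZE=67108864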
For Pebble read performance: I recall the Roblox / CRL performance testing doc from 2023 had an action item to tune the Pebble block cache size. Is this something worth looking into?
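If it is, a minimal sketch of the knob, assuming the stock --cache flag (which sizes the storage-engine block cache) is the right one to tune; .25 of system memory is only an illustration and would need to fit these hosts' memory budget:

cockroach start --cache=.25 <existing flags...>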