
roachtest: kv/restart/nodes=12 failed

Open • cockroach-teamcity opened this issue 6 months ago • 2 comments

roachtest.kv/restart/nodes=12 failed with artifacts on master @ 5bb5e004b5ad8bf7d6436466ba998251fa2f9117:

(monitor.go:149).Wait: monitor failure: QPS of 1421.80 at time 2025-06-14 16:50:22.265588386 +0000 UTC m=+39979.775524043 is below minimum allowable QPS of 2250.00
test artifacts and logs in: /artifacts/kv/restart/nodes=12/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-51550

cockroach-teamcity • Jun 14 '25

This is an admission control test—I think we still need to update a mapping somewhere that maps admission control to T-storage.

This test stops a node and then restarts it after 10m. When the node comes back, we see a disruption.

Image
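
The monitor failure at the top of the issue is just a threshold check on sampled throughput. A minimal sketch of that condition is below; the qpsSample type and checkQPS helper are made up for illustration and are not the actual roachtest code.

package main

import (
	"fmt"
	"time"
)

// qpsSample is a hypothetical throughput sample of the kind the test's
// monitor takes while the kv workload runs across the stop/restart.
type qpsSample struct {
	ts  time.Time
	qps float64
}

// checkQPS mirrors the shape of the monitor failure quoted above: any sample
// below the minimum allowable QPS fails the test.
func checkQPS(s qpsSample, minQPS float64) error {
	if s.qps < minQPS {
		return fmt.Errorf("QPS of %.2f at time %s is below minimum allowable QPS of %.2f",
			s.qps, s.ts.UTC().Format(time.RFC3339), minQPS)
	}
	return nil
}

func main() {
	// Values taken from the failure message at the top of this issue.
	s := qpsSample{ts: time.Date(2025, 6, 14, 16, 50, 22, 0, time.UTC), qps: 1421.80}
	if err := checkQPS(s, 2250.00); err != nil {
		fmt.Println("monitor failure:", err)
	}
}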

jbowens • Jun 18 '25

Opened a PR to update the labels: https://github.com/cockroachdb/cockroach/pull/148513

RaduBerinde • Jun 18 '25

DD tsdump is here, but I get lots of bogus-looking data, so I'm using Grafana.

Looks like just after n12 comes back from its scheduled downtime, n2 has a few (maybe just one) extremely slow commits. It's very clear in the logs, but I needed to go up to the 99.99th percentile to make it show up in the histogram:

Image
[...]
cockroachI250614 16:50:21.111136 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r2078/3:/Table/106/1/‹-253›{‹860›…-‹768›…},raft] 9699  slow non-blocking raft commit: commit-wait 3.99084665s sem 140ns
cockroachI250614 16:50:21.111157 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r15572/2:/Table/106/1/‹-89›{‹6005›…-‹5083›…},raft] 9700  slow non-blocking raft commit: commit-wait 3.995607989s sem 265ns
cockroachI250614 16:50:21.111173 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r5342/5:/Table/106/1/‹916›{‹2500›…-‹3423›…},raft] 9701  slow non-blocking raft commit: commit-wait 3.985876861s sem 285ns
cockroachI250614 16:50:21.111192 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r3078/3:/Table/106/1/‹893›{‹3772›…-‹4694›…},raft] 9702  slow non-blocking raft commit: commit-wait 3.984881882s sem 295ns
cockroachI250614 16:50:21.111205 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r5196/1:/Table/106/1/‹13›{‹0504›…-‹1426›…},raft] 9703  slow non-blocking raft commit: commit-wait 3.990149484s sem 152ns
cockroachI250614 16:50:21.111222 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r4378/3:/Table/106/1/‹353›{‹4680›…-‹5602›…},raft] 9704  slow non-blocking raft commit: commit-wait 3.995029462s sem 266ns
cockroachI250614 16:50:21.111236 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r5210/3:/Table/106/1/‹80›{‹5621›…-‹6543›…},raft] 9705  slow non-blocking raft commit: commit-wait 3.990175875s sem 266ns
cockroachI250614 16:50:21.111249 698 kv/kvserver/replica_raft.go:1784 ⋮ [T1,Vsystem,n2,s2,r5354/1:/Table/106/1/‹73›{‹1976›…-‹2068›…},raft] 9706  slow non-blocking raft commit: commit-wait 3.984902987s sem 340ns
cockroachW250614 16:50:21.117505 630 kv/kvserver/spanlatch/manager.go:677 ⋮ [T1,Vsystem,n2,s2,r1/2:/{Min-System/NodeL…},raft] 9707  Put [/Meta2/Table/106/1/‹8202395801186716072›], [txn: ddf2a31d] has held latch for 4s. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.

Surprisingly to me, the storage_wal_fsync_latency histogram does not show a corresponding spike, not even at the max:

Image

I've also looked at the pebble logs. Since this is a write-heavy workload, pebble is constantly compacting, writing SSTs, and deleting old ones, but it doesn't look too different from regular operation, and in particular there's no write stall.

I still have to assume that this is something inside pebble, since multiple goroutines (replicas) independently log these slow commits, and since the pebble logs make continuous progress (deleting SSTs) in the meantime (so it's not a fat scheduler stall). The close-to-4s values logged in the message above come from (Batch).CommitStats.
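
For reference, here is a minimal sketch of surfacing those per-commit latencies via Pebble's (Batch).CommitStats. It uses an in-memory store and an arbitrary 100ms threshold for illustration, and the TotalDuration field name is assumed from memory of the Pebble API rather than taken from this repo (the actual slow-commit logging lives in replica_raft.go).

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// Open an in-memory Pebble instance so the sketch is self-contained.
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	b := db.NewBatch()
	defer b.Close()
	if err := b.Set([]byte("k"), []byte("v"), nil); err != nil {
		log.Fatal(err)
	}
	if err := b.Commit(pebble.Sync); err != nil {
		log.Fatal(err)
	}

	// CommitStats breaks the commit latency into phases (semaphore wait,
	// WAL fsync, memtable stalls, commit/publish wait, ...); the "commit-wait"
	// and "sem" numbers in the log lines above are two of those phases.
	stats := b.CommitStats()
	if stats.TotalDuration > 100*time.Millisecond { // threshold chosen arbitrarily
		fmt.Printf("slow commit: %+v\n", stats)
	} else {
		fmt.Printf("commit stats: %+v\n", stats)
	}
}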

Also, we failed a node liveness heartbeat on the engine sync, so all paths into storage seem to have blocked, not just the raft log commit path:

cockroachW250614 16:50:20.972934 712 kv/kvserver/liveness/liveness.go:673 ⋮ [T1,Vsystem,n2,liveness-hb] 9576  failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 3s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:0›: context deadline exceeded
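
The "interrupted during singleflight engine sync: context deadline exceeded" wording suggests the heartbeat's disk write coalesces its sync with other callers and gives up when its own 3s deadline expires while the shared sync is still in flight. A rough sketch of that shape follows, using golang.org/x/sync/singleflight and a sleep as a stand-in for the slow fsync; CockroachDB's actual implementation and error wrapping may differ.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// syncEngine coalesces concurrent sync requests onto a single in-flight
// fsync, but lets each caller bail out when its own context expires.
func syncEngine(ctx context.Context) error {
	ch := group.DoChan("engine sync", func() (interface{}, error) {
		time.Sleep(4 * time.Second) // stand-in for the ~4s slow sync seen on n2
		return nil, nil
	})
	select {
	case res := <-ch:
		return res.Err
	case <-ctx.Done():
		return fmt.Errorf("interrupted during singleflight engine sync: %w", ctx.Err())
	}
}

func main() {
	// The liveness heartbeat gives its disk write a 3s timeout, so a ~4s
	// engine sync surfaces as a failed heartbeat even though the sync itself
	// eventually completes.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if err := syncEngine(ctx); err != nil && errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("failed node liveness heartbeat:", err)
	}
}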

Reached out to the storage team here.

Removing release blocker since short single blips can sometimes occur.

tbg • Jun 20 '25

There is a write stall on n12 (maybe you looked at the count metric, where it was missed -- the duration metric is the preferred one).

Image

The p100 of fsync latency has a small spike to ~100ms on n12, and disk write bandwidth is high for some interval while it catches up.

Image

There are QRFC send-queues developing, so we are pacing work to n12.

Image

AC is not very good at estimating tokens to avoid write stalls, especially if there are short-lived write-stall blips, which would explain this.
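
To make the "admitting X MiB (rate Y MiB/s)" numbers in the io_load_listener logs quoted in the next comment a bit more concrete, here is a deliberately simplified, hypothetical sketch of the token idea: per 15s interval, grant byte tokens for L0 admission roughly in proportion to a smoothed estimate of what compactions drained out of L0. A short write-stall blip isn't reflected in that estimate, which is the gap being pointed at above. The adjustTokens helper and the exact smoothing are illustrative, not the real io_load_listener algorithm.

package main

import "fmt"

// adjustTokens is a simplified, hypothetical take on the io_load_listener
// idea: smooth the rate at which compactions remove bytes from L0 and, when
// L0 is overloaded, only admit roughly that many bytes over the next 15s.
// Write-stall blips are not modeled here.
func adjustTokens(prevSmoothed, curCompacted int64, overloaded bool) (smoothed, tokens int64) {
	// Exponential smoothing; the "[≈470 MiB]" figure in the first n12 log
	// below is a smoothed value of this kind.
	smoothed = (prevSmoothed + curCompacted) / 2
	if !overloaded {
		return smoothed, -1 // -1: no limit ("admitting all" in the n2 logs)
	}
	return smoothed, smoothed
}

func main() {
	const mib = int64(1) << 20
	// Roughly the first n12 interval below: 941 MiB compacted, little history.
	smoothed, tokens := adjustTokens(0, 941*mib, true /* overloaded */)
	fmt.Printf("smoothed %d MiB; admitting %d MiB over 15s (~%d MiB/s)\n",
		smoothed/mib, tokens/mib, tokens/mib/15)
}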

sumeerbhola • Jun 20 '25

The initial write stalls show up in these io_load_listener logs ("write stalls 15" followed by "write stalls 4"):

I250614 16:47:49.996580 1675 util/admission/io_load_listener.go:778 ⋮ [T1,Vsystem,n12,s12] 2384 IO overload: compaction score 0.400 (233 ssts, 8 sub-levels), L0 growth 1.3 GiB (write 1.3 GiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 227287 (120243 bypassed) with 863 MiB acc-write (12 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 1.3 GiB adjusted-LSM-writes + 7.0 GiB adjusted-disk-writes + write-model 1.56x+1 B (smoothed 1.66x+1 B) + l0-ingest-model 0.00x+0 B (smoothed 0.75x+1 B) + ingest-model 0.00x+0 B (smoothed 1.00x+1 B) + write-amp-model 5.30x+1 B (smoothed 27.90x+1 B) + at-admission-tokens 1 B, compacted 941 MiB [≈470 MiB], flushed 1.5 GiB [≈1.5 GiB] (mult 0.92); admitting 491 MiB (rate 33 MiB/s) (elastic 235 MiB rate 16 MiB/s) due to L0 growth‹(used token lower bound)› (used total: 0 B elastic 0 B); write stalls 15; diskBandwidthLimiter‹ (unlimited)› (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B), read 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)

I250614 16:48:04.999578 1675 util/admission/io_load_listener.go:778 ⋮ [T1,Vsystem,n12,s12] 7383 IO overload: compaction score 0.350 (78 ssts, 7 sub-levels), L0 growth 798 MiB (write 798 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 94162 (66015 bypassed) with 230 MiB acc-write (6.8 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 798 MiB adjusted-LSM-writes + 6.9 GiB adjusted-disk-writes + write-model 3.00x+1.2 KiB (smoothed 2.33x+595 B) + l0-ingest-model 0.00x+0 B (smoothed 0.75x+1 B) + ingest-model 0.00x+0 B (smoothed 1.00x+1 B) + write-amp-model 8.91x+1 B (smoothed 18.41x+1 B) + at-admission-tokens 1 B, compacted 1.1 GiB [≈775 MiB], flushed 1.5 GiB [≈1.5 GiB] (mult 0.87); admitting 957 MiB (rate 64 MiB/s) (elastic 485 MiB rate 32 MiB/s) due to L0 growth‹(used token lower bound)› (used total: 236 MiB elastic 209 MiB); write stalls 4; diskBandwidthLimiter‹ (unlimited)› (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B), read 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)

sumeerbhola • Jun 20 '25

n12 is expected to be overloaded since it's coming back from 10 min of downtime, but the problem that breaks the test is slow pebble on n2. The load listener logs show no write stalls or severe overload on n2:

cockroachI250614 16:45:00.144372 718 util/admission/io_load_listener.go:778 ⋮ [T1,Vsystem,n2,s2] 9244  IO overload: compaction score 0.050 (9 ssts, 1 sub-levels), L0 growth 282 MiB (write 282 MiB (ignored 0 B) ingest 6.1 KiB (ignored 1.4 MiB)): requests 36349 (17514 bypassed) with 144 MiB acc-write (4.4 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 282 MiB adjusted-LSM-writes + 2.3 GiB adjusted-disk-writes + write-model 1.96x+1 B (smoothed 1.94x+1 B) + l0-ingest-model 0.00x+0 B (smoothed 0.75x+1 B) + ingest-model 0.00x+0 B (smoothed 1.00x+1 B) + write-amp-model 8.54x+1 B (smoothed 8.60x+1 B) + at-admission-tokens 210 B, compacted 286 MiB [≈280 MiB], flushed 3.4 GiB [≈0 B] (mult 1.00); admitting all (used total: 276 MiB elastic 47 KiB); write stalls 0; diskBandwidthLimiter‹ (unlimited)› (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B), read 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)
cockroachI250614 16:51:00.149861 718 util/admission/io_load_listener.go:778 ⋮ [T1,Vsystem,n2,s2] 10507  IO overload: compaction score 0.100 (25 ssts, 2 sub-levels), L0 growth 207 MiB (write 207 MiB (ignored 0 B) ingest 1.0 KiB (ignored 5.7 MiB)): requests 42706 (24705 bypassed) with 110 MiB acc-write (5.4 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 207 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 1.89x+1 B (smoothed 1.92x+1 B) + l0-ingest-model 0.00x+0 B (smoothed 0.75x+1 B) + ingest-model 0.00x+0 B (smoothed 1.00x+1 B) + write-amp-model 5.63x+1 B (smoothed 8.38x+1 B) + at-admission-tokens 90 B, compacted 169 MiB [≈225 MiB], flushed 3.2 GiB [≈0 B] (mult 1.00); admitting elastic 281 MiB (rate 19 MiB/s) due to L0 growth (used total: 214 MiB elastic 2.3 MiB); write stalls 0; diskBandwidthLimiter‹ (unlimited)› (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B), read 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)
cockroachI250614 16:51:15.150080 718 util/admission/io_load_listener.go:778 ⋮ [T1,Vsystem,n2,s2] 10650  IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 222 MiB (write 222 MiB (ignored 0 B) ingest 6.1 KiB (ignored 6.5 MiB)): requests 43284 (25540 bypassed) with 124 MiB acc-write (5.5 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 222 MiB adjusted-LSM-writes + 3.2 GiB adjusted-disk-writes + write-model 1.79x+1 B (smoothed 1.86x+1 B) + l0-ingest-model 0.00x+0 B (smoothed 0.75x+1 B) + ingest-model 0.00x+0 B (smoothed 1.00x+1 B) + write-amp-model 14.58x+1 B (smoothed 11.48x+1 B) + at-admission-tokens 50 B, compacted 263 MiB [≈244 MiB], flushed 3.1 GiB [≈0 B] (mult 1.00); admitting all (used total: 238 MiB elastic 253 KiB); write stalls 0; diskBandwidthLimiter‹ (unlimited)› (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B), read 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)
cockroachI250614 16:55:45.157057 718 util/admission/io_load_listener.go:778 ⋮ [T1,Vsystem,n2,s2] 10930  IO overload: compaction score 0.100 (36 ssts, 2 sub-levels), L0 growth 291 MiB (write 291 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 38540 (20322 bypassed) with 144 MiB acc-write (4.9 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 291 MiB adjusted-LSM-writes + 2.9 GiB adjusted-disk-writes + write-model 2.02x+1 B (smoothed 2.00x+1 B) + l0-ingest-model 0.00x+0 B (smoothed 0.75x+1 B) + ingest-model 0.00x+0 B (smoothed 1.00x+1 B) + write-amp-model 10.15x+1 B (smoothed 11.29x+1 B) + at-admission-tokens 43 B, compacted 235 MiB [≈263 MiB], flushed 3.4 GiB [≈0 B] (mult 1.00); admitting elastic 329 MiB (rate 22 MiB/s) due to L0 growth (used total: 285 MiB elastic 0 B); write stalls 0; diskBandwidthLimiter‹ (unlimited)› (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B), read 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)
cockroachI250614 16:56:00.157342 718 util/admission/io_load_listener.go:778 ⋮ [T1,Vsystem,n2,s2] 10938  IO overload: compaction score 0.100 (62 ssts, 2 sub-levels), L0 growth 289 MiB (write 289 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 38503 (20064 bypassed) with 145 MiB acc-write (4.8 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 289 MiB adjusted-LSM-writes + 2.6 GiB adjusted-disk-writes + write-model 1.99x+1 B (smoothed 1.99x+1 B) + l0-ingest-model 0.00x+0 B (smoothed 0.75x+1 B) + ingest-model 0.00x+0 B (smoothed 1.00x+1 B) + write-amp-model 9.16x+1 B (smoothed 10.22x+1 B) + at-admission-tokens 73 B, compacted 250 MiB [≈256 MiB], flushed 3.1 GiB [≈0 B] (mult 1.00); admitting elastic 321 MiB (rate 21 MiB/s) due to L0 growth (used total: 290 MiB elastic 0 B); write stalls 0; diskBandwidthLimiter‹ (unlimited)› (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B), read 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)

tbg • Jun 20 '25

There is discussion in https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1750447240104009?thread_ts=1750420169.044609&cid=CAC6K3SLU about figuring out why n2 was slow.

Summary: there is a gap during an ingest operation, which probably fell after the ingest had been assigned a seqNum; that blocked other writes from becoming visible. The ingest operation does IO after the seqNum has been assigned (in ingestApply), which involves reading existing sstables and writing to the MANIFEST. The file deletes happening in that gap may have caused some unexpected slowness.
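
To illustrate why that gap stalls other writes, here is a toy model of in-order visibility (not Pebble's actual commit pipeline; the op type and visibleSeqNum helper are hypothetical): readers only see up to the highest published seqNum, and publication is in order, so an ingest that is still doing IO after its seqNum was assigned holds back every later, already-applied write.

package main

import "fmt"

// op models a write that has been assigned a sequence number but may not
// have finished applying yet.
type op struct {
	seqNum  uint64
	applied bool
}

// visibleSeqNum returns the highest seqNum readers may observe: publication
// is in order, so one unapplied op (e.g. an ingest still reading sstables or
// writing the MANIFEST in ingestApply) blocks everything queued behind it.
func visibleSeqNum(pending []op) uint64 {
	var visible uint64
	for _, o := range pending {
		if !o.applied {
			break
		}
		visible = o.seqNum
	}
	return visible
}

func main() {
	pending := []op{
		{seqNum: 100, applied: true},
		{seqNum: 101, applied: false}, // the ingest: seqNum assigned, IO still in progress
		{seqNum: 102, applied: true},  // a later write stuck behind it
		{seqNum: 103, applied: true},
	}
	fmt.Println("visible seqNum:", visibleSeqNum(pending)) // prints 100 until the ingest applies
}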

sumeerbhola • Jun 20 '25