kvserver: don't subsume self-blocking latency in scheduler latency
We noticed elevated p99s during scale testing that disappeared when disabling internal timeseries storage. The theory is that the timeseries replicas are quite busy, and that the scheduler latency metric was (unintentionally, I argue) picking up raft handling latencies because it wasn't accounting for the case in which handling a replica was delayed by an in-flight handling of that same replica.
This commit addresses this issue, so that the scheduling latency metrics truly reflect only the scheduling latency.
- [ ] TODO: demonstrate different metrics behavior before/after (hotspot kv workload?)
Fixes #147911
Release note: None
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
:owl: Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Hmm, having trouble showing the benefit of this change. I expected that disabling range splits and running a kv0 workload against the single range, combined with a change that adds a 1ms floor for processing anything on a given range, would prominently surface the self-blocking: the scheduler latency p50 (or at least some of the higher quantiles) should end up in the >1ms range, and this should disappear with the fix commit. But with the setup below, even the p99 stays in the 100µs range, both with and without the fix. I still think the effect should be there, but I don't know what's wrong with the experiment either. Will put this on hold for now.
#!/bin/bash
set -euo pipefail
c=local
(roachprod create -n 1 $c || true) &> /dev/null
roachprod stop $c
./dev build
roachprod put $c cockroach
roachprod start $c:1
roachprod sql $c:1 -- -e 'set cluster setting kv.split_queue.enabled = false;'
roachprod run $c:1 -- ./cockroach workload run kv --init --read-percent 0 --concurrency 10 --max-rate 20 --duration 100s {pgurl:1}
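A back-of-the-envelope check on the experiment parameters (my own reasoning, not from the thread): with `--max-rate 20` against a single range and a 1ms processing floor, the replica is busy only about 2% of the time, so only a small fraction of ready events can arrive while a previous handling is still in flight, which would keep the self-blocking out of the p50 and make it marginal even at p99.

```go
package main

import "fmt"

// utilization returns the fraction of time the replica spends handling,
// given an arrival rate (events/sec) and a per-event processing time (sec).
func utilization(rate, procSeconds float64) float64 {
	return rate * procSeconds
}

func main() {
	// Parameters from the script above: --max-rate 20, plus the injected
	// 1ms processing floor from the [dnm] patch.
	u := utilization(20, 0.001)
	fmt.Printf("utilization: %.1f%%\n", u*100) // utilization: 2.0%
	// Assuming roughly Poisson arrivals, about this fraction of ready
	// events lands while a handling is in flight and would self-block.
	fmt.Printf("~%.0f of 1000 events self-block\n", u*1000)
}
```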
Patch
From 1ddc8c8f879b852e31a0f59a1333ce276f707b3c Mon Sep 17 00:00:00 2001
From: Tobias Grieger <[email protected]>
Date: Tue, 10 Jun 2025 18:54:25 +0200
Subject: [PATCH] [dnm] 1ms sleep
---
pkg/kv/kvserver/replica_raft.go | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index 600a15ce34ba..1004f2fe8da5 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -855,7 +855,6 @@ var noSnap IncomingSnapshot
func (r *Replica) handleRaftReady(
ctx context.Context, inSnap IncomingSnapshot,
) (handleRaftReadyStats, error) {
-
// Don't process anything if this fn returns false.
if fn := r.store.cfg.TestingKnobs.DisableProcessRaft; fn != nil && fn(r.store.StoreID()) {
return handleRaftReadyStats{
@@ -889,6 +888,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
stats = handleRaftReadyStats{
tBegin: crtime.NowMono(),
}
+ time.Sleep(time.Millisecond)
defer func() {
stats.tEnd = crtime.NowMono()
}()
@pav-kv it's annoying that I wasn't able to demonstrate the benefit of the change. I still think it makes sense semantically and would give us a more trustworthy metric. Let's discuss whether to merge this or not - maybe I missed something in how this all works.
@pav-kv PTAL. I added a proper unit test, so I don't have to figure out why I wasn't able to trigger the self-blocking in the end-to-end test setup.