
kvserver: don't subsume self-blocking latency in scheduler latency

Open · tbg opened this issue 6 months ago · 3 comments

We noticed elevated p99s during scale testing that disappeared when internal timeseries storage was disabled. The theory is that the timeseries replicas are quite busy, and that the scheduler latency metric was (unintentionally, I'd argue) picking up raft handling latencies, because it wasn't accounting for the case in which handling of a replica was delayed by an in-flight handling of that same replica.

This commit addresses that, so that the scheduling latency metric truly reflects only the scheduling latency.
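
For intuition, here is a minimal, self-contained sketch of the accounting change (hypothetical names throughout; this is not the actual kvserver scheduler code): when a replica is enqueued while a previous handling of it is still in flight, the latency clock should start when that handling finishes, not when the replica was enqueued.

package main

import (
	"fmt"
	"time"
)

// event models a replica sitting on the scheduler's queue.
type event struct {
	enqueued    time.Time // when the replica was enqueued
	blockedTill time.Time // zero, or when an in-flight handling of the same
	//                       replica finished; before that point no worker
	//                       could have picked it up anyway
}

// schedLatency returns only the latency attributable to the scheduler.
func schedLatency(ev event, pickedUp time.Time) time.Duration {
	from := ev.enqueued
	// The fix: don't count time during which the replica was blocked on
	// its own in-flight handling.
	if ev.blockedTill.After(from) {
		from = ev.blockedTill
	}
	return pickedUp.Sub(from)
}

func main() {
	now := time.Now()
	ev := event{
		enqueued:    now,
		blockedTill: now.Add(5 * time.Millisecond), // self-blocked for 5ms
	}
	pickedUp := now.Add(6 * time.Millisecond) // worker grabs it at +6ms
	// The old accounting would report 6ms; the fixed accounting reports 1ms.
	fmt.Println(schedLatency(ev, pickedUp))
}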

  • [ ] TODO: demonstrate different metrics behavior before/after (hotspot kv workload?)

Fixes #147911

Release note: None

tbg · Jun 06 '25 13:06

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] · Jun 06 '25 13:06

This change is Reviewable

cockroach-teamcity · Jun 06 '25 13:06

Hmm, I'm having trouble showing the benefit of this change. I thought that if I disabled range splits and ran a kv0 workload against the single range, alongside a change that adds a 1ms floor to processing anything on a given range, the self-blocking would show up prominently: the scheduler latency p50 (or at least some of the higher quantiles) should end up in the >1ms range, and this should disappear with the fix commit. But with the setup below, even the p99 is in the 100us range, both with and without the fix. I still think the effect should be there, but I don't know what's wrong with the experiment either. Will put this on hold for now.

#!/bin/bash
set -euo pipefail

c=local
# Create a single-node local cluster (ignore the error if it already exists).
(roachprod create -n 1 $c || true) &> /dev/null
roachprod stop $c
./dev build
roachprod put $c cockroach
roachprod start $c:1
# Disable the split queue so the entire workload hits a single range.
roachprod sql $c:1 -- -e 'set cluster setting kv.split_queue.enabled = false;'
# kv0: write-only workload, 10 workers capped at 20 ops/sec total.
roachprod run $c:1 -- ./cockroach workload run kv --init --read-percent 0 --concurrency 10 --max-rate 20 --duration 100s {pgurl:1}

Patch

From 1ddc8c8f879b852e31a0f59a1333ce276f707b3c Mon Sep 17 00:00:00 2001
From: Tobias Grieger <[email protected]>
Date: Tue, 10 Jun 2025 18:54:25 +0200
Subject: [PATCH] [dnm] 1ms sleep

---
 pkg/kv/kvserver/replica_raft.go | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index 600a15ce34ba..1004f2fe8da5 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -855,7 +855,6 @@ var noSnap IncomingSnapshot
 func (r *Replica) handleRaftReady(
 	ctx context.Context, inSnap IncomingSnapshot,
 ) (handleRaftReadyStats, error) {
-
 	// Don't process anything if this fn returns false.
 	if fn := r.store.cfg.TestingKnobs.DisableProcessRaft; fn != nil && fn(r.store.StoreID()) {
 		return handleRaftReadyStats{
@@ -889,6 +888,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 	stats = handleRaftReadyStats{
 		tBegin: crtime.NowMono(),
 	}
+	time.Sleep(time.Millisecond)
 	defer func() {
 		stats.tEnd = crtime.NowMono()
 	}()

tbg · Jun 11 '25 09:06

@pav-kv it's annoying that I wasn't able to demonstrate the benefit of the change. I still think that, semantically, this makes sense and would give us a more trustworthy metric. Let's discuss whether to merge this or not; maybe I missed something in how this all works.

tbg · Jun 24 '25 09:06

@pav-kv PTAL. I added a proper unit test, so I don't have to figure out why I wasn't able to trigger the self-blocking in the end-to-end test setup.
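
For the record, here is roughly the shape such a test can take, written against the hypothetical schedLatency sketch from the issue description above (the actual unit test lives in pkg/kv/kvserver and will differ): enqueue a replica while a handling of it is in flight, then assert that the recorded latency excludes the self-blocked window.

package main

import (
	"testing"
	"time"
)

// TestSelfBlockingExcluded asserts that time spent waiting on an in-flight
// handling of the same replica is not counted as scheduler latency.
func TestSelfBlockingExcluded(t *testing.T) {
	base := time.Unix(0, 0)
	ev := event{
		enqueued:    base,
		blockedTill: base.Add(10 * time.Millisecond), // self-blocked for 10ms
	}
	pickedUp := base.Add(11 * time.Millisecond)
	// Only the 1ms between the end of the in-flight handling and the worker
	// picking the replica up should be attributed to the scheduler.
	if got, want := schedLatency(ev, pickedUp), time.Millisecond; got != want {
		t.Fatalf("scheduler latency = %v, want %v", got, want)
	}
}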

tbg · Jun 30 '25 11:06