
kvcoord: add DistSender circuit breakers

Open · erikgrinaker opened this issue 1 year ago · 12 comments

This patch adds an initial implementation of DistSender replica circuit breakers. Their primary purpose is to prevent the DistSender from getting stuck on non-functional replicas. In particular, the DistSender relies on receiving an NLHE from the replica to update its range cache and try other replicas; otherwise, it will keep sending requests to the same broken replica, which will continue to get stuck, giving the appearance of an unavailable range. This can happen if:

  • The replica stalls, e.g. with a disk stall or mutex deadlock.

  • Clients time out before the replica lease acquisition attempt times out, e.g. if the replica is partitioned away from the leader.

If a replica has returned only errors in the past few seconds, or hasn't returned any responses at all, the circuit breaker will probe the replica by sending a LeaseInfo request. This must either return success or an NLHE pointing to a leaseholder. Otherwise, the circuit breaker trips, and the DistSender will skip the replica for future requests, optionally also cancelling in-flight requests.

Currently, only replica-level circuit breakers are implemented. If a range is unavailable, the DistSender will continue to retry replicas as it does today. Range-level circuit breakers can be added later if needed, but are considered out of scope here.

The circuit breakers are disabled by default for now. Some follow-up work is likely needed before they can be enabled by default:

  • Improve probe scalability. Currently, a goroutine is spawned per replica probe, which is likely too expensive at large scales. We should consider batching probes to nodes/stores, and using a bounded worker pool.

  • Consider follower read handling, e.g. by tracking the replica's closed timestamp and allowing requests that may still be served by it even if it's partitioned away from the leaseholder.

  • Improve observability, with metrics, tracing, and logging.

  • Comprehensive testing and benchmarking.

This will be addressed separately.

Touches #105168. Touches #104262. Touches #80713.
Epic: none
Release note: None

erikgrinaker · Feb 08 '24 12:02


I decided to remove the intermediate range-level tracking for now, and keep range-level circuit breakers out of scope -- it comes with a cost, and we should aggressively limit scope for 24.1 to the problem at hand (stalled/partitioned replicas). We can deal with range-level circuit breakers later.

erikgrinaker · Feb 12 '24 12:02

I think this is starting to take shape. By default, we don't cancel in-flight requests when the breaker trips, only subsequent requests. Here are some rough benchmarks (single run) after some initial optimization, which show negligible overhead of about 110 ns per request in the common case, even at high concurrency. This incurs a single allocation due to a returned closure, which can possibly be eliminated.

BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1-24          	 9872906	       112.0 ns/op	      48 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=16-24         	 7917506	       142.0 ns/op	      48 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=64-24         	10722211	       114.4 ns/op	      48 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1024-24       	10988108	       110.6 ns/op	      48 B/op	       1 allocs/op

If we enable cancellation of in-flight requests, the cost increases substantially due to the additional context injection and tracking, especially at high concurrency. Some of this can be optimized away, but it gives us an upper bound on what this tracking would cost.

BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1-24           	 3737722	       308.9 ns/op	     144 B/op	       3 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=16-24          	 1580942	       770.1 ns/op	     144 B/op	       2 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=64-24          	 1486292	       810.9 ns/op	     144 B/op	       3 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1024-24        	 1458637	       841.4 ns/op	     144 B/op	       3 allocs/op

For comparison, BenchmarkKV/Scan/Native/rows=1 takes about 41,000 ns/op, so this adds 0.3% and 2% overhead for the non-cancellation and cancellation cases respectively. On an end-to-end kv100 workload this overhead will likely be negligible.

We'll need to decide whether cancellation is necessary or not. It may be possible to get the cost under high concurrency down by about half (to ~400 ns/op), which would give about 1% overhead on BenchmarkKV.

erikgrinaker · Feb 14 '24 12:02

Got rid of the closure allocation.

BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1-24          	14411810	        82.75 ns/op	       0 B/op	       0 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=16-24         	 7982485	       149.6 ns/op	       0 B/op	       0 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=64-24         	10309857	       111.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1024-24       	10863777	       113.9 ns/op	       0 B/op	       0 allocs/op

Might be possible to make the cancel tracking lock-free. However, I think WithCancel() always allocates, and there may not be a way to avoid that -- but a single allocation may also be an acceptable cost as long as we get rid of the mutex. Will revisit later.

BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1-24           	 4122973	       277.1 ns/op	      96 B/op	       2 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=16-24          	 1587454	       763.2 ns/op	      96 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=64-24          	 1487504	       802.7 ns/op	      96 B/op	       2 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1024-24        	 1442265	       835.6 ns/op	      96 B/op	       2 allocs/op

erikgrinaker · Feb 17 '24 11:02

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.


blathers-crl[bot] · Feb 18 '24 14:02

@andrewbaptist @nvanbenschoten This should be ready for initial review now. Hoping to land it this week, so we can proceed with the follow-up work.

The memory overhead of the circuit breaker is about 682 bytes per replica. We only track replicas that have seen traffic in the past 10 minutes. If we consider 100,000 replicas to be an upper bound, that's about 68 MB of memory, which seems acceptable. It's possible to optimize this down if necessary.

BenchmarkDistSenderCircuitBreakersForReplica-24    	  529228	      3030 ns/op	     682 B/op	      13 allocs/op

When circuit breakers are disabled, there is only the overhead of a settings check:

name                                         time/op       allocs/op
DistSenderCircuitBreakersTrack/disabled-24   19.8ns ± 0%   0.00        n=10

When enabling circuit breakers, but disabling cancellation of in-flight requests, we have fairly minor overhead on the happy path even at high concurrency.

name                                                                           time/op      allocs/op
DistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1-24       108ns ± 0%   0.00
DistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=16-24      224ns ± 3%   0.00
DistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=64-24      154ns ± 3%   0.00
DistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1024-24    151ns ± 2%   0.00

Enabling cancellation of in-flight requests incurs two allocations for the new context and two exclusive mutex locks to (un)track it. This may not actually matter in end-to-end benchmarks (I'll give kv95 a try tomorrow), but it's also possible that we can optimize some of this down.

name                                                                           time/op      allocs/op
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1-24        297ns ± 1%   2.00
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=16-24       791ns ± 2%   2.00
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=64-24       838ns ± 3%   2.00
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1024-24     869ns ± 2%   2.00

erikgrinaker · Feb 18 '24 18:02

kv95 benchmarks are inconclusive due to noise. I think it's probably fair to say that even the cancel tracking doesn't make a significant difference here, considering the p50 is at 1.12ms and we're looking at sub-µs penalties. I'll see if I can run some more conclusive benchmarks with larger sample sizes.

name                                       ops/sec    p50        p95        p99
kv95/enc=false/nodes=3/cpu=32/cb=disabled  138k ± 7%  1.12 ±11%  3.38 ± 9%  5.40 ± 7%  n=5
kv95/enc=false/nodes=3/cpu=32/cb=enabled   139k ± 8%  1.12 ±11%  3.36 ± 8%  5.40 ± 7%  n=5
kv95/enc=false/nodes=3/cpu=32/cb=cancel    141k ± 8%  1.08 ±11%  3.34 ±11%  5.30 ± 6%  n=5

erikgrinaker · Feb 19 '24 12:02

I did another run of YCSB/C (read-only Zipfian), which indicates no significant difference between circuit breakers disabled and enabled with cancellation (the most expensive variant):

name            old ops/sec  new ops/sec  delta
ycsb/C/nodes=3   42.8k ± 5%   43.4k ± 2%    ~     (p=0.200 n=9+8)

name            old p50      new p50      delta
ycsb/C/nodes=3    3.00 ± 0%    2.97 ± 2%    ~     (p=0.071 n=6+10)

name            old p95      new p95      delta
ycsb/C/nodes=3    8.00 ± 5%    7.90 ± 0%    ~     (p=0.254 n=8+6)

name            old p99      new p99      delta
ycsb/C/nodes=3    13.6 ± 4%    13.2 ± 3%  -2.73%  (p=0.024 n=9+10)

erikgrinaker · Feb 19 '24 15:02

I'm leaning towards enabling cancellation unconditionally, and did a quick benchmark where we only use a mutex for synchronization, dropping the atomics. We're not really saving too much by doing so, but it simplifies the code at least.

name                                                                        time/op
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1-24     284ns ± 2%
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=16-24    710ns ± 3%
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=64-24    807ns ± 3%
DistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1024-24  834ns ± 4%

erikgrinaker · Feb 19 '24 18:02

Can you try running BenchmarkDistSenderSunnyDay as well? This is the benchmark focused on just the DistSender. I'll do a more thorough review of this tomorrow.

andrewbaptist · Feb 20 '24 16:02

Can you try running BenchmarkDistSenderSunnyDay as well?

I only did 3 runs since the benchmarks were pretty slow, so I used -delta-test none. Overhead is about as expected, except for the high-count non-concurrent cases, which show much higher overhead than the microbenchmarks. I'll have to look into what these benchmarks do.

This compares circuit breakers disabled (old) against the atomics-based implementation (new):

name                                           old time/op  new time/op  delta
DistSenderSunnyDay/rf-3/count-1-conc-24        1.18µs ± 3%  1.19µs ± 1%   +0.48%
DistSenderSunnyDay/rf-3/count-1-24             3.37µs ± 0%  3.51µs ± 1%   +4.16%
DistSenderSunnyDay/rf-3/count-1000-conc-24     1.19µs ± 1%  1.21µs ± 1%   +1.12%
DistSenderSunnyDay/rf-3/count-1000-24          4.00µs ± 0%  4.67µs ± 1%  +16.72%
DistSenderSunnyDay/rf-3/count-100000-conc-24   1.21µs ± 1%  1.28µs ± 1%   +5.85%
DistSenderSunnyDay/rf-3/count-100000-24        4.73µs ± 1%  6.00µs ± 1%  +26.95%
DistSenderSunnyDay/rf-5/count-1-conc-24        1.20µs ± 1%  1.24µs ± 2%   +3.57%
DistSenderSunnyDay/rf-5/count-1-24             3.55µs ± 0%  3.72µs ± 0%   +4.87%
DistSenderSunnyDay/rf-5/count-1000-conc-24     1.21µs ± 1%  1.25µs ± 1%   +3.08%
DistSenderSunnyDay/rf-5/count-1000-24          4.34µs ± 0%  5.03µs ± 0%  +15.91%
DistSenderSunnyDay/rf-5/count-100000-conc-24   1.22µs ± 1%  1.28µs ± 1%   +4.37%
DistSenderSunnyDay/rf-5/count-100000-24        5.01µs ± 1%  6.36µs ± 1%  +26.84%
DistSenderSunnyDay/rf-11/count-1-conc-24       1.26µs ± 0%  1.29µs ± 0%   +2.00%
DistSenderSunnyDay/rf-11/count-1-24            4.62µs ± 1%  4.77µs ± 0%   +3.33%
DistSenderSunnyDay/rf-11/count-1000-conc-24    1.28µs ± 0%  1.32µs ± 1%   +2.67%
DistSenderSunnyDay/rf-11/count-1000-24         5.38µs ± 0%  6.26µs ± 2%  +16.47%
DistSenderSunnyDay/rf-11/count-100000-conc-24  1.28µs ± 2%  1.37µs ± 2%   +7.21%
DistSenderSunnyDay/rf-11/count-100000-24       6.01µs ± 1%  7.35µs ± 2%  +22.31%

This compares the atomics-based implementation (old) against atomics plus cancellation (new):

name                                           old time/op  new time/op  delta
DistSenderSunnyDay/rf-3/count-1-conc-24        1.19µs ± 1%  1.25µs ± 1%   +5.70%
DistSenderSunnyDay/rf-3/count-1-24             3.51µs ± 1%  3.72µs ± 1%   +5.92%
DistSenderSunnyDay/rf-3/count-1000-conc-24     1.21µs ± 1%  1.24µs ± 1%   +2.37%
DistSenderSunnyDay/rf-3/count-1000-24          4.67µs ± 1%  5.20µs ± 0%  +11.25%
DistSenderSunnyDay/rf-3/count-100000-conc-24   1.28µs ± 1%  1.28µs ± 1%   -0.21%
DistSenderSunnyDay/rf-3/count-100000-24        6.00µs ± 1%  6.54µs ± 1%   +9.01%
DistSenderSunnyDay/rf-5/count-1-conc-24        1.24µs ± 2%  1.27µs ± 2%   +2.64%
DistSenderSunnyDay/rf-5/count-1-24             3.72µs ± 0%  3.92µs ± 1%   +5.41%
DistSenderSunnyDay/rf-5/count-1000-conc-24     1.25µs ± 1%  1.29µs ± 2%   +3.07%
DistSenderSunnyDay/rf-5/count-1000-24          5.03µs ± 0%  5.49µs ± 0%   +9.10%
DistSenderSunnyDay/rf-5/count-100000-conc-24   1.28µs ± 1%  1.32µs ± 2%   +3.24%
DistSenderSunnyDay/rf-5/count-100000-24        6.36µs ± 1%  6.83µs ± 1%   +7.44%
DistSenderSunnyDay/rf-11/count-1-conc-24       1.29µs ± 0%  1.34µs ± 1%   +3.96%
DistSenderSunnyDay/rf-11/count-1-24            4.77µs ± 0%  5.02µs ± 1%   +5.08%
DistSenderSunnyDay/rf-11/count-1000-conc-24    1.32µs ± 1%  1.33µs ± 1%   +0.68%
DistSenderSunnyDay/rf-11/count-1000-24         6.26µs ± 2%  6.61µs ± 1%   +5.55%
DistSenderSunnyDay/rf-11/count-100000-conc-24  1.37µs ± 2%  1.41µs ± 1%   +2.40%
DistSenderSunnyDay/rf-11/count-100000-24       7.35µs ± 2%  8.14µs ± 3%  +10.72%

erikgrinaker · Feb 22 '24 08:02

@nvanbenschoten I added the context timeout check we discussed, such that we won't set up a cancellable context if the context already has a timeout below the probe threshold + timeout.

I also enabled both the circuit breakers and cancellation by default.

erikgrinaker · Feb 22 '24 09:02

Verified that this recovers failover/non-system/deadlock/lease=expiration in less than 10 seconds.

erikgrinaker · Mar 04 '24 13:03

Merging on behalf of Erik.

bors r+

nvanbenschoten · Mar 04 '24 22:03

Build succeeded.

craig[bot] · Mar 04 '24 23:03