kvcoord: fix extremely rare OOM hazard in the DistSender

Open arulajmani opened this issue 1 year ago • 1 comments

The outer loop in the DistSender subdivides a BatchRequest into partial batches, each corresponding to a single range, and sends them out. Previously, the resolved span passed into the function responsible for sending out a partial batch (sendPartialBatch) corresponded to the span of the entire batch request, not the span of the partial batch. This resolved span is used between retries, in the outer loop of the DistSender, to check whether the request needs to be subdivided further. Given the wrong parameter value, we'd always determine that the batch needed to be subdivided.

This wasn't really an issue in practice, as we don't expect many retries here. However, if retries did pile up (e.g., timeouts causing the transport to be exhausted on every attempt), the resulting unbounded recursion could lead to an OOM.
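A minimal sketch of the check described above (names and types are simplified stand-ins, not the actual DistSender code): the subdivision decision compares the resolved span against the range the partial batch was routed to, so passing the whole batch's span makes the check fail on every retry.

```go
package main

import "fmt"

// Span is a simplified key span; the real code uses roachpb.RSpan.
type Span struct{ Key, EndKey string }

// rangeDescriptor stands in for the range the partial batch was routed to.
type rangeDescriptor struct{ StartKey, EndKey string }

// containsSpan reports whether the range fully covers the span. In the
// retry loop, this decides whether the retried partial batch still fits a
// single range or must be subdivided again.
func (d rangeDescriptor) containsSpan(s Span) bool {
	return d.StartKey <= s.Key && s.EndKey <= d.EndKey
}

func main() {
	desc := rangeDescriptor{StartKey: "b", EndKey: "c"}

	// The bug: the span of the *entire* BatchRequest was passed in, so the
	// single-range check always failed and the batch was subdivided again
	// on every retry, recursing without bound.
	wholeBatchSpan := Span{Key: "a", EndKey: "z"}
	fmt.Println(desc.containsSpan(wholeBatchSpan)) // false: always re-subdivide

	// The fix: pass the span of the partial batch actually routed to this
	// range, so a retry that still fits the range does not recurse.
	partialSpan := Span{Key: "b", EndKey: "c"}
	fmt.Println(desc.containsSpan(partialSpan)) // true: retry in place
}
```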

Closes #87167

Release note: None

arulajmani avatar Sep 22 '22 20:09 arulajmani

This change is Reviewable

cockroach-teamcity avatar Sep 22 '22 20:09 cockroach-teamcity

Great find! Just for my understanding, in https://github.com/cockroachdb/cockroach/issues/87167 did we actually hit the "extremely rare OOM hazard"? We do seem to get a very large stack which points to yes, but then also there wasn't an extremely deep DistSender stack there.

tbg avatar Sep 23 '22 07:09 tbg

Yeah, great find @arulajmani, thanks for tracking this down!

in #87167 did we actually hit the "extremely rare OOM hazard"? We do seem to get a very large stack which points to yes, but then also there wasn't an extremely deep DistSender stack there.

I'm pretty sure that DistSender goroutine was to blame for the stack overflow, but I agree that the stack was suspiciously shallow for a 1 GB stack. I can only assume that the stack frames were very large or some such. Maybe this is better in Go 1.19 given the new stack handling? Might be worth having a look at some typical stack sizes in steady-state clusters and see what's up, the runtime has some metrics for it.

erikgrinaker avatar Sep 23 '22 08:09 erikgrinaker

bors r=nvanbenschoten

arulajmani avatar Sep 26 '22 22:09 arulajmani

Build failed (retrying...):

craig[bot] avatar Sep 26 '22 23:09 craig[bot]

Build succeeded:

craig[bot] avatar Sep 27 '22 01:09 craig[bot]

I'm pretty sure that DistSender goroutine was to blame for the stack overflow, but I agree that the stack was suspiciously shallow for a 1 GB stack.

I also don't see how this adds up to 1GB. Subtracting the first stack pointer from the last in this goroutine does not give 1GB. And yet, we see a matching stack pointer in the panic text:

runtime: goroutine stack exceeds 1000000000-byte limit
runtime: sp=0xc1fb20c480 stack=[0xc1fb20c000, 0xc21b20c000]
fatal error: stack overflow

Notice that 0xc1fb20c480 is just beyond the address of the last stack frame that we see in the goroutine dump:

goroutine 5053528 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).reset(0xc066f4e700?, 0x1a3a00d3dd593d05?, 0x593a67a0ff4f5455?, {0x513f1f6?, 0x13?}, 0x4d1c58?, {0xc0bbb0299f62a59e?, 0xe2360df4425?, 0x9789580?}, 0xc01b2d06e0, ...)
        github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:577 +0xcd7 fp=0xc1fb20c490 sp=0xc1fb20c488 pc=0xc24937
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Tracer).newSpan(0x0?, 0x0?, 0x0?, {0x513f1f6, 0x13}, 0x0?, {0x0?, 0x0?, 0x9789580?}, 0xc01b2d06e0, ...)
        github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:971 +0x16f fp=0xc1fb20c570 sp=0xc1fb20c490 pc=0xc2b5af
...

So this goroutine does seem to be at fault.
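For what it's worth, the stack bounds in the panic output work out to a 512 MiB allocated stack, which would explain the discrepancy: assuming Go's usual stack-doubling growth, it's the attempt to grow to 1 GiB that trips the 1000000000-byte limit, so the dumped stack is only half that size. A quick arithmetic check:

```go
package main

import "fmt"

func main() {
	// Stack bounds from the panic output: stack=[0xc1fb20c000, 0xc21b20c000].
	const lo, hi = 0xc1fb20c000, 0xc21b20c000

	size := hi - lo
	fmt.Printf("allocated stack: %d bytes (%d MiB)\n", size, size>>20)

	// Doubling on the next growth attempt exceeds the 1000000000-byte limit.
	fmt.Println(2*size > 1000000000)
}
```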

nvanbenschoten avatar Sep 27 '22 13:09 nvanbenschoten