
Receive: fix serverAsClient.Series goroutines leak

Open thibaultmg opened this issue 1 year ago • 7 comments

The Series method in the serverAsClient struct leaks goroutines when the request is stopped by a context cancellation.

This also happens in the proxy_heap lazyRespSet and eagerRespSet structures, where the same context Done channel is again consumed in parallel with Recv() calls.

I have removed the parallel consumption of Recv() and ctx.Done() for the storepb.SeriesResponse types of clients.

I haven't included new tests because testing goroutine leaks deterministically is complicated, but don't hesitate to share suggestions. I did, however, validate the leak locally with a custom unit test.

This fix should reduce the Receiver's memory usage and the unwanted crashes caused by OOMs.

  • [x] I added CHANGELOG entry for this change.
  • [ ] Change is not relevant to the end user.

Changes

Verification

thibaultmg commented on Dec 01 '23

Thanks for this fix. Maybe you could also upload the goroutine dump when this leak happens? 🤔 just so people would be able to find this problem easily. I have also been chasing a goroutine leak but I'm not sure if it's the same problem.

GiedriusS commented on Dec 01 '23

> Thanks for this fix. Maybe you could also upload the goroutine dump when this leak happens? 🤔 just so people would be able to find this problem easily. I have also been chasing a goroutine leak but I'm not sure if it's the same problem.

Yes sure, here is what I have:

```
    Flat  Cumulative  Name
 449.31k     449.31k  runtime.gopark
       1           1  github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).Uvarint64
       0     449.31k  runtime.chansend
       0     449.31k  runtime.chansend1
       0     449.31k  github.com/thanos-io/thanos/pkg/store/storepb.serverAsClient.Series.func1
       0           1  github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).Uvarint
       0           1  github.com/prometheus/prometheus/tsdb/index.(*Decoder).Series
       0           1  github.com/prometheus/prometheus/tsdb/index.(*Reader).Series
       0           1  github.com/prometheus/prometheus/tsdb.blockIndexReader.Series
       0           1  github.com/prometheus/prometheus/tsdb.(*blockBaseSeriesSet).Next
       0           1  github.com/prometheus/prometheus/storage.(*genericMergeSeriesSet).Next
       0           1  github.com/prometheus/prometheus/storage.(*lazyGenericSeriesSet).Next
       0           1  github.com/thanos-io/thanos/pkg/store.(*TSDBStore).Series
       0           1  github.com/thanos-io/thanos/pkg/store.(*recoverableStoreServer).Series
```

thibaultmg commented on Dec 01 '23

Hello, I have done a more in-depth analysis of the root cause of the goroutine leak, which still occurs with v34.1. Here is what happens:

  • We have set storeAPI limits that get regularly triggered on the receive
  • While streaming the response, the limiter gets triggered and limitedServer.Send() returns an error
  • This error is in turn propagated to ProxyStore.Series(), which returns with an error
  • The Series() error is propagated to the gRPC server, which returns an error to the client and cancels the request's context.
  • At that point, there are three goroutines still running:
    • G1: lazyRespSet with handleRecvResponse()
    • G2: TSDBStore.Series() that blocks on inProcessStream.Send()
    • G3: spawned by serverAsClient.Series(), waiting for the Series() call to return in inSrv.err <- s.srv.Series(in, inSrv)

From there, several execution orders are possible, and some of them leave G3 dangling, for example:

  • G1 reads the ctx.Done() channel first and returns. G2 reads the ctx.Done() channel from inProcessStream.Send() and returns. G3's Series() call then returns, and G3 blocks trying to write to the err channel, which G1 no longer consumes. G3 is left dangling.

This can be seen in the graph below (the container had just started), where in roughly 50% of the cases where the request is limited, a goroutine is leaked.

[Screenshot: goroutine count graph, 2024-03-28]

I have had this PR deployed in our staging environment for the past 24h; it prevents the leak without breaking other parts.

My questions are then:

  • Do you want me to reinsert the unit test showing the leak (it is in one of the commits)?
  • ~~Units test are failing for what seems to be unrelated reasons, what do you think I should do about it?~~ Fixed with main merge.
  • Do you need additional information to make this PR move forward?

Thanks 😄

thibaultmg commented on Apr 05 '24

You can use the /debug/pprof/goroutine endpoint to see which goroutines are running when this happens.

fpetkovski commented on Apr 12 '24

> You can use the /debug/pprof/goroutine endpoint to see which goroutines are running when this happens.

I am not sure what you mean. The PR fixes the leak, as tested in our staging environment, and the leaked goroutine is already identified. We use the Parca profiling tool to scrape the pprof endpoints; that is how the leak was found.

thibaultmg commented on Apr 12 '24

@fpetkovski PR is ready.

thibaultmg commented on Apr 15 '24

Ack, let me know if you need anything @GiedriusS

thibaultmg commented on Apr 17 '24

No worries @GiedriusS , happy to see we're not the only ones having this issue 😅 . I just rebased.

thibaultmg commented on May 16 '24

No worries, thanks for merging this.

thibaultmg commented on May 21 '24