Receive: fix serverAsClient.Series goroutine leak
The `Series` method of the `serverAsClient` struct leaks a goroutine when the request is stopped by a context cancellation. When the context is cancelled:
- `StoreServer.Series()` returns an error of type `context.Canceled`.
- This error is sent on an unbuffered channel.
- At the same time, the same context's `Done` channel is consumed by the `Recv()` function.
- In most cases, the context's `Done` channel is consumed first, so `Recv()` returns.
- Having returned this error, `Recv()` is never called again, leaving the sender goroutine dangling on the channel write.
This also happens in the proxy_heap `lazyRespSet` and `eagerRespSet` structures, where the same context `Done` channel is again consumed in parallel with `Recv()` calls.
I have removed the parallel `Recv()` and `ctx.Done()` consumption for the `storepb.SeriesResponse` types of clients.
I haven't included new tests because testing goroutine leaks deterministically is complicated, but don't hesitate to make suggestions. I did validate this leak locally with a custom unit test.
This fix should reduce Receiver memory usage and the unwanted crashes caused by OOMing.
- [x] I added CHANGELOG entry for this change.
- [ ] Change is not relevant to the end user.
Changes
Verification
Thanks for this fix. Maybe you could also upload the goroutine dump when this leak happens? 🤔 just so people would be able to find this problem easily. I have also been chasing a goroutine leak but I'm not sure if it's the same problem.
Yes sure, here is what I have:
| Flat | Cumulative | Name |
|---|---|---|
| 449.31k | 449.31k | runtime.gopark |
| 1 | 1 | github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).Uvarint64 |
| 0 | 449.31k | runtime.chansend |
| 0 | 449.31k | runtime.chansend1 |
| 0 | 449.31k | github.com/thanos-io/thanos/pkg/store/storepb.serverAsClient.Series.func1 |
| 0 | 1 | github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).Uvarint |
| 0 | 1 | github.com/prometheus/prometheus/tsdb/index.(*Decoder).Series |
| 0 | 1 | github.com/prometheus/prometheus/tsdb/index.(*Reader).Series |
| 0 | 1 | github.com/prometheus/prometheus/tsdb.blockIndexReader.Series |
| 0 | 1 | github.com/prometheus/prometheus/tsdb.(*blockBaseSeriesSet).Next |
| 0 | 1 | github.com/prometheus/prometheus/storage.(*genericMergeSeriesSet).Next |
| 0 | 1 | github.com/prometheus/prometheus/storage.(*lazyGenericSeriesSet).Next |
| 0 | 1 | github.com/thanos-io/thanos/pkg/store.(*TSDBStore).Series |
| 0 | 1 | github.com/thanos-io/thanos/pkg/store.(*recoverableStoreServer).Series |
Hello, I have done a more in-depth study of the root cause of the goroutine leak, which still occurs with v34.1. Here is what happens:
- We have set storeAPI limits that regularly get triggered on the receive.
- While streaming the response, the limiter gets triggered and returns an error from `limitedServer.Send()`.
- This error is propagated to `ProxyStore.Series()`, which returns with an error.
- The `Series()` error is propagated to the gRPC server, which returns an error to the client and cancels the context of the request.
- At that point, there are three goroutines still running:
  - G1: `lazyRespSet` with `handleRecvResponse()`
  - G2: `TSDBStore.Series()`, blocked on `inProcessStream.Send()`
  - G3: spawned by `serverAsClient.Series()`, waiting for the `Series()` call to return in `inSrv.err <- s.srv.Series(in, inSrv)`
From there, several execution orders are possible, and some of them leave G3 dangling, such as:
- G1 first reads the `ctx.Done()` channel and returns.
- G2 reads the `ctx.Done()` channel from `inProcessStream.Send()` and returns.
- G3 has the `Series()` call return, then blocks trying to write to the `err` channel, which G1 no longer consumes. G3 is left dangling.
This can be seen on this graph (the container had just started): in roughly 50% of the cases where the request is limited, a goroutine is leaked.
I have deployed this PR for the past 24h in our staging environment and it prevents the leak without breaking other parts.
My questions are then:
- Do you want me to reinsert the unit test showing the leak (it is in one of the commits)?
- ~~Unit tests are failing for what seem to be unrelated reasons; what do you think I should do about it?~~ Fixed with main merge.
- Do you need additional information to make this PR move forward?
Thanks 😄
You can use the /debug/pprof/goroutine endpoint to see which goroutines are running when this happens.
I am not sure what you mean. The PR fixes the leak as tested in our staging environment. The leaked goroutine is already identified. We are using the Parca profiling tool to scrape the pprof endpoints. This is how it was found.
@fpetkovski PR is ready.
Ack, let me know if you need anything @GiedriusS
No worries @GiedriusS , happy to see we're not the only ones having this issue 😅 . I just rebased.
No worries, thanks for merging this.