
Receive: fix serverAsClient.Series goroutines leak

Open thibaultmg opened this issue 1 year ago • 7 comments

The Series method in the serverAsClient struct leaks goroutines when the request is stopped by a context cancellation.

This also happens in the proxy_heap lazyRespSet and eagerRespSet structures, where the same context Done channel is again consumed in parallel with Recv() calls.

I have removed the parallel consumption of Recv() and ctx.Done() for the storepb.SeriesResponse types of clients.

I haven't included new tests because testing goroutine leaks deterministically is complicated, but don't hesitate to share suggestions. I did, however, validate the leak locally with a custom unit test.

This fix should reduce the Receiver's memory usage and the unwanted crashes caused by OOMs.

  • [x] I added CHANGELOG entry for this change.
  • [ ] Change is not relevant to the end user.

Changes

Verification

thibaultmg commented on Dec 01 '23

Thanks for this fix. Maybe you could also upload the goroutine dump when this leak happens? 🤔 just so people would be able to find this problem easily. I have also been chasing a goroutine leak but I'm not sure if it's the same problem.

GiedriusS commented on Dec 01 '23

> Thanks for this fix. Maybe you could also upload the goroutine dump when this leak happens? 🤔 just so people would be able to find this problem easily. I have also been chasing a goroutine leak but I'm not sure if it's the same problem.

Yes sure, here is what I have:

```
    Flat  Cumulative  Name
 449.31k     449.31k  runtime.gopark
       1           1  github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).Uvarint64
       0     449.31k  runtime.chansend
       0     449.31k  runtime.chansend1
       0     449.31k  github.com/thanos-io/thanos/pkg/store/storepb.serverAsClient.Series.func1
       0           1  github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).Uvarint
       0           1  github.com/prometheus/prometheus/tsdb/index.(*Decoder).Series
       0           1  github.com/prometheus/prometheus/tsdb/index.(*Reader).Series
       0           1  github.com/prometheus/prometheus/tsdb.blockIndexReader.Series
       0           1  github.com/prometheus/prometheus/tsdb.(*blockBaseSeriesSet).Next
       0           1  github.com/prometheus/prometheus/storage.(*genericMergeSeriesSet).Next
       0           1  github.com/prometheus/prometheus/storage.(*lazyGenericSeriesSet).Next
       0           1  github.com/thanos-io/thanos/pkg/store.(*TSDBStore).Series
       0           1  github.com/thanos-io/thanos/pkg/store.(*recoverableStoreServer).Series
```

thibaultmg commented on Dec 01 '23

Hello, I have done a more in-depth analysis of the root cause of the goroutine leak, which still occurs with v34.1. Here is what happens:

  • We have set storeAPI limits that get regularly triggered on the receive
  • While streaming the response, the limiter gets triggered and limitedServer.Send() returns an error
  • This error is in turn propagated to ProxyStore.Series(), which returns with an error
  • The Series() error is propagated to the gRPC server, which returns an error to the client and cancels the request's context.
  • At that point, there are three goroutines still running:
    • G1: lazyRespSet with handleRecvResponse()
    • G2: TSDBStore.Series() that blocks on inProcessStream.Send()
    • G3: spawned by serverAsClient.Series(), waiting for the Series() call to return in inSrv.err <- s.srv.Series(in, inSrv)

From there, several execution orders are possible, and some of them leave G3 dangling, for example:

  • G1 reads the ctx.Done() channel first and returns. G2 reads the ctx.Done() channel from inProcessStream.Send() and returns. G3's Series() call then returns, and G3 blocks trying to write to the err channel, which G1 no longer consumes. G3 is left dangling.

This can be seen in the graph below (the container had just started), where in roughly 50% of the cases where the request is limited, a goroutine is leaked.

[Screenshot: goroutine count graph, 2024-03-28]

I have had this PR deployed in our staging environment for the past 24h; it prevents the leak without breaking other parts.

My questions are then:

  • Do you want me to reinsert the unit test showing the leak (it is in one of the commits)?
  • ~~Units test are failing for what seems to be unrelated reasons, what do you think I should do about it?~~ Fixed with main merge.
  • Do you need additional information to make this PR move forward?

Thanks 😄

thibaultmg commented on Apr 05 '24

You can use the /debug/pprof/goroutine endpoint to see which goroutines are running when this happens.

fpetkovski commented on Apr 12 '24

> You can use the /debug/pprof/goroutine endpoint to see which goroutines are running when this happens.

I am not sure what you mean. The PR fixes the leak, as tested in our staging environment, and the leaked goroutine is already identified. We use the Parca profiling tool to scrape the pprof endpoints; that is how the leak was found.

thibaultmg commented on Apr 12 '24

@fpetkovski PR is ready.

thibaultmg commented on Apr 15 '24

Ack, let me know if you need anything @GiedriusS

thibaultmg commented on Apr 17 '24

No worries @GiedriusS , happy to see we're not the only ones having this issue 😅 . I just rebased.

thibaultmg commented on May 16 '24

No worries, thanks for merging this.

thibaultmg commented on May 21 '24