thanos icon indicating copy to clipboard operation
thanos copied to clipboard

all: remove a bunch of internal spans

Open MichaHoffmann opened this issue 1 year ago • 8 comments

On big thanos deployments we can have queries that emit several thousand spans. This is clearly not feasible so I propose that we start cleaning up some internal ones and start relying on the spans that are generated at the boundaries by grpc or http middleware instead.

  • [ ] I added CHANGELOG entry for this change.
  • [ ] Change is not relevant to the end user.

Changes

  • removed a bunch of spans

Verification

  • previous tests

MichaHoffmann avatar Dec 18 '23 17:12 MichaHoffmann

Can we still keep those store gateway spans? They are still useful to understand store gateway performance. Without those spans, how do we understand which part is slow when requesting SG?

yeya24 avatar Dec 18 '23 22:12 yeya24

I'm not an thanos developer / expert so I'm not sure if those span are useful in some edge cases. Is it possible to have some configurations like DEBUG level tracing so that they can still be reported.

But me personally prefer fewer spans because we mainly want to know which components a promql goes through, and how much time they've costed. The opmization is great for my case, not sure if it fits other people's requirements (especially for developers and maintainers)

jiekun avatar Dec 19 '23 02:12 jiekun

@yeya24 we could use a separate tracer for store-gw and inject a nopTracer in Thanos. However, I think profiling would be a better fit to understand the internal performance of component.

fpetkovski avatar Dec 19 '23 06:12 fpetkovski

Aren't the gRPC/HTTP spans enough to debug storage gateway though?

MichaHoffmann avatar Dec 19 '23 07:12 MichaHoffmann

Aren't the gRPC/HTTP spans enough to debug storage gateway though?

From my own experience, the gRPC level span only tells you how long Series call could take, but you still need to figure it out whether it is slow fetching postings or fetching series/chunks, or the request itself got queued. If it is slow fetching postings, it can be slow fetching either cache or object store. To me, internal spans are still useful.

However, I think profiling would be a better fit to understand the internal performance of component.

Yeah profiling is helpful but at least from my own experience it doesn't tell you what's slow for a specific request and hard to map it

On big thanos deployments we can have queries that emit several thousand spans. This is clearly not feasible so I propose that we start cleaning up some internal ones and start relying on the spans that are generated at the boundaries by grpc or http middleware instead.

Can you sample less or simply drop those spans at collector level (incase you are using OTEL collector)

yeya24 avatar Dec 20 '23 01:12 yeya24

Aren't the gRPC/HTTP spans enough to debug storage gateway though?

From my own experience, the gRPC level span only tells you how long Series call could take, but you still need to figure it out whether it is slow fetching postings or fetching series/chunks, or the request itself got queued. If it is slow fetching postings, it can be slow fetching either cache or object store. To me, internal spans are still useful.

However, I think profiling would be a better fit to understand the internal performance of component.

Yeah profiling is helpful but at least from my own experience it doesn't tell you what's slow for a specific request and hard to map it

On big thanos deployments we can have queries that emit several thousand spans. This is clearly not feasible so I propose that we start cleaning up some internal ones and start relying on the spans that are generated at the boundaries by grpc or http middleware instead.

Can you sample less or simply drop those spans at collector level (incase you are using OTEL collector)

But all of those ( fetching from cache or object storage ) should have spans still if they are external calls. Out of curiosity; if i drop at the collector level; will the traces still be connected?

MichaHoffmann avatar Dec 20 '23 08:12 MichaHoffmann

But all of those ( fetching from cache or object storage ) should have spans still if they are external calls

They are just examples. But internal spans like fetching block series and merging are removed in this pr.

Out of curiosity; if i drop at the collector level; will the traces still be connected?

I am not sure but I guess it might not work well. It probably also depends on how the tracing service visualizes a single trace (whether it displays the spans missing parents or not)

yeya24 avatar Dec 20 '23 13:12 yeya24

But all of those ( fetching from cache or object storage ) should have spans still if they are external calls

They are just examples. But internal spans like fetching block series and merging are removed in this pr.

Out of curiosity; if i drop at the collector level; will the traces still be connected?

I am not sure but I guess it might not work well. It probably also depends on how the tracing service visualizes a single trace (whether it displays the spans missing parents or not)

I think if they are missing parent then nothing is connecting them to the root span; so if you sample inner spans that hide a nested RPC call away then the RPC call should be missing i think, right?

MichaHoffmann avatar Dec 20 '23 19:12 MichaHoffmann