thanos
thanos copied to clipboard
all: remove a bunch of internal spans
On big thanos deployments we can have queries that emit several thousand spans. This is clearly not feasible so I propose that we start cleaning up some internal ones and start relying on the spans that are generated at the boundaries by grpc or http middleware instead.
- [ ] I added CHANGELOG entry for this change.
- [ ] Change is not relevant to the end user.
Changes
- removed a bunch of spans
Verification
- previous tests
Can we still keep those store gateway spans? They are still useful to understand store gateway performance. Without those spans, how do we understand which part is slow when requesting SG?
I'm not an thanos developer / expert so I'm not sure if those span are useful in some edge cases.
Is it possible to have some configurations like DEBUG
level tracing so that they can still be reported.
But me personally prefer fewer spans because we mainly want to know which components a promql goes through, and how much time they've costed. The opmization is great for my case, not sure if it fits other people's requirements (especially for developers and maintainers)
@yeya24 we could use a separate tracer for store-gw and inject a nopTracer
in Thanos. However, I think profiling would be a better fit to understand the internal performance of component.
Aren't the gRPC/HTTP spans enough to debug storage gateway though?
Aren't the gRPC/HTTP spans enough to debug storage gateway though?
From my own experience, the gRPC level span only tells you how long Series
call could take, but you still need to figure it out whether it is slow fetching postings or fetching series/chunks, or the request itself got queued. If it is slow fetching postings, it can be slow fetching either cache or object store. To me, internal spans are still useful.
However, I think profiling would be a better fit to understand the internal performance of component.
Yeah profiling is helpful but at least from my own experience it doesn't tell you what's slow for a specific request and hard to map it
On big thanos deployments we can have queries that emit several thousand spans. This is clearly not feasible so I propose that we start cleaning up some internal ones and start relying on the spans that are generated at the boundaries by grpc or http middleware instead.
Can you sample less or simply drop those spans at collector level (incase you are using OTEL collector)
Aren't the gRPC/HTTP spans enough to debug storage gateway though?
From my own experience, the gRPC level span only tells you how long
Series
call could take, but you still need to figure it out whether it is slow fetching postings or fetching series/chunks, or the request itself got queued. If it is slow fetching postings, it can be slow fetching either cache or object store. To me, internal spans are still useful.However, I think profiling would be a better fit to understand the internal performance of component.
Yeah profiling is helpful but at least from my own experience it doesn't tell you what's slow for a specific request and hard to map it
On big thanos deployments we can have queries that emit several thousand spans. This is clearly not feasible so I propose that we start cleaning up some internal ones and start relying on the spans that are generated at the boundaries by grpc or http middleware instead.
Can you sample less or simply drop those spans at collector level (incase you are using OTEL collector)
But all of those ( fetching from cache or object storage ) should have spans still if they are external calls. Out of curiosity; if i drop at the collector level; will the traces still be connected?
But all of those ( fetching from cache or object storage ) should have spans still if they are external calls
They are just examples. But internal spans like fetching block series and merging are removed in this pr.
Out of curiosity; if i drop at the collector level; will the traces still be connected?
I am not sure but I guess it might not work well. It probably also depends on how the tracing service visualizes a single trace (whether it displays the spans missing parents or not)
But all of those ( fetching from cache or object storage ) should have spans still if they are external calls
They are just examples. But internal spans like fetching block series and merging are removed in this pr.
Out of curiosity; if i drop at the collector level; will the traces still be connected?
I am not sure but I guess it might not work well. It probably also depends on how the tracing service visualizes a single trace (whether it displays the spans missing parents or not)
I think if they are missing parent then nothing is connecting them to the root span; so if you sample inner spans that hide a nested RPC call away then the RPC call should be missing i think, right?