
Compactors to distinguish 'importance' of traces by policy

jvilhuber opened this issue

Is your feature request related to a problem? Please describe. I'm adding tail-based sampling (via otel-collector) to my tracing solution, and while that helps reduce the number of traces I have in the backend, I'd love a little more flexibility in how long some traces (versus others) are kept.

"Normal" (by some definition) traces are interesting for a while at first, but are rarely looked at after a short while (retention=1week?), but "odd" traces (errors, slow, http.status_code=500, etc) might be interesting for a longer period of time (retention=1month?).

Describe the solution you'd like Looking at the tail-based policies, I wondered if it wouldn't make sense to have a similar (or greatly simplified, perhaps) set of policies whereby we might keep error and slow traces (i.e. anything matching some policy definition) for a longer period of time than "normal" traces (non-error, and fast).

Describe alternatives you've considered Using tail-based sampling we already greatly reduce the amount of "normal" traces. I see no other way to selectively change the retention period of traces in the backend.

Additional context https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
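For reference, the kind of tail-sampling policies I'm describing look roughly like this in the otel-collector (a sketch only; the policy names and thresholds are illustrative, not my exact setup):

processors:
  tail_sampling:
    decision_wait: 30s                  # how long to wait for spans before making a sampling decision
    policies:
      - name: errors                    # keep anything with an error status
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests             # keep anything slower than 5s
        type: latency
        latency:
          threshold_ms: 5000
      - name: http-5xx                  # keep anything with http.status_code >= 500
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code
          min_value: 500
          max_value: 599

The feature request is essentially: a similar (probably much smaller) policy language, but applied to retention in the backend rather than to sampling in the collector.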

jvilhuber avatar May 06 '22 08:05 jvilhuber

This is a cool feature 👍

Is the primary intention to store "interesting" traces for a longer period or to reduce storage costs by downsampling in the compactor?

From some internal metrics, storage is only about 30% of our operational cost, so if we are expecting to reduce operational cost by sampling down in the compactor, I'm not sure how effective that would be.

OTOH, if the intention is to have similar functionality to "archival of traces" as in Jaeger, then I am 100% on board.

Additionally, the backend format is in some flux at the moment; we are migrating to full Parquet blocks, and it might be worth waiting until that settles before implementing this.

annanay25 avatar May 06 '22 12:05 annanay25

I think for the purpose of this ticket, I'd say it's to reduce storage costs and speed up searching by removing the less interesting traces sooner. The notion of 'downsampling' into metrics is a separate (but interesting) discussion, I suspect.

But you have a point with the new storage format: Definitely something to keep on the back-burner for after that switchover (once we see how that shakes out).

jvilhuber avatar May 09 '22 08:05 jvilhuber

Another thought: I'm not really happy with tail-sampling in general. It has little to do with any implementation (say in otel-collector or grafana-agent), and more to do with the place it's done: the middle.

Consider: any collector-type thing (otel-collector and grafana-agent, for example) will need to wait some time for 'all' spans to come in before making a decision. If I make that "decision time" really long, I use up more memory on the collector. What's a good time? Any time I pick will inevitably be shorter than SOME traces that are just really long! And those are precisely the ones I need to look at to determine "what the heck took this request so long".

I was just looking at some traces where the interesting part is simply missing! After 30s (or whatever), otel-collector decided enough is enough, declared that my "slow requests" policy was matched, sampled the request, and sent it on. However, some of the spans that were the actual culprits took LONGER than the 30s and came in later, and thus weren't part of the trace in Tempo.

Ingesters and compactors already deal with the heavy lifting of waiting for spans, organizing them into traces, adding late spans to trace-blocks, etc. And they even do this on disk, instead of just memory, allowing for more flexibility in durations and storage-capacity. Duplicating most of that in some collector in the middle seems bug-prone at best.

In the backend I could configure a MUCH longer time-frame to wait for a decision (minutes, days even!): ingesters can keep adding late spans to the trace, and compactors can later go through and delete traces that are "complete" but not "interesting" (and past some expiration time).
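To make that concrete, the ingester already exposes knobs in this direction (a sketch; these key names are from my reading of the Tempo configuration docs, so treat them as an assumption rather than a recommendation):

ingester:
  trace_idle_period: 30s          # how long to wait after the last span before a trace is considered complete
  max_block_duration: 30m         # cut the head block after this long regardless
  complete_block_timeout: 15m     # how long completed blocks are retained in the ingester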

Sorry for the rambling.

EDIT: maybe my tail-based sampling with trace-id-load-balancing is misconfigured. That's possible, too.
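In case it helps spot a misconfiguration, my trace-id load-balancing layer looks roughly like this (a sketch with placeholder hostnames, using the contrib loadbalancing exporter):

exporters:
  loadbalancing:
    routing_key: traceID            # keep all spans of a trace going to the same downstream collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:                  # placeholder addresses of the tail-sampling collectors
          - sampling-collector-1:4317
          - sampling-collector-2:4317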

jvilhuber avatar May 10 '22 09:05 jvilhuber

These are some good points. I agree that tail based sampling in the OTel Collector/Grafana Agent has some challenges. The points you are making are spot on. There are certainly some use cases where it works quite well, but configurations can be brittle and it does delay moving the trace to the backend (as you have seen).

I, personally, love the originally requested feature. It's something we've talked about implementing a few times, but have always had higher priorities. Keeping traces in object storage is generally cheap and so we've often been focused on other elements of Tempo.

With new Parquet blocks and TraceQL coming up we will have our hands full for a bit, but I'd like to revisit this once those features are in place and our backend has settled. Thanks!

joe-elliott avatar May 16 '22 13:05 joe-elliott

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Nov 15 '22 00:11 github-actions[bot]

One use case I have in mind for a feature like this is distinguishing normal vs. abnormal traces (what counts as abnormal could be left to the agent to decide, or to the user to configure). The purpose is more about tracing back a problem that may lie beyond the retention period. For example, an end user of my application may report a problem that happened 14 days ago, but my traces are only retained for 7 days. This feature could let the maintainer find out exactly what happened and correlate it with other metrics/trace data. One could argue that we should instead investigate every error when it happens, by generating alerts that are stored somewhere else. That may be the workaround if this isn't available in Tempo; however, having it in Tempo would make things a lot easier.

ghengao avatar Apr 11 '23 21:04 ghengao

I think TraceQL is a great fit here so you can express any arbitrary condition. Would love to see something like:

retention:
  - query: '{ error = true || span.http.status_code >= 500 }'
    retention: 30d
  - query: '{}'        # default case matches everything else
    retention: 7d

mdisibio avatar Apr 21 '23 15:04 mdisibio

Yeah, Loki-styled retention streams would be great here. In Loki I use them to keep debug logs for a day, while everything else is stored for a longer amount of time. The default case is probably not needed, since it would just use the default retention setting in Tempo?

It would be great to keep errors for several months (rarely, but sometimes needed) while keeping "hot" data for a shorter amount of time.
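For reference, the Loki equivalent I'm talking about looks roughly like this (a sketch; the selector and periods are just my setup):

limits_config:
  retention_period: 744h           # default retention (~31 days)
  retention_stream:
    - selector: '{level="debug"}'  # debug logs are only kept for a day
      priority: 1
      period: 24h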

rlex avatar Apr 21 '23 15:04 rlex