dd-trace-go
proposal: add support for more advanced trace filtering
I would like to get some thoughts on the feasibility of a Sampler that would allow:
- Send all errors (or X% of errors)
- Send Y% of successful requests.
This is possible in other SDKs (e.g., in ddtrace-py with a custom sampler), and at large volumes it would help ensure that errors are not missed due to sampling.
The current ddtrace.Span interface passed to the sampler does not seem to provide a way to know whether an error has occurred. The built-in samplers directly cast to the internal type span, which is more powerful.
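Concretely, the policy I have in mind looks something like the sketch below. This is stdlib-only with a hypothetical spanInfo type standing in for the span, since the point is exactly that the real ddtrace.Span interface does not expose the error state; the multiplicative-hash rate check mirrors what the built-in rate sampler does with the span ID.

```go
package main

import (
	"fmt"
	"math"
)

// spanInfo is a hypothetical stand-in for the information a sampler
// would need; the current ddtrace.Span interface does not expose Error.
type spanInfo struct {
	SpanID uint64
	Error  bool
}

// keep implements the desired policy: keep all error spans, and a
// deterministic successRate fraction of successful spans, decided by a
// multiplicative hash of the span ID.
func keep(s spanInfo, successRate float64) bool {
	if s.Error {
		return true
	}
	return s.SpanID*1111111111111111111 < uint64(successRate*float64(math.MaxUint64))
}

func main() {
	fmt.Println(keep(spanInfo{SpanID: 7, Error: true}, 0.0))  // true: errors are always kept
	fmt.Println(keep(spanInfo{SpanID: 7, Error: false}, 0.6)) // true: ID 7 hashes into the lower 60%
	fmt.Println(keep(spanInfo{SpanID: 7, Error: false}, 0.0)) // false: rate 0 drops all successes
}
```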
Hey @peay! Thanks for reaching out here. Have you taken a look at using our retention filters to solve this use case? https://docs.datadoghq.com/tracing/trace_retention_and_ingestion/
@andrewsouthard1 I considered this, but with retention filters the filtering happens in the Datadog backend, and I am interested in reducing traffic to Datadog in the first place. Other SDKs (e.g., ddtrace-py) can do rather flexible sampling based on specifics of the span, and I'd find that useful in the Go SDK as well.
One way you could most reliably do this today is to selectively create spans based on those aspects. For example, if you'd like 60% of errors to be sent, you could do something like this:
```go
start := time.Now()
// ... your code, producing err ...
if err != nil {
	// Deterministic coin flip: multiply the span ID by a large odd constant
	// and keep the span if the result lands in the lower 60% of the uint64 range.
	withinP := span.SpanID()*uint64(1111111111111111111) < uint64(0.6*math.MaxUint64)
	if withinP {
		tracer.StartSpan("op.name", tracer.StartTime(start)).Finish(tracer.WithError(err))
	}
}
```
Would this be acceptable?
Note however that doing this will cause APM stats to be incorrect because the trace-agent will be unaware of 40% of spans. For this reason, we do not recommend such behaviour.
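To see that the span-ID check in the snippet above does keep roughly the target fraction, here is a quick stdlib-only simulation. It uses the same multiplicative constant; since multiplying by an odd constant is a bijection on uint64, uniformly random span IDs land uniformly across the range, so about 60% fall under the threshold.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// keptFraction runs the span-ID rate check on n pseudo-random IDs and
// returns the fraction that would be kept at target rate p.
func keptFraction(p float64, n int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed)) // fixed seed for reproducibility
	threshold := uint64(p * float64(math.MaxUint64))
	kept := 0
	for i := 0; i < n; i++ {
		if rng.Uint64()*1111111111111111111 < threshold {
			kept++
		}
	}
	return float64(kept) / float64(n)
}

func main() {
	// Prints a fraction very close to the 0.6 target.
	fmt.Printf("kept %.3f of spans (target 0.6)\n", keptFraction(0.6, 1_000_000, 1))
}
```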
Lastly, can you please point me to the API in dd-trace-py you speak of? For what it's worth, we do not want to extend the Sampler interface with additional implementations, because we do not encourage its use except in rare situations, and then with a big caveat that APM stats will become incorrect as a result. Instead, we encourage users to use the rule sampler.
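For context, rule-based sampling conceptually works like the following stdlib-only sketch (the rule type and rateFor helper here are hypothetical illustrations, not the dd-trace-go API): rules are evaluated in order, the first match decides the rate, and a fallback applies when nothing matches.

```go
package main

import "fmt"

// rule is a hypothetical sampling rule: spans whose service matches are
// sampled at the given rate; the first matching rule wins.
type rule struct {
	Service string // empty matches any service
	Rate    float64
}

// rateFor returns the sampling rate of the first matching rule,
// or fallback if no rule matches.
func rateFor(rules []rule, service string, fallback float64) float64 {
	for _, r := range rules {
		if r.Service == "" || r.Service == service {
			return r.Rate
		}
	}
	return fallback
}

func main() {
	rules := []rule{
		{Service: "checkout", Rate: 1.0}, // keep everything from checkout
		{Rate: 0.1},                      // 10% of everything else
	}
	fmt.Println(rateFor(rules, "checkout", 1.0)) // 1
	fmt.Println(rateFor(rules, "search", 1.0))   // 0.1
}
```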
Thanks for the thoughts @gbbr, that makes sense. Selectively creating spans is definitely an option, although if you're using one of the integrations (e.g., to trace the HTTP requests of a web server), a span is already created automatically and that approach isn't available.
The rationale for not allowing more general filtering of traces makes sense. My use case is to use APM mostly for the "Error tracking" feature, à la Sentry. As such, we're not really interested in APM stats in general, in error rates, or in traces without errors, but only in traces with exceptions, so that they can be grouped, triaged, and fixed.
I think we can try to align on using retention filters only - it just feels inefficient to have to send through the network the 99.999%+ of traces that we are not interested in.
For the dd-trace-py API, see trace filtering. It definitely will not produce accurate APM stats, but it is as flexible as it gets.
Got it. Thanks for clarifying and for sharing the link to the Python tracer. I'll check in with the team and mark this as a feature request. I propose we continue discussing options to do this here in this issue.
It seems that the Python tracer uses these filters to completely avoid sending these to the agent. This basically means that no APM stats nor APM Analytics events will be generated for these. They will be completely disregarded as if having never existed. Is this outcome acceptable to you @peay? If not, the alternative is to simply mark the filtered traces using ext.PriorityAutoReject, have them reach the agent but not the intake, get APM stats and potentially APM Analytics Events (if activated). The latter version incurs additional local network traffic.
I propose we stay consistent with the dd-trace-py implementation and skip these traces completely, as if they never existed. If needed, we can always add an alternative behaviour to the filters in the future.
Thoughts?
I agree with dropping altogether. It would be more consistent with dd-trace-py and is also more in the direction we're already trying to go with the client - reducing traffic to the agent.
On the same note, we already have client-computed stats behind a feature flag, so it would be nice to make this work with that too.
Cool. Thanks @knusbaum. Let's wait for confirmation from @peay that this approach is ok and if so, close this issue and create a fresh proposal with the suggested design.
> This basically means that no APM stats nor APM Analytics events will be generated for these. They will be completely disregarded as if having never existed. Is this outcome acceptable to you @peay?
Makes sense for our use case (error tracking) @gbbr, thanks for considering this!
Similar in concept to #360