dd-trace-go
tracer: add support for partial flushing
Description
Adding support for the following environment variables:
DD_TRACER_PARTIAL_FLUSH_ENABLED
DD_TRACER_PARTIAL_FLUSH_MIN_SPANS
These variables are documented and available in the Ruby, Python, and Java Datadog tracing libraries.
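For illustration, a minimal sketch of how a Go service might opt in, assuming the variable names requested above (the names that actually ship may differ) and that the tracer reads them from the environment at startup:

```go
// A minimal sketch, assuming the variable names requested in this issue.
// Normally these would be set in the deployment environment rather than in code.
package main

import (
	"os"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
	os.Setenv("DD_TRACER_PARTIAL_FLUSH_ENABLED", "true")
	os.Setenv("DD_TRACER_PARTIAL_FLUSH_MIN_SPANS", "500")

	// The tracer would pick the variables up when it starts.
	tracer.Start()
	defer tracer.Stop()

	// ... application code ...
}
```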
Why
I'm currently working through a project with a long-running process (it can take 1hr+ to finish). The process has a single root span that covers the entire run, but it crashes with the following error on the current dd-trace-go library:
Datadog Tracer v1.34.0 ERROR: trace buffer full (100000), dropping trace (occurred: 10 Aug 22 17:31 UTC)
To my understanding, adding support for partial flushing would help resolve this issue.
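To make the failure mode concrete, here is a minimal sketch of the pattern described above (the span names batch.process and batch.step are hypothetical): every finished child span is buffered under the still-open root span until the root finishes, which is what eventually fills the trace buffer.

```go
package main

import (
	"context"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func runBatch(ctx context.Context) {
	// Root span covers the entire 1hr+ process.
	root, ctx := tracer.StartSpanFromContext(ctx, "batch.process")
	defer root.Finish()

	for i := 0; i < 200000; i++ {
		// Each finished child span stays buffered in memory until the root
		// finishes, which is what eventually overflows the trace buffer.
		step, _ := tracer.StartSpanFromContext(ctx, "batch.step")
		// ... process one item ...
		step.Finish()
	}
}

func main() {
	tracer.Start()
	defer tracer.Stop()
	runBatch(context.Background())
}
```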
If it helps, I can submit a PR; I just need guidance on how to navigate the codebase to implement this change.
Any updates on this issue?
Hi @Ydot19 - thanks for bringing this up. For our own understanding, would you be able to explain your use case here? You mentioned that you are generating 100K spans over the course of an hour+ for a single trace. How do you expect to use these spans? Do you need the trace spans, or just the metrics?
We're going to close this due to inactivity, but please file a new issue or reach out to your customer rep / support if this is still an issue for you.
We have run into this issue on multiple occasions, or at least into its side effect. Some of our services use websockets, and developers naturally pass the context of the initial HTTP request into the socket, which causes the service to accumulate spans for the duration of the socket and eventually crash with an OOM.
So our use case isn't really that we need long traces, but that spans should ideally be capped at a configurable limit by default to prevent mistakes like this.
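For illustration, a minimal sketch of that pattern, with a hypothetical upgrade helper, connection interface, and handleMessage function standing in for whatever websocket library is actually in use: every per-message span is parented to the span of the initial HTTP request, so nothing can be flushed until the socket closes.

```go
package main

import (
	"context"
	"net/http"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// wsConn is a stand-in for the connection type of the websocket library in use.
type wsConn interface {
	ReadMessage() ([]byte, error)
	Close() error
}

// upgrade is a hypothetical helper that performs the websocket upgrade.
func upgrade(w http.ResponseWriter, r *http.Request) (wsConn, error) { /* ... */ return nil, nil }

// handleMessage is a hypothetical per-message handler.
func handleMessage(ctx context.Context, msg []byte) { /* ... */ }

func wsHandler(w http.ResponseWriter, r *http.Request) {
	// r.Context() carries the span of the initial HTTP request; that span
	// stays open for the entire lifetime of the socket.
	ctx := r.Context()

	conn, err := upgrade(w, r)
	if err != nil {
		return
	}
	defer conn.Close()

	for {
		msg, err := conn.ReadMessage()
		if err != nil {
			return
		}
		// Each per-message span is a child of the long-lived request span,
		// so the whole trace is buffered until the socket closes.
		span, msgCtx := tracer.StartSpanFromContext(ctx, "ws.handle_message")
		handleMessage(msgCtx, msg)
		span.Finish()
	}
}
```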
Thanks @johanneswuerbach. Will re-open so we can keep discussing this.
@katiehockman We're facing similar memory bloat and ultimately an OOM while tracing bidirectional gRPC streams that produce long-running traces. pprof showed setMeta and setMetric holding most of the memory. Once the stream is closed, memory utilization comes back to normal.
Thanks for the details and analysis @Abhishekvrshny. We'll get back to you about this shortly.
Any update on this? We would love to use tracing to troubleshoot issues on websockets, but those are potentially long-running and might accumulate a lot of spans.
Hi @johanneswuerbach, this is something that we're actively working on. I will note, though, that partial flushing will not help for long-running spans directly. Long-running spans (>1hr) may not display correctly in the user interface and may be difficult to navigate. Partial flushing is more specifically good for relieving memory pressure in situations where you have many finished spans under an unfinished span. If you have a use case for long-running spans, definitely feel free to open a ticket with Datadog Support so we can best prioritize that work!
Hey @ajgajg1134, in our case the websocket itself (the long span) isn't the interesting bit either; it's the short spans that happen during its lifecycle.
Beta support for partial flushing has now been released in dd-trace-go! Please try it out and give us feedback, and let support know if you run into any issues.
Thank you @katiehockman