dd-trace-go
tracer: add support for partial flushing
Description
Adding support for the following environment variables:
DD_TRACER_PARTIAL_FLUSH_ENABLED
DD_TRACER_PARTIAL_FLUSH_MIN_SPANS
These variables are documented and available in the Ruby, Python, and Java Datadog tracing libraries.
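For illustration, a minimal sketch of how a Go service might opt in, assuming the variable names requested above (the names that actually ship may differ) and that the tracer reads them from the environment at startup:

```go
// A minimal sketch, assuming the variable names requested in this issue.
// Normally these would be set in the deployment environment rather than in code.
package main

import (
	"os"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
	os.Setenv("DD_TRACER_PARTIAL_FLUSH_ENABLED", "true")
	os.Setenv("DD_TRACER_PARTIAL_FLUSH_MIN_SPANS", "500")

	// The tracer would pick the variables up when it starts.
	tracer.Start()
	defer tracer.Stop()

	// ... application code ...
}
```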
Why
I'm currently working through a project with a long-running process (it can take 1hr+ to finish). The process has a single root span that covers the entire run, but it crashes with the following error on the current dd-trace-go library:
Datadog Tracer v1.34.0 ERROR: trace buffer full (100000), dropping trace (occurred: 10 Aug 22 17:31 UTC)
To my understanding, adding support for partial flushing would help resolve this issue.
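To make the failure mode concrete, here is a minimal sketch of the pattern described above (the span names batch.process and batch.step are hypothetical): every finished child span is buffered under the still-open root span until the root finishes, which is what eventually fills the trace buffer.

```go
package main

import (
	"context"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func runBatch(ctx context.Context) {
	// Root span covers the entire 1hr+ process.
	root, ctx := tracer.StartSpanFromContext(ctx, "batch.process")
	defer root.Finish()

	for i := 0; i < 200000; i++ {
		// Each finished child span stays buffered in memory until the root
		// finishes, which is what eventually overflows the trace buffer.
		step, _ := tracer.StartSpanFromContext(ctx, "batch.step")
		// ... process one item ...
		step.Finish()
	}
}

func main() {
	tracer.Start()
	defer tracer.Stop()
	runBatch(context.Background())
}
```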
If it helps, I can submit a PR; I just need guidance on how to navigate the codebase to implement this change.
Any updates on this issue?
Hi @Ydot19 - thanks for bringing this up. For our own understanding, would you be able to explain your use case here? You mentioned that you are generating 100K spans over the course of an hour+ for a single trace. How do you expect to use these spans? Do you need the trace spans, or just the metrics?
We're going to close this due to inactivity, but please file a new issue or reach out to your customer rep / support if this is still an issue for you.
We have run into this issue on multiple occasions, or at least into its side effect. Some of our services use websockets, and developers naturally pass the context of the initial HTTP request into the socket, which causes the service to accumulate spans for the duration of the socket and eventually crash with an OOM.
So our use case isn't really that we need long traces, but that spans should ideally be capped at a configurable limit by default to prevent mistakes like this.
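For illustration, a minimal sketch of that pattern, with a hypothetical upgrade helper, connection interface, and handleMessage function standing in for whatever websocket library is actually in use: every per-message span is parented to the span of the initial HTTP request, so nothing can be flushed until the socket closes.

```go
package main

import (
	"context"
	"net/http"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// wsConn is a stand-in for the connection type of the websocket library in use.
type wsConn interface {
	ReadMessage() ([]byte, error)
	Close() error
}

// upgrade is a hypothetical helper that performs the websocket upgrade.
func upgrade(w http.ResponseWriter, r *http.Request) (wsConn, error) { /* ... */ return nil, nil }

// handleMessage is a hypothetical per-message handler.
func handleMessage(ctx context.Context, msg []byte) { /* ... */ }

func wsHandler(w http.ResponseWriter, r *http.Request) {
	// r.Context() carries the span of the initial HTTP request; that span
	// stays open for the entire lifetime of the socket.
	ctx := r.Context()

	conn, err := upgrade(w, r)
	if err != nil {
		return
	}
	defer conn.Close()

	for {
		msg, err := conn.ReadMessage()
		if err != nil {
			return
		}
		// Each per-message span is a child of the long-lived request span,
		// so the whole trace is buffered until the socket closes.
		span, msgCtx := tracer.StartSpanFromContext(ctx, "ws.handle_message")
		handleMessage(msgCtx, msg)
		span.Finish()
	}
}
```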
Thanks @johanneswuerbach. Will re-open so we can keep discussing this.
@katiehockman We're facing similar memory bloat and ultimately an OOM while tracing bidirectional gRPC streams that produce long-running traces. pprof showed setMeta and setMetric holding most of the memory. Once the stream is closed, memory utilization comes back to normal.
Thanks for the details and analysis @Abhishekvrshny. We'll get back to you about this shortly.
Any update on this? We would love to use tracing to troubleshoot issues on websockets, but those are potentially long-running and might accumulate a lot of spans.
Hi @johanneswuerbach, this is something that we're actively working on. I will note, though, that partial flushing will not help for long-running spans directly. Long-running spans (>1hr) may not display correctly in the user interface and may be difficult to navigate. Partial flushing is more specifically good for relieving memory pressure in situations where you have many finished spans under an unfinished span. If you have a use case for long-running spans, definitely feel free to open a ticket with Datadog Support so we can best prioritize that work!
Hey @ajgajg1134, in our case the websocket itself (the long span) isn't the interesting bit either; it's the short spans that happen during its lifecycle.
Beta support for partial flushing has now been released in dd-trace-go! Please try it out and give us feedback, and let support know if you run into any issues.
Thank you @katiehockman