dd-trace-js
Add `flush()` method to tracer
What does this PR do?
Allows users to manually flush any collected spans to the agent exporter. This is very useful in scenarios where the app manually calls `process.exit()`, such as the one described by @fredefox here:
```js
const tracer = DD.init();
const res = await App.run(tracer);
await tracer.flush();
process.exit(res);
```
Motivation
I was implementing an app that handles SIGTERM in order to do a clean shutdown, and dd-trace did not export the span containing the error to Datadog.
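For illustration, a minimal sketch of that shutdown path, assuming the `flush()` method proposed in this PR returns a promise; the `shutdown()` helper is a placeholder for app-specific cleanup and not part of dd-trace:

```js
const tracer = require('dd-trace').init();

process.on('SIGTERM', async () => {
  try {
    await shutdown(); // placeholder: close servers, DB pools, etc.
  } finally {
    await tracer.flush(); // proposed API: wait for buffered spans to reach the agent
    process.exit(0);
  }
});
```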
@rochdev Thanks for checking out the PR.
One issue that could arise: if the process runs for longer than the interval (default 2 seconds), an automatic flush will already be in progress, there would be no way to wait for it externally, and calling flush manually at that point would have no effect. Maybe it would make sense to add a `-1` value to disable the interval completely, so that flushing manually would be required?
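For context, the interval in question corresponds to the tracer's `flushInterval` option (in milliseconds); the `-1` behaviour shown below is only the proposal from this comment, not something the library supports:

```js
const tracer = require('dd-trace').init({
  flushInterval: 2000 // default: buffered spans are flushed every 2 seconds
  // flushInterval: -1 // proposed: disable the scheduler and rely solely on manual tracer.flush()
});
```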
That's a good point. We could conceivably also keep track of the last flush performed by the scheduler and ensure that if you call `flush()` directly, it also waits for that ongoing flush.
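A rough sketch of that idea, assuming the writer keeps a handle on the promise of its most recent flush; `_lastFlush` and `_sendPayload()` are illustrative names, not the actual internals:

```js
class Writer {
  constructor() {
    this._lastFlush = Promise.resolve();
  }

  // Called both by the interval-based scheduler and by the public tracer.flush().
  flush() {
    // Chain onto the previous flush so a manual flush() also waits for any
    // flush the scheduler already has in flight.
    this._lastFlush = this._lastFlush.then(() => this._sendPayload());
    return this._lastFlush;
  }

  _sendPayload() {
    // ...encode the buffered spans and send them to the agent...
    return Promise.resolve();
  }
}
```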
EDIT: @rochdev pushed the above approach. Can you take a look?
Codecov Report
Merging #1227 (b924925) into master (32f9fb4) will increase coverage by 0.00%. The diff coverage is 100.00%.
```diff
@@           Coverage Diff            @@
##           master    #1227    +/-   ##
========================================
  Coverage   92.57%   92.58%
========================================
  Files         150      150
  Lines        6113     6121     +8
========================================
+ Hits         5659     5667     +8
  Misses        454      454
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| packages/dd-trace/src/exporters/agent/writer.js | 97.05% <100.00%> (+0.23%) | :arrow_up: |
| packages/dd-trace/src/noop/tracer.js | 78.94% <100.00%> (+1.16%) | :arrow_up: |
| packages/dd-trace/src/proxy.js | 91.80% <100.00%> (+0.13%) | :arrow_up: |
| packages/dd-trace/src/tracer.js | 96.10% <100.00%> (+0.05%) | :arrow_up: |
Is this PR stale? We were looking for this functionality today.
I would be willing to take it over if @nunofgs isn't working on it anymore.
Any updates on this? We would be glad to contribute as well.
@extradosages @olup I think it's safe to say this PR is pretty stale. Are either of you still in need of this functionality?
I believe this still doesn't exist in the library and one still has to add some random delay at the end of a Cron, for instance, to make sure traces are sent. So I think this would still be very helpful, but I agree the PR needs some love.
> I believe this still doesn't exist in the library and one still has to add some random delay at the end of a Cron, for instance, to make sure traces are sent.
In case it bumps the priority up a bit, this bites us in everything, not just short-lived jobs like cron. It's disconcerting to lose confidence in our primary o11y tooling during a process shutdown. I get that maintaining OSS is a thankless task (thank you!), but what can we do to move this PR forward? It's unacceptable IMO that I can't tell whether a pod that gets rescheduled has lost data or my tracer has just failed to flush its recordings. Does no one report problematic metrics and flaky alerts that can be attributed to this?
Without this API, the only way to get a truly graceful shutdown is at best a hopeful hack. Decreasing the flush interval has performance risks and is still nondeterministic. Increasing some global timeout before process.exit is also nondeterministic, and has additional pain points w.r.t. testing and local development, to say nothing of the finicky arithmetic one needs to do to set this timeout in coordination with any container orchestrator's timeouts.
@lukealbao Can you clarify your use case? From what I understand, you're saying that when an unhandled exception happens, you would like to know about it. However, when this does happen, whatever you're able to do in that handler we should be able to do automatically as well. To be clear, I'm not arguing against exposing `tracer.flush()`, and I'm happy to do so if the comments on this PR are addressed, but I think it would be even better to also have the most common use cases covered out of the box when the process is killed.
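Purely to illustrate the "out of the box" idea, the tracer could in principle register its own exit hooks so the common kill paths flush automatically; this is a hypothetical sketch, not current dd-trace behaviour:

```js
// Hypothetical: something the tracer itself could install internally.
for (const signal of ['SIGTERM', 'SIGINT']) {
  process.once(signal, async () => {
    await tracer.flush(); // proposed API from this PR
    process.kill(process.pid, signal); // re-raise so the default termination still happens
  });
}
```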
Hi @rochdev thanks for the reply -- no, my use case is not really different than the one described in the Motivation of the OP. We are using dd-trace in a long-lived service that has a code path for gracefully shutting down external handles before exiting the process, and it would be great to add the ability to do so for pending dd data. Currently we set a timeout greater than the flush interval, but this feels like too much finger-crossing.
My apologies, I didn't look close enough to notice that this PR seems to have been abandoned by the author, with lots of unfinished work. I'd be happy to commit to adopting this PR, but unfortunately my team is on v1 with no plans to upgrade at the moment, so I don't think I can find the time to do so. If we find the time to fork, we'll be sure to open a PR upstream. Thanks again.
I think this has passed my team by, unfortunately.
@rochdev The use case (for us at least) is that it's a fairly common flow to either:

1. have an `uncaughtException` / `unhandledRejection` handler which cleans up and then calls `process.exit(1)`; or
2. have scripts such as crons explicitly call `process.exit(0)` when done to force an immediate exit.

In both cases, calling `process.exit(...)` makes it likely that some DD traces will be lost because of the async flush logic. Many libraries with similar async flush logic (statsd, launchdarkly, segment, to name a few off the top of my head) expose such a flush/close method to let users implement a proper cleanup sequence before exiting.
The workaround is to instead add an arbitrary `setTimeout(3000)` before exiting, but this leaves much to be desired: it's slow and not even guaranteed to work.
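To make the two flows above concrete, here is a hedged sketch of what they could look like with the proposed `flush()`, next to the timeout workaround; `runJob()` is a placeholder for the actual cron work:

```js
// (1) an uncaughtException / unhandledRejection handler
process.on('uncaughtException', async (err) => {
  console.error(err);
  await tracer.flush(); // proposed: make sure the span carrying the error is exported
  process.exit(1);
});

// (2) a cron-style script that exits explicitly when done
async function main() {
  await runJob(); // placeholder for the actual work
  await tracer.flush(); // proposed: deterministic, instead of...
  // await new Promise((resolve) => setTimeout(resolve, 3000)); // ...the best-effort timeout workaround
  process.exit(0);
}

main();
```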