Async Hook instrumentation significantly increases response time for Promise-heavy calls
**Describe the bug**
We noticed a significant increase in response times for GraphQL queries that produce large payloads after we enabled dd-trace for our project.
Here's what we've seen:
| Date | Node Version | dd-trace Version | `scope` Setting | Response Size | Response Time without dd-trace | Response Time with dd-trace | Percentage Increase |
|---|---|---|---|---|---|---|---|
| 8/12/2020 | 13.14.0 | 0.20.3 | `async_hooks` | ~1.8MB | 2.8s | 30s | 971.4% |
| 8/12/2020 | 14.8.0 | 0.24.0 | `async_local_storage` | ~1.8MB | 2.7s | 7.4s | 174.1% |
| 8/12/2020 | 14.8.0 | 0.24.0 | `async_local_storage` | ~150kB | 140ms | 220ms | 57.1% |
| 9/23/2020 | 14.8.0 | 0.26.0 | `async_local_storage` | ~3.2MB | 3.7s | 8.7s | 135.1% |
| 9/23/2020 | 14.8.0 | 0.26.0 | `noop` | ~3.2MB | 3.7s | 3.8s | 2.7% |
| 9/23/2020 | 14.8.0 | `async_resource` branch | `async_resource` | ~3.2MB | 3.7s | 9.9s | 167.6% |
| 9/23/2020 | 14.8.0 | N/A (disabled): `const { createHook } = require('async_hooks'); createHook({ init() {} }).enable();` | N/A (disabled) | ~3.2MB | 3.7s | 7.4s | 100% |
As you can see, it looks like async_hooks are the primary reason we see the slowdown, and not necessarily anything specific to dd-trace.
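For context on the table's `scope` Setting column: it refers to which scope manager the tracer uses, selected at init time. A minimal sketch, assuming the `scope` init option from dd-trace 0.x-era configuration (check the docs for your tracer version; newer versions may not expose it, hence the cast):

```ts
import tracer from "dd-trace";

// Sketch: selecting the scope manager shown in the table's "scope
// Setting" column. The `scope` option is assumed from dd-trace 0.x-era
// configuration and may not exist in newer versions.
tracer.init({
  scope: "async_local_storage", // or "async_hooks" / "noop"
} as any);
```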
The logic that triggers this especially pronounced behavior uses the following:
- Joist ORM, which uses dataloader under the hood
- TypeScript transpiled code, targeting ES2019, meaning we're using `async`/`await` vs. `Promise`
The query itself produces many promises (I'm unsure of the exact number, but probably in the thousands), which we believe is exacerbating the problem.
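To make that fan-out concrete, here's a rough sketch (illustrative only, not our actual schema) of how a large list resolved through dataloader yields one promise per entity per relation:

```ts
import DataLoader from "dataloader";

// A large list resolved through dataloader creates one promise per
// entity per relation, so a multi-MB payload fans out into thousands
// of short-lived promises.
const authorLoader = new DataLoader(async (ids: readonly number[]) =>
  ids.map((id) => ({ id, name: `author ${id}` }))
);

const resolvers = {
  Query: {
    books: () => Array.from({ length: 5_000 }, (_, id) => ({ id, authorId: id })),
  },
  Book: {
    // One promise per book — this is where promise counts explode.
    author: (book: { authorId: number }) => authorLoader.load(book.authorId),
  },
};
```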
I was working synchronously with @rochdev and @stephenh on this issue via the Datadog Slack. He mentioned that he'll add some additional detail to this issue.
**Environment**
- Operating system: Docker, `node:14.8.0` image running on macOS host
- Node version: 14.8.0
- Tracer version: 0.26.0
- Agent version: 7.22.0
For context, here is a bit of history behind this issue since it's been problematic for a while now.
Historically, context propagation APIs for Node have been using monkey patching on asynchronous code to track the asynchronous context. This approach has been used successfully by APM vendors for many years, even though it didn't support native code and could lose the context in some cases.
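A minimal sketch of what that monkey-patching looked like in practice (illustrative names, not any particular vendor's implementation): the active context is captured when an async API is called and restored around the user's callback. `async`/`await` broke this model because promise continuations are scheduled inside V8, leaving nothing to patch.

```ts
// Sketch of the old monkey-patching approach to context propagation.
let currentContext: unknown;

function withContext<T>(ctx: unknown, fn: () => T): T {
  const prev = currentContext;
  currentContext = ctx;
  try {
    return fn();
  } finally {
    currentContext = prev;
  }
}

// Patch setTimeout so its callback runs with the context that was
// active when setTimeout was called.
const originalSetTimeout = global.setTimeout;
(global as any).setTimeout = (cb: (...args: any[]) => void, ms?: number, ...args: any[]) => {
  const captured = currentContext; // capture at call time
  return originalSetTimeout(() => withContext(captured, () => cb(...args)), ms);
};
```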
When async/await landed, monkey patching was no longer an option so everyone had to switch to async_hooks which is the official construct provided by Node to track the asynchronous context. Even though the module is still experimental to this day, it's the only way currently to do this. This is why it's always been the default implementation used by our scope manager in versions of Node that support it.
The main issue with async_hooks is that it's very heavy when used with promises. This is because the promise hooks exposed by v8 require a jump from JavaScript to native and back for every promise. There has been work done in the past few months to make async_hooks significantly lighter when a destroy hook is not used, for example when using AsyncLocalStorage, but there is still a fair bit of overhead when a lot of promises are created in quick succession. This means that even though we have a new scope manager that no longer relies on the destroy hook, we still incur this overhead.
Since the overhead comes from Node and not from the tracer, and we have to use this feature to support automatic context propagation, this is unfortunately an overhead we can't realistically avoid at the tracer level. It's also worth noting that in most cases, the overhead is low enough that it's not problematic.
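To see that hook overhead in isolation, independent of the tracer, a minimal sketch along the lines of the no-op hook row in the table above (absolute numbers will vary by machine and Node version):

```ts
import { createHook } from "async_hooks";

// Resolve a large number of promises back to back — the pattern that
// makes per-promise hook overhead visible.
async function burn(n: number): Promise<void> {
  for (let i = 0; i < n; i++) await Promise.resolve(i);
}

async function time(label: string): Promise<void> {
  const start = process.hrtime.bigint();
  await burn(1_000_000);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${ms.toFixed(1)} ms`);
}

(async () => {
  await time("no hooks");
  // Even an empty init hook forces a JS <-> native round trip per promise.
  createHook({ init() {} }).enable();
  await time("init-only hook");
})();
```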
In order to fix this completely, we'll have to fix async_hooks in Node itself, and also the corresponding promise hooks in v8. This is no small task and will take a while, but we'll update this issue as progress is made.
Progress for the v8 change can be tracked here.
Progress on the Node side can be tracked in https://github.com/nodejs/node/pull/36394
Node.js v16.2.0 includes significant performance improvements to async_hooks, so if you can test with that version of Node.js, hopefully it alleviates a lot of the overhead.
@bengl I don't think there will be any improvement; if anything, I assume there will be some degradation. According to this PR https://github.com/nodejs/node/pull/36394, when using the destroy hook (which you are using), there is a small degradation in performance:
Before:

```
$ ./node-master benchmark/async_hooks/promises.js
async_hooks/promises.js asyncHooks="enabled" n=1000000: 348,189.5714361237
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 212,124.80165853762
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 391,763.36348583025
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,670,456.5232560865
```

After:

```
$ ./node benchmark/async_hooks/promises.js
async_hooks/promises.js asyncHooks="enabled" n=1000000: 1,153,166.7820469614
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 207,516.71584447258
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 1,043,624.1457311497
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,603,204.3180576847
```
Here you set the destroy hook: https://github.com/DataDog/dd-trace-js/blob/master/packages/dd-trace/src/scope/async_hooks.js#L37
The async_hooks scope manager is only used in old versions of Node. Any recent version, such as the latest 12.x, 14.x, or 16.x, instead uses the async_resource scope manager, which only uses the init hook.
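As a rough illustration of why init-only is cheaper: context can be inherited in init alone, with no destroy hook, at the cost of needing some other cleanup strategy for the map in a real implementation. A sketch, not dd-trace's actual scope manager:

```ts
import { createHook, executionAsyncId } from "async_hooks";

// Sketch: init-only context propagation. Each new async resource
// inherits its trigger's context at init time.
const contexts = new Map<number, unknown>();

createHook({
  init(asyncId, _type, triggerAsyncId) {
    const parent = contexts.get(triggerAsyncId);
    // Without a destroy hook, real code needs another strategy to keep
    // this map from growing unboundedly.
    if (parent !== undefined) contexts.set(asyncId, parent);
  },
}).enable();

export function runWithContext<T>(ctx: unknown, fn: () => T): T {
  contexts.set(executionAsyncId(), ctx);
  return fn();
}

export function currentContext(): unknown {
  return contexts.get(executionAsyncId());
}
```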
Fwiw, today we tried bumping to node 16.2.0 to get this perf fix, and are seeing errors in our app that uses this admittedly somewhat esoteric pattern with async_hooks:
```ts
import { AsyncLocalStorage } from "async_hooks";
// ...
export const currentFlushSecret = new AsyncLocalStorage<{ flushSecret: number }>();
// ...
currentFlushSecret.run({ flushSecret: this.flushSecret }, async () => {
  // ...call various methods that do...
  const { flushSecret } = currentFlushSecret.getStore() || {};
  // ...and ensure flushSecret is the expected value
  if (flushSecret !== this.flushSecret) {
    throw new Error("invalid usage detected");
  }
});
```
The idea is that only code that is specifically run within the `run` callback is allowed to touch/mutate the `this` / `EntityManager` instance, and all other mutations will be rejected. Basically we're using it as a hacky thread-local-ish mutex/lock.
Previously this "invalid usage detected" wasn't hit in our application's code, but it is being hit now in our test suite, suggesting either that a) our application code had a bug that node 16.1.0 wasn't catching because this AsyncLocalStorage was propagating a flushSecret when it wasn't supposed to, or b) node 16.2.0 has a regression where AsyncLocalStorage is not propagating flushSecret (to one of the methods invoked directly in, or transitively from, the run callback) when it is supposed to.
The ^ code is part of an open source project:
https://github.com/stephenh/joist-ts/blob/main/packages/orm/src/EntityManager.ts#L202
Although our currently failing tests are in an internal project. If we run the joist-ts public tests on node 16.2.0, they work just fine (and we do have tests that specifically exercise the flush secret behavior), so we don't have a repro yet.
We're going to work more tomorrow on getting a repro done, in the public joist project, and verifying whether it is truly a regression in 16.2.0, or just a bug/misuse of AsyncLocalStorage (in this joist project), or a bug/misuse in our application itself (that is now getting correctly detected).
So, I dunno, this is not a super-useful update yet, but mostly wanted to post as an FYI in case anyone on the datadog/nodejs side of things would look at the AsyncLocalStorage usage and immediately intuit "oh yes, we're doing something wrong" or "oh shoot, something in the node 16.2.0 changeover overlooked this".
cc @Qard who would have the most insight to determine at a glance if there could be a regression causing this issue.
Okay @rochdev and @Qard , we believe we have a reproduction in this PR:
https://github.com/stephenh/joist-ts/pull/122/commits
Notice that it passes on 16.1.0:
https://app.circleci.com/pipelines/github/stephenh/joist-ts/547/workflows/247c2b9d-8d01-463f-bfce-f1163a36f221/jobs/581
But fails on 16.2.0:
https://app.circleci.com/pipelines/github/stephenh/joist-ts/547/workflows/247c2b9d-8d01-463f-bfce-f1163a36f221/jobs/579
A few notes:
- This was actually easy to reproduce in this open source project, once we stopped passing `--detectOpenHandles` to jest
- Similarly, running the tests with `--inspect-brk` (i.e. debugging the issue in webstorm/etc) makes the issue go away and the tests pass
- You can run an individual test, i.e. this one, and see the issue without running the whole test suite
- The `Author.test.ts` assumes you run `make db` in the `packages/integration-test` folder to get the postgres db set up and running. Also a `yarn` + `yarn build -w` in the top-level directory is useful.
- I've added some debugging code that shows the issue in this file
Specifically, we log the intent to do an `AsyncLocalStorage.run`:

```ts
console.log(`Starting new AsyncLocalStorage.run with flushSecret=${this.flushSecret}`);
```

On the very first line within that `run` callback, we call `getStore()` to look up our local storage value:

```ts
console.log(`Inside AsyncLocalStorage.run, flushSecret=${b}`);
```

And this works, it returns `1`. We then `await` a few helper methods and then, within the same `run` callback that did the "Inside" log, do another `console.log`:

```ts
console.log(`Inside AsyncLocalStorage.run, flushSecret=${a}`);
```

And this one is broken: we get back `undefined` instead of `1`.
So, at some point between line 715 and line 731, we have lost track of the AsyncLocalStorage async resource.
Happy to chat more, but does this give enough direction to reproduce on your side? Apologies that this project doesn't have the best "Getting Started"/etc documentation, so let me know if there are any gotchas in getting it running.
(Also, technically this is likely an issue in node itself, as this joist open source project doesn't use ddtrace; we just happened to notice the issue while getting our internal project, which uses both ddtrace and joist, onto node 16.2.0 to leverage the ddtrace perf wins in this ticket. Would you like us to open a new issue against nodejs directly? Happy to let you guys do that. Disclaimer: assuming we're not doing something dumb on our side.)
Thanks!
@stephenh I'm able to reproduce with the project you've shared, but without being able to debug and with a codebase I don't know it's difficult to try to find the issue. Do you think you'd be able to extract only a small part of the code that would still reproduce the issue? Or maybe find the exact line where the context is lost? I'm not super familiar with Jest either which definitely doesn't help 😅
@rochdev that's a fair ask! I think we were so pleased with ourselves for having isolated the repro that we didn't push further to make it as simple as possible. :-)
It is terribly ironic you mention jest, because that might be the issue...
I've created a minimal ~10 lines of code repro here:
https://github.com/stephenh/async-local-storage-repro
I'll defer to the readme for steps, but basically if you run code that 1) accesses `AsyncLocalStorage.getStore`, 2) awaits a promise, and 3) accesses `AsyncLocalStorage.getStore` again, on node 16.2.0 under jest, the `getStore` value is lost and comes back as `undefined`.
The matrix is:
- Node 16.2.0 + just node: works
- Node 16.2.0 + via jest: broken
- Node 16.1.0 + just node: works
- Node 16.1.0 + via jest: works
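The gist of the repro, as a runnable sketch (the real code is in the linked repository):

```ts
import { AsyncLocalStorage } from "async_hooks";

const storage = new AsyncLocalStorage<number>();

test("getStore survives an await", async () => {
  await storage.run(1, async () => {
    expect(storage.getStore()).toBe(1); // passes everywhere
    await Promise.resolve();
    // Under jest on Node 16.2.0 this came back undefined instead of 1.
    expect(storage.getStore()).toBe(1);
  });
});
```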
Possible cause of the new issue since 16.2.0: https://github.com/nodejs/node/issues/38781
It looks like a fix should land in Node 16.3.0.
Closing as this seems to be resolved. If you disagree or there's anything I missed, let me know and I can reopen.
Sadly, I think this is still a problem with async_hooks in general. I tested a particularly slow/complex query on a client's backend in a few different scenarios (see also #2385). Node 18.13.0 in docker on macOS.
| JIT enabled? | dd-trace enabled? | Average response time over 50 requests |
|---|---|---|
| √ | X | 4064.92 ms |
| X | X | 4863.24 ms |
| √ | √ | 5898.32 ms |
| X | √ | 8325.02 ms |
When JIT is enabled, it significantly reduces the number of spans in our traces. As you can see from the results, enabling dd-trace makes our requests significantly slower.
I was talking with @rochdev and he had me add a no-op async_hook to try to separate dd-trace from general async_hook overhead. The hook looked like this:
```ts
import { createHook } from "node:async_hooks";

createHook({ init() {}, before() {}, after() {}, destroy() {} }).enable();
```
Just having an async_hook without dd-trace set up increased the average response time from 4064.92 ms to 7994.8 ms. So, even though dd-trace is adding some additional overhead, the vast majority of the problem comes from async_hooks.
It seems like maybe the async_hooks approach might be a dud, since it performs so poorly?
Yep, for exactly this reason there's been a bunch of discussion in Node.js core about a more performant replacement for AsyncLocalStorage which does not use async_hooks internally. There's unfortunately no alternative currently, so the options are to accept some overhead or have no observability. It's a lot less overhead than it used to be, but it can still be noticeable depending on how heavily an application uses promises. I'm hopeful that we'll have something much better in future Node.js versions though. :)
Discussion about the above can be found in https://github.com/nodejs/node/issues/46265