SIGSEGV when runtimeMetrics=true in a worker_threads worker when it terminates
dd-trace with runtimeMetrics=true seems to crash the process with a SIGSEGV when it is initialized inside a worker thread and that worker's terminate() is called.
Expected behaviour
The script should exit cleanly and print nothing.
Actual behaviour
node main.js
PID 29166 received SIGSEGV for address: 0x10f69cfac
0 segfault-handler.node 0x000000010f68534c _ZL16segfault_handleriP9__siginfoPv + 288
1 libsystem_platform.dylib 0x00000001ab4eb4a4 _sigtramp + 56
2 node 0x000000010357fe4c uv_run + 672
3 node 0x0000000102da7b14 _ZN4node6worker16WorkerThreadDataD2Ev + 204
4 node 0x0000000102da4874 _ZN4node6worker6Worker3RunEv + 684
5 node 0x0000000102da7bc8 _ZZN4node6worker6Worker11StartThreadERKN2v820FunctionCallbackInfoINS2_5ValueEEEEN3$_38__invokeEPv + 56
6 libsystem_pthread.dylib 0x00000001ab4d426c _pthread_start + 148
7 libsystem_pthread.dylib 0x00000001ab4cf08c thread_start + 8
zsh: segmentation fault node main.js
Steps to reproduce
100% reproducible on each run on macOS 12.5.1 (21G83). I also saw the same behavior on Linux.
cat <<EOT > main.js
const SegfaultHandler = require("segfault-handler");
const { isMainThread, Worker } = require("worker_threads");
SegfaultHandler.registerHandler("/tmp/crash.log");
async function main() {
const worker = new Worker("./worker.js");
await new Promise(r => setTimeout(r, 1000));
await worker.terminate();
}
main();
EOT
cat <<EOT > worker.js
const ddTrace = require("dd-trace");
ddTrace.init({ runtimeMetrics: true });
EOT
node main.js
Environment
- Operating system: macOS 12.5.1 (21G83)
- Node.js version: v16.11.1
- Tracer version: 3.9.3
{
"dependencies": {
"dd-trace": "^3.9.3"
}
}
Is this an M1/M2 machine or Intel? Your reproduction case seems to not trigger this on my M1 MBP. I'm on macOS 13 so could be a difference between OS versions or could be an architecture thing. 🤔
You also said it's reproducing on Linux? What are the details on that environment? (distro, kernel version, architecture, etc)
@Qard
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ arch
aarch64
$ uname -a
Linux server-001 5.15.0-1020-aws #24-Ubuntu SMP Thu Sep 1 16:05:45 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
$ node --version
v16.18.1
$ cat ../yarn.lock | grep dd-trace
dd-trace@^3.9.0:
resolved "https://registry.yarnpkg.com/dd-trace/-/dd-trace-3.9.0.tgz#192ee3d1c6e9d82ee81a01df4df6735b38601304"
$ node main.js
Segmentation fault (core dumped)
AWS image is /aws/service/canonical/ubuntu/server/22.04/stable/20220604/arm64/hvm/ebs-gp2/ami-id
Also @Qard, I sometimes see another SIGSEGV when profiling: true is set in a worker thread. I can't provide you with a 100% repro script with this, but I think it may be related (and it doesn't repro if I turn off profiling for a worker thread). It looks like this (see screenshot).
This is a side note and is NOT the main topic of this issue. The main issue (runtimeMetrics: true) is 100% reproducible on at least the two configurations above (macOS M1 and Linux aarch64); the effect of profiling: true is mentioned only for reference.

Thanks for the additional information!
The profiler issue is probably already fixed by DataDog/pprof-nodejs#71. As for the runtime metrics issue, I'll try to spin up a similar environment to see if I can reproduce and I'll get back to you on the result. Shouldn't be hard to track it down if I can get it to reproduce.
Seems to be crashing because of this: https://github.com/nodejs/node/commit/22cbbcf9d9374d4b663bf1409f292212fa57623a
The native metrics code currently does an asynchronous shutdown with uv_close(...), but that's invalid within AddEnvironmentCleanupHook. While it technically will not fail in the main thread, it apparently can fail in a worker thread. We'll have to rethink how we do the cleanup there. Thanks for the catch! Learned some interesting things about how Node.js handles thread cleanup. 😅
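Since the crash described above comes from the runtime metrics cleanup running in a worker thread, one possible stopgap is to request runtime metrics only on the main thread. This is a minimal sketch and not a confirmed fix: `buildTracerOptions` is a hypothetical helper that is not part of dd-trace, and it assumes that leaving runtimeMetrics off in workers keeps the native metrics code from loading there.

```javascript
const { isMainThread } = require("worker_threads");

// Hypothetical helper (not part of dd-trace): build init options so that
// runtime metrics are only requested on the main thread.
// Assumption: with runtimeMetrics off, the worker never sets up the native
// metrics handles whose cleanup triggers the crash.
function buildTracerOptions(onMainThread = isMainThread) {
  return { runtimeMetrics: onMainThread };
}

// Usage in main.js or worker.js (requires dd-trace to be installed):
// require("dd-trace").init(buildTracerOptions());
```

The trade-off is that workers report no runtime metrics at all, so this only makes sense while running a tracer version without the fix.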
@Qard Is there any timeline for fixing this? We're running into it too.
None yet. The cleanup logic in Node.js 14 and earlier conflicts with that of later Node.js versions, which means there is no safe way to do cleanup on older Node.js versions. We'll have to add some compile-time version checks to conditionally use the newer, safer cleanup system where it exists.
Ah, gotcha, ok. Appreciate the quick response!
Any updates on this one? Our logs are littered with SIGSEGV errors when using the dd-trace lib w/ node threads (bree.js).
What are your Node.js and dd-trace versions? It should be fixed on Node.js 16+, though the fix is only in the v4 line of dd-trace. It's not possible to fix on Node.js 14, as there was a timing bug in Node.js that made safe cleanup in a worker impossible, which also meant the change could not be backported to any dd-trace line that still supported Node.js 14.
We are using Node v18.17.0 (containerized) and dd-trace 3.32.1
Can you update to the latest 4.x release and see if that fixes it for you? It should be cleaning up properly in 4.x.
Hey @Qard, I bumped the package to 4.16.0 and we immediately get PM2 shutdown logs indicating SIGSEGV when the node_thread (bree job) exits.
@Qard Any suggestions based on the above feedback?
Any updates or estimated timeline on this? We've disabled dd-trace for multiple services because of this problem.
This issue still exists in the latest 5.x version of dd-trace.
A fix for this has been released in 5.25.0 and 4.49.0.
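To check whether an installed tracer is at or above the releases named above, a small version comparison can help. This is only a sketch: `hasWorkerTerminationFix` is a hypothetical helper, and it assumes that major versions after 5 also carry the fix, which this thread does not state.

```javascript
// Hypothetical helper: returns true when a dd-trace version string is at or
// above the releases named in this thread as containing the fix
// (4.49.0 on the 4.x line, 5.25.0 on the 5.x line).
// Assumption: major versions after 5 also include the fix.
function hasWorkerTerminationFix(version) {
  const [major, minor] = version.split(".").map((part) => parseInt(part, 10));
  if (major > 5) return true;
  if (major === 5) return minor >= 25;
  if (major === 4) return minor >= 49;
  return false; // 3.x and earlier never received the fix per this thread
}
```

Pass it the resolved version from your lockfile or from `npm ls dd-trace`, for example `hasWorkerTerminationFix("4.16.0")` for the version reported earlier in this thread.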