Deadlock at process shutdown
Version
v23.0.0-pre
Platform
Linux USR-PC2HD4AD 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Subsystem
No response
What steps will reproduce the bug?
There is a deadlock that prevents the Node.js process from exiting. The issue is causing a lot (all?) of timeout failures in our CI. It can be reproduced by running a test in parallel with our test.py tool, for example:
$ ./tools/test.py --repeat=10000 test/parallel/test-http2-large-write-multiple-requests.js
See also
- https://github.com/nodejs/node/issues/54133#issuecomment-2343320802
- https://github.com/nodejs/node/issues/52550#issuecomment-2343204697
- https://github.com/nodejs/node/pull/52959#issuecomment-2108711217
How often does it reproduce? Is there a required condition?
Rarely, but often enough to be a serious issue for CI.
What is the expected behavior? Why is that the expected behavior?
The process exits.
What do you see instead?
The process does not exit.
Additional information
Attaching gdb to two of the hanging processes obtained from the command above produces the following outputs:
$ gdb -p 15884
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 15884
[New LWP 15953]
[New LWP 15952]
[New LWP 15951]
[New LWP 15950]
[New LWP 15900]
[New LWP 15889]
[New LWP 15888]
[New LWP 15887]
[New LWP 15886]
[New LWP 15885]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument. Continuing without source file ./nptl/./nptl/futex-internal.c.
0x00007f432025fd61 in __futex_abstimed_wait_common64 (private=21855, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:57
warning: 57 ./nptl/futex-internal.c: No such file or directory
(gdb) bt
#0 0x00007f432025fd61 in __futex_abstimed_wait_common64 (private=21855, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=21855, abstime=0x0, clockid=0, expected=0,
futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555fa9fce8f0, expected=expected@entry=0,
clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3 0x00007f43202627dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555fa9fce870,
cond=0x555fa9fce8c8) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x555fa9fce8c8, mutex=0x555fa9fce870) at ./nptl/pthread_cond_wait.c:627
#5 0x0000555fa3d30d7d in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>)
at ../deps/uv/src/unix/thread.c:814
#6 0x0000555fa2c8c063 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
#7 0x0000555fa2ac528d in node::SpinEventLoopInternal(node::Environment*) ()
#8 0x0000555fa2c4abc4 in node::NodeMainInstance::Run() ()
#9 0x0000555fa2b9233a in node::Start(int, char**) ()
#10 0x00007f43201f11ca in __libc_start_call_main (main=main@entry=0x555fa2ababa0 <main>, argc=argc@entry=2,
argv=argv@entry=0x7ffe74216608) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007f43201f128b in __libc_start_main_impl (main=0x555fa2ababa0 <main>, argc=2, argv=0x7ffe74216608,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe742165f8)
at ../csu/libc-start.c:360
#12 0x0000555fa2ac06a5 in _start ()
(gdb)
$ gdb -p 15885
Attaching to process 15885
Reading symbols from /home/luigi/node/out/Release/node...
Reading symbols from /lib/x86_64-linux-gnu/libstdc++.so.6...
Reading symbols from /home/luigi/.cache/debuginfod_client/40b9b0d17fdeebfb57331304da2b7f85e1396ef2/debuginfo...
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...
Reading symbols from /usr/lib/debug/.build-id/90/32976b3ecc78d1362fedfcd88528562bbfb7e4.debug...
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...
Reading symbols from /home/luigi/.cache/debuginfod_client/92123f0e6223c77754bac47062c0b9713ed363df/debuginfo...
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
Reading symbols from /usr/lib/debug/.build-id/6d/64b17fbac799e68da7ebd9985ddf9b5cb375e6.debug...
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/35/3e1b6cb0eebc08cf3ff812eae8a51b4efd684e.debug...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument. Continuing without source file ./misc/../sysdeps/unix/sysv/linux/epoll_pwait.c.
0x00007f43202f0ee0 in __GI_epoll_pwait (epfd=epfd@entry=9, events=events@entry=0x7f43201bdcb0,
maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0)
at ../sysdeps/unix/sysv/linux/epoll_pwait.c:40
warning: 40 ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory
(gdb) bt
#0 0x00007f43202f0ee0 in __GI_epoll_pwait (epfd=epfd@entry=9, events=events@entry=0x7f43201bdcb0,
maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0)
at ../sysdeps/unix/sysv/linux/epoll_pwait.c:40
#1 0x0000555fa3d364e7 in uv__io_poll (loop=loop@entry=0x555fa9fcea88, timeout=<optimized out>)
at ../deps/uv/src/unix/linux.c:1432
#2 0x0000555fa3d20c17 in uv_run (loop=0x555fa9fcea88, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:448
#3 0x0000555fa2c8d4b1 in node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start()::{lambda(void*)#1}::_FUN(void*)
()
#4 0x00007f4320263a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#5 0x00007f43202f0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb)
As written in https://github.com/nodejs/node/issues/52550#issuecomment-2345613793, applying the patch from https://github.com/nodejs/node/pull/47452 seems to fix the issue, but it is not a solution, since that patch introduces other issues.
This can be seen in various CI runs; assuming confirmed-bug.
The output of lldb on macOS:
$ lldb -p 5033
(lldb) process attach --pid 5033
Process 5033 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00007ff802363c3a libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`:
-> 0x7ff802363c3a <+10>: jae 0x7ff802363c44 ; <+20>
0x7ff802363c3c <+12>: movq %rax, %rdi
0x7ff802363c3f <+15>: jmp 0x7ff8023617e0 ; cerror_nocancel
0x7ff802363c44 <+20>: retq
Target 0: (node) stopped.
Executable module set to "/Users/luigi/code/node/out/Release/node".
Architecture set to: x86_64h-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007ff802363c3a libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007ff8023a16f3 libsystem_pthread.dylib`_pthread_cond_wait + 1211
frame #2: 0x000000010df93b23 node`uv_cond_wait(cond=0x00007f9a4d404108, mutex=0x00007f9a4d404098) at thread.c:798:7 [opt]
frame #3: 0x000000010d0be52b node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x00007f9a4d404108, mutex=0x00007f9a4d404098) at node_mutex.h:175:5 [opt]
frame #4: 0x000000010d0be511 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x00007f9a4d404108, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
frame #5: 0x000000010d0be511 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x00007f9a4d404098) at node_platform.cc:640:20 [opt]
frame #6: 0x000000010d0be4fb node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x00007f9a4d404098) at node_platform.cc:214:25 [opt]
frame #7: 0x000000010d0be4fb node`node::NodePlatform::DrainTasks(this=0x0000600002718000, isolate=<unavailable>) at node_platform.cc:467:33 [opt]
frame #8: 0x000000010cf18a50 node`node::SpinEventLoopInternal(env=0x00007f9a2c814e00) at embed_helpers.cc:44:17 [opt]
frame #9: 0x000000010d084083 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x00007f9a2c814e00) at node_main_instance.cc:111:9 [opt]
frame #10: 0x000000010d083fe9 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
frame #11: 0x000000010cfe4159 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1489:24 [opt]
frame #12: 0x000000010cfe3f54 node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1496:27 [opt]
frame #13: 0x00007ff802015345 dyld`start + 1909
(lldb)
@lpinca could you try https://github.com/nodejs/node/pull/47452/commits/ab23463bf5785ca1323922c3c5e6a2bdf09d21dd? I'm not able to reproduce the deadlock locally, so I can't confirm if it is effective.
@aduh95 see https://github.com/nodejs/node/issues/54918#issuecomment-2348295209. It seems to be the same patch.
No, it's not; it's based on it, but it's meant to fix the test failures.
Full diff is https://patch-diff.githubusercontent.com/raw/nodejs/node/pull/47452.diff
The issue persists.
$ gdb -p 18560
Attaching to process 18560
[New LWP 18621]
[New LWP 18620]
[New LWP 18619]
[New LWP 18618]
[New LWP 18574]
[New LWP 18565]
[New LWP 18564]
[New LWP 18563]
[New LWP 18562]
[New LWP 18561]
Downloading separate debug info for /lib/x86_64-linux-gnu/libstdc++.so.6
Downloading separate debug info for system-supplied DSO at 0x7ffdf50f4000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument. Continuing without source file ./nptl/./nptl/futex-internal.c.
0x00007fca5f88ad61 in __futex_abstimed_wait_common64 (private=21976, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:57
warning: 57 ./nptl/futex-internal.c: No such file or directory
(gdb) bt
#0 0x00007fca5f88ad61 in __futex_abstimed_wait_common64 (private=21976, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=21976, abstime=0x0, clockid=0, expected=0,
futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55d834ea10e0, expected=expected@entry=0,
clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3 0x00007fca5f88d7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d834ea1060,
cond=0x55d834ea10b8) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x55d834ea10b8, mutex=0x55d834ea1060) at ./nptl/pthread_cond_wait.c:627
#5 0x000055d82f131fed in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>)
at ../deps/uv/src/unix/thread.c:814
#6 0x000055d82e08c853 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
#7 0x000055d82dec528d in node::SpinEventLoopInternal(node::Environment*) ()
#8 0x000055d82e04b364 in node::NodeMainInstance::Run() ()
#9 0x000055d82df9276a in node::Start(int, char**) ()
#10 0x00007fca5f81c1ca in __libc_start_call_main (main=main@entry=0x55d82debac50 <main>, argc=argc@entry=2,
argv=argv@entry=0x7ffdf50a9a48) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007fca5f81c28b in __libc_start_main_impl (main=0x55d82debac50 <main>, argc=2, argv=0x7ffdf50a9a48,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdf50a9a38)
at ../csu/libc-start.c:360
#12 0x000055d82dec0785 in _start ()
(gdb)
Also, test/wasi/test-wasi-initialize-validation.js and test/wasi/test-wasi-start-validation.js fail.
$ ./node test/wasi/test-wasi-initialize-validation.js
(node:13188) ExperimentalWarning: WASI is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
Mismatched noop function calls. Expected exactly 1, actual 0.
at Proxy.mustCall (/home/luigi/node/test/common/index.js:451:10)
at Object.<anonymous> (/home/luigi/node/test/wasi/test-wasi-initialize-validation.js:154:18)
at Module._compile (node:internal/modules/cjs/loader:1557:14)
at Module._extensions..js (node:internal/modules/cjs/loader:1702:10)
at Module.load (node:internal/modules/cjs/loader:1328:32)
at Module._load (node:internal/modules/cjs/loader:1138:12)
at TracingChannel.traceSync (node:diagnostics_channel:315:14)
at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)
$ ./node test/wasi/test-wasi-start-validation.js
(node:13199) ExperimentalWarning: WASI is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
Mismatched noop function calls. Expected exactly 1, actual 0.
at Proxy.mustCall (/home/luigi/node/test/common/index.js:451:10)
at Object.<anonymous> (/home/luigi/node/test/wasi/test-wasi-start-validation.js:151:18)
at Module._compile (node:internal/modules/cjs/loader:1557:14)
at Module._extensions..js (node:internal/modules/cjs/loader:1702:10)
at Module.load (node:internal/modules/cjs/loader:1328:32)
at Module._load (node:internal/modules/cjs/loader:1138:12)
at TracingChannel.traceSync (node:diagnostics_channel:315:14)
at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)
Hey, I've added the https://github.com/nodejs/node/labels/help%20wanted label to this issue. While all issues deserve TLC from volunteers, this one is causing a serious CI issue, and a little extra attention wouldn't hurt.
From my testing, the following snippet hangs occasionally, which might be related to this:
https://github.com/nodejs/node/blob/2545b9e02cbdb208d531d1a04149f8b35a82ca5c/src/node_platform.cc#L640
outstanding_tasks_ is 2 when it hangs, 0 otherwise. It also looks like one of those tasks is completed, but the other one is not?
Some task isn't completing (but that was probably already known). Do we know which task, or how to find out?
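For reference, the drain machinery in node_platform.cc boils down to a classic counted-drain pattern. A minimal standalone model of it (a sketch with simplified names, std::mutex instead of Node's LibuvMutexTraits, and plain ints instead of v8::Task; not the actual source) shows why the process hangs: BlockingDrain() is waiting for a count that never reaches zero.
#include <condition_variable>
#include <mutex>
#include <queue>

// Simplified model of the TaskQueue pattern (not the actual Node source).
// A task counts as "outstanding" from Push() until NotifyOfCompletion(),
// so BlockingDrain() returns only once every popped task has also finished.
class TaskQueue {
 public:
  void Push(int task) {
    std::lock_guard<std::mutex> lock(mutex_);
    outstanding_tasks_++;  // counted at enqueue time
    queue_.push(task);
    tasks_available_.notify_one();
  }

  int BlockingPop() {  // worker side: sleeps until work arrives
    std::unique_lock<std::mutex> lock(mutex_);
    tasks_available_.wait(lock, [this] { return !queue_.empty(); });
    int task = queue_.front();
    queue_.pop();
    return task;
  }

  void NotifyOfCompletion() {  // worker side: the popped task finished running
    std::lock_guard<std::mutex> lock(mutex_);
    if (--outstanding_tasks_ == 0) tasks_drained_.notify_all();
  }

  void BlockingDrain() {  // main thread: this is the wait seen in the traces
    std::unique_lock<std::mutex> lock(mutex_);
    tasks_drained_.wait(lock, [this] { return outstanding_tasks_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable tasks_available_;
  std::condition_variable tasks_drained_;
  std::queue<int> queue_;
  int outstanding_tasks_ = 0;
};
With outstanding_tasks_ stuck at 2, either two workers never finished their tasks or the completion notifications never arrived.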
Update 1
It seems a task is getting stuck during BlockingPop, which might be connected to the issue.
~~Update 2~~
~~The NodeMainInstance is never properly destructed, which prevents the NodePlatform from shutting down. However, this situation appears to be a catch-22. The NodeMainInstance seems to finish only after DrainTasks is completed, but DrainTasks can't finish until NodeMainInstance is shut down.~~
~~Update 3~~
~~Removing platform->DrainTasks(isolate); from SpinEventLoopInternal prevents the error from occurring (the draining still happens elsewhere without interruption, although other errors persist).~~
Update 4
It's not a catch-22, because the process appears to hang before process destruction would even occur. I suspect that if you bt all the threads, you'll see that DrainTasks isn't the only thing hanging.
Indeed, our hanging worker thread appears to be:
Thread 8 (Thread 0x7fb7c74006c0 (LWP 1322244) "node"):
#0 0x00007fb7cd49e22e in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55a4a5eefdc8) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x55a4a5eefdc8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2 0x00007fb7cd49e2ab in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55a4a5eefdc8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3 0x00007fb7cd4a0990 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55a4a5eefd78, cond=0x55a4a5eefda0) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x55a4a5eefda0, mutex=0x55a4a5eefd78) at ./nptl/pthread_cond_wait.c:618
#5 0x000055a46a13ad8c in void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::{lambda()#1}>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::{lambda()#1})::{lambda()#1}>(heap::base::Stack*, void*, void const*) ()
#6 0x000055a46acb27e3 in PushAllRegistersAndIterateStack ()
#7 0x000055a46a13b2df in v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*) ()
#8 0x000055a46a1ad972 in v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) ()
#9 0x000055a46a1ae0a8 in v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) ()
#10 0x000055a46a1e4891 in v8::internal::LocalFactory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) ()
#11 0x000055a46a17a4fe in v8::internal::FactoryBase<v8::internal::LocalFactory>::NewProtectedFixedArray(int) ()
#12 0x000055a46a364a6d in v8::internal::DeoptimizationData::New(v8::internal::LocalIsolate*, int) ()
#13 0x000055a46a7b8d3f in v8::internal::maglev::MaglevCodeGenerator::GenerateDeoptimizationData(v8::internal::LocalIsolate*) ()
#14 0x000055a46a7b991b in v8::internal::maglev::MaglevCodeGenerator::BuildCodeObject(v8::internal::LocalIsolate*) [clone .part.0] ()
#15 0x000055a46a7d9948 in v8::internal::maglev::MaglevCodeGenerator::Assemble() ()
#16 0x000055a46a82f269 in v8::internal::maglev::MaglevCompiler::Compile(v8::internal::LocalIsolate*, v8::internal::maglev::MaglevCompilationInfo*) ()
#17 0x000055a46a830b49 in v8::internal::maglev::MaglevCompilationJob::ExecuteJobImpl(v8::internal::RuntimeCallStats*, v8::internal::LocalIsolate*) ()
#18 0x000055a469ffb6bb in v8::internal::OptimizedCompilationJob::ExecuteJob(v8::internal::RuntimeCallStats*, v8::internal::LocalIsolate*) ()
#19 0x000055a46a831047 in v8::internal::maglev::MaglevConcurrentDispatcher::JobTask::Run(v8::JobDelegate*) ()
#20 0x000055a46b21b45f in v8::platform::DefaultJobWorker::Run() ()
#21 0x000055a469cb125f in node::(anonymous namespace)::PlatformWorkerThread(void*) ()
#22 0x00007fb7cd4a1732 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#23 0x00007fb7cd51c2b8 in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
Update 5
The hanging task is pushed via the following bt:
1: 0x5582238b587b node::TaskQueue<v8::Task>::Push(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [out/Release/node]
2: 0x5582238b3d47 node::NodePlatform::PostTaskOnWorkerThreadImpl(v8::TaskPriority, std::unique_ptr<v8::Task, std::default_delete<v8::Task> >, v8::SourceLocation const&) [out/Release/node]
3: 0x558224e25304 v8::platform::DefaultJobState::CallOnWorkerThread(v8::TaskPriority, std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [out/Release/node]
4: 0x558224e2548f [out/Release/node]
5: 0x558223c0ca77 [out/Release/node]
6: 0x558223c0e337 v8::internal::Compiler::CompileOptimized(v8::internal::Isolate*, v8::internal::Handle<v8::internal::JSFunction>, v8::internal::ConcurrencyMode, v8::internal::CodeKind) [out/Release/node]
7: 0x55822423e64a v8::internal::Runtime_CompileOptimized(int, unsigned long*, v8::internal::Isolate*) [out/Release/node]
8: 0x5581c496f276
So it looks like this is a V8 hanging-task issue? I'll reach out to them and see if they know anything. (https://issues.chromium.org/issues/374285493)
From the discussion in https://issues.chromium.org/issues/374285493 (closed as Won't Fix - Intended Behavior), the issue is not in V8 but in Node.js.
@Qard did you work on a similar issue?
You're thinking of https://github.com/nodejs/node/pull/56191? I don't think it's related, but maybe something that was expected to happen no longer happens with the bailout when it can't call into JS? I can't see how that'd be possible, though.
I mean that it might be a similar bug, more or less in the same spot.
@bnoordhuis do you think this is something you can take a look at? I think we are out of luck on this one.
I don't have much spare time at the moment but if the TSC wants to consult with me, I can probably carve out a few hours next month.
Possible hint? https://github.com/holepunchto/bare/pull/54
Yep, I know what the issue is; it's just a lot of work to fix. It's a mutual locking problem. The main thread calls DrainTasks, which waits for all the tasks currently queued in NodePlatform to complete. If one of those tasks happens to be a concurrent compilation, it will likely need to allocate. If it allocates, it may need to GC. If it GCs, it will lock the main thread in the middle of its own lock and deadlock the process.
The hard part is not so much identifying the issue; it's that fixing it requires pretty much a complete rewrite of NodePlatform, which is a non-trivial amount of work.
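That cycle can be reduced to a dozen lines. A minimal standalone repro of the lock structure (hypothetical names, not Node or V8 code; it deadlocks by construction and never exits):
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool task_done = false;           // set by the worker when its task finishes
bool main_thread_helped = false;  // never set: the main thread is parked

void worker() {  // stands in for the concurrent compilation task
  std::unique_lock<std::mutex> lock(m);
  // The task needs the main thread's cooperation (stands in for GC)...
  cv.wait(lock, [] { return main_thread_helped; });
  task_done = true;
  cv.notify_all();
}

int main() {
  std::thread t(worker);
  std::unique_lock<std::mutex> lock(m);
  // ...but the main thread is already parked waiting for the task to
  // complete (stands in for DrainTasks), so neither side can make progress.
  cv.wait(lock, [] { return task_done; });
  t.join();
}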
I took a look and built on the idea in https://github.com/nodejs/node/pull/47452, but still blocking on kUserBlocking tasks (which is also what the documentation suggests, IIUC). That seems to make the deadlock go away according to the stress tests: https://github.com/nodejs/node/pull/58047
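Roughly, the shape of that change: count outstanding tasks per priority and have the shutdown drain wait only on the user-blocking bucket. A sketch of the idea (hypothetical names and a local TaskPriority enum, not the PR's actual code):
#include <array>
#include <condition_variable>
#include <mutex>

// Mirrors v8::TaskPriority in spirit; defined locally for this sketch.
enum class TaskPriority { kBestEffort, kUserVisible, kUserBlocking, kCount };

class PriorityDrainQueue {
 public:
  void OnTaskPosted(TaskPriority p) {
    std::lock_guard<std::mutex> lock(mutex_);
    outstanding_[static_cast<size_t>(p)]++;
  }
  void OnTaskCompleted(TaskPriority p) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--outstanding_[static_cast<size_t>(p)] == 0) drained_.notify_all();
  }
  void BlockingDrain() {
    std::unique_lock<std::mutex> lock(mutex_);
    // Only kUserBlocking tasks keep the drain alive; a stuck best-effort
    // compilation job can no longer park the main thread at shutdown.
    drained_.wait(lock, [this] {
      return outstanding_[static_cast<size_t>(TaskPriority::kUserBlocking)] == 0;
    });
  }
 private:
  std::mutex mutex_;
  std::condition_variable drained_;
  std::array<int, static_cast<size_t>(TaskPriority::kCount)> outstanding_{};
};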
I changed https://github.com/nodejs/node/pull/58047 to only reference this issue instead of closing it, because I think that PR eliminates only a subset of the deadlocks, not all of them. It may still be possible for a user-blocking worker task to post a foreground task right before the foreground thread goes to sleep; it's just that most user-blocking worker tasks tend not to do that (though it is still possible, e.g. with the high-priority concurrent compilation flags, or with a major GC triggered by the heap reaching a hard limit), or they are less likely to hit that unfortunate timing. So I left a big FIXME comment in the source describing the situation.