node
node copied to clipboard
src: only block on user blocking worker tasks
test: remove deadlock workaround
src: add more debug logs and comments in NodePlatform
src: use priority queue to run worker tasks
According to the documentation, the v8 tasks should be executed based on priority. Previously we always execute the tasks in FIFO order, this changes the NodePlatform implementation to execute the higher priority tasks first. The tasks used to schedule timers for the delayed tasks are run in FIFO order since priority is irrelavent for the timer scheduling part while the tasks unwrapped by the timer callbacks are still ordered by priority.
src: only block on user blocking worker tasks
we should not be blocking on the worker tasks on the main thread in one go. Doing so leads to two problems:
- If any of the worker tasks post another foreground task and wait for it to complete, and that foreground task is posted right after we flush the foreground task queue and before the foreground thread goes into sleep, we'll never be able to wake up to execute that foreground task and in turn the worker task will never complete, and we have a deadlock.
- Worker tasks can be posted from any thread, not necessarily associated with the current isolate, and we can be blocking on a worker task that is associated with a completely unrelated isolate in the event loop. This is suboptimal.
However, not blocking on the worker tasks at all can lead to loss of some critical user-blocking worker tasks e.g. wasm async compilation tasks, which should block the main thread until they are completed, as the documentation suggets. As a compromise, we currently only block on user-blocking tasks to reduce the chance of deadlocks while making sure that criticl user-blocking tasks are not lost.
Refs: https://github.com/nodejs/node/pull/47452 Refs: https://github.com/nodejs/node/issues/54918
This seems to fix https://github.com/nodejs/node/issues/54918 - I started two stress test jobs before opening this PR:
- With just the test changes that reverted the workarounds: https://ci.nodejs.org/job/node-stress-single-test/565/ β
- With the changes in this PR and the test workaround reverts: https://ci.nodejs.org/job/node-stress-single-test/566/ β
And the CI seems to be happy about it: https://ci.nodejs.org/job/node-test-commit/79439/
I am being somewhat lazy here to achieve the "only block on user blocking worker tasks" but only counting them as outstanding tasks. Need another look to make sure I am not throwing other tasks into the queue in the delayed scheduler. Also I am just speculating that worker tasks that would post another foreground task shouldn't be kUserBlocking, or at least I haven't found any so far (the ones that are causing the deadlocks that I can reproduce are just kUserVisible GC tasks). May need to confirm this.
CI: https://ci.nodejs.org/job/node-test-pull-request/66479/
I confirm that this patch fixes the issue on my machine.
I tested this with https://github.com/nodejs/node/pull/58070 which is also seeing a deadlock in the heapdump tests. The good news is that this is suffice to fix the deadlock by default. The bad news is that --concurrent-maglev-high-priority-threads can enqueue a background task that is kUserBlocking and block on foreground tasks, which can trigger the deadlock again (although, that is an opt-in flag, and likely only intended for testing, but that also means this kind of tasks are not off the table).
I think at the end of the day, the ideal solution should still avoid doing a blocking drain of any worker tasks on the main thread (but this requires some other tinkering to make sure that user blocking tasks block the event loop from exiting to prevent important tasks from being dropped), or re-architect the blocking mechanism to allow foreground task posted by worker tasks to wake the main thread up and run foreground tasks (maybe it's simple enough to implement, maybe it needs extra locks which can also incur other problems, I don't know for sure yet). But this fix might still be good enough for now since user blocking background tasks are rare, or at least it does make things better than how things currently are.
Alternative fix which seems to also fix the user-blocking background tasks locally when built together with https://github.com/nodejs/node/pull/58070: https://github.com/joyeecheung/node/tree/fix-deadlock-2
Starting a stress test here: https://ci.nodejs.org/job/node-stress-single-test/569 CI with https://github.com/nodejs/node/pull/58070: https://ci.nodejs.org/job/node-test-commit/79519/
It seems to introduce timeouts in other tests.
CI: https://ci.nodejs.org/job/node-test-pull-request/66549/
Trying to see if it still works after the V8 upgrade. On the other hand it seems I can no longer reproduce the deadlock on macOS using test-file-write-stream4.js. I do notice that the test now run much much faster than before (before the V8 upgrade it takes 52s to finish 1000 runs, now it takes 9s). I'll see if I can reproduce with the heapdump tests locally. EDIT: the heapdump test can reproduce the deadlock locally and has almost the same stacktrace as the one produced by test-file-write-stream4.js
@joyeecheung due to the CI constantly failing due to the heapdumps now dying, do you think this could already land in it's current form knowing that it does not yet fully address the problem?
Yeah I think this at least serves as a band-aid to the problem and can avoid a significant subset of the deadlocks for now. I am going to clean up the patch a bit before sending it for review.
Cleaned up the commits a bit and added more comments. Split into multiple commits - the last one would make the deadlocks less common. I kept the previous ones in case the last one caused more problems than it solves and we need to revert it.
Stress test: https://ci.nodejs.org/job/node-stress-single-test/572/
Failed to start CI
β Commits were pushed since the last approving review: β - test: remove deadlock workaround β - src: add more debug logs and comments in NodePlatform β - src: use priority queue to run worker tasks β - src: only block on user blocking worker tasks β Refusing to run CI on potentially unsafe PRhttps://github.com/nodejs/node/actions/runs/14816561459
Codecov Report
Attention: Patch coverage is 51.89873% with 76 lines in your changes missing coverage. Please review.
Project coverage is 90.13%. Comparing base (
723d7bb) to head (3fff8ae). Report is 567 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/node_platform.cc | 49.66% | 67 Missing and 9 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #58047 +/- ##
==========================================
- Coverage 90.17% 90.13% -0.04%
==========================================
Files 630 630
Lines 186473 186581 +108
Branches 36613 36625 +12
==========================================
+ Hits 168160 168184 +24
- Misses 11128 11196 +68
- Partials 7185 7201 +16
| Files with missing lines | Coverage Ξ | |
|---|---|---|
| src/debug_utils.h | 80.00% <ΓΈ> (ΓΈ) |
|
| src/node_platform.h | 83.33% <100.00%> (+23.33%) |
:arrow_up: |
| src/node_platform.cc | 76.24% <49.66%> (-12.36%) |
:arrow_down: |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
CI: https://ci.nodejs.org/job/node-test-pull-request/66576/
CI seems happy. Can I have some review please? @Qard @nodejs/cpp-reviewers
Landed in 6de55f7ffb98d1870f376cda7b5bcb7f76566896...5fb879c4584cc98889f4424cb60e5f37eb1ea4e7
Hmm realized that the commit queue starts counting from the date the PR is opened, not the date the review is open. Anyway landing it now to make the CI greener - if there are any issues we can follow up on it.
@joyeecheung sorry, seems my review was too late, was interrupted a few times. no major findings anyway.
The remaining commits are causing CI failures on v22.x-staging
The remaining commits
@aduh95 did you mean the commits in this PR or some of the commits in particular?
@joyeecheung I meant 3ac8c686a3d5ce7c03592ada8e5c1f90b2aaf985...0d3c6bb09b5c71fa12f9787812300a469589e94d β https://github.com/nodejs/node/commit/eef37d00cb and https://github.com/nodejs/node/commit/df2a36bfcc already found their way into 22.16.0