node icon indicating copy to clipboard operation
node copied to clipboard

src: only block on user blocking worker tasks

Open joyeecheung opened this issue 7 months ago β€’ 15 comments

test: remove deadlock workaround

src: add more debug logs and comments in NodePlatform

src: use priority queue to run worker tasks

According to the documentation, the v8 tasks should be executed based on priority. Previously we always execute the tasks in FIFO order, this changes the NodePlatform implementation to execute the higher priority tasks first. The tasks used to schedule timers for the delayed tasks are run in FIFO order since priority is irrelavent for the timer scheduling part while the tasks unwrapped by the timer callbacks are still ordered by priority.

src: only block on user blocking worker tasks

we should not be blocking on the worker tasks on the main thread in one go. Doing so leads to two problems:

  1. If any of the worker tasks post another foreground task and wait for it to complete, and that foreground task is posted right after we flush the foreground task queue and before the foreground thread goes into sleep, we'll never be able to wake up to execute that foreground task and in turn the worker task will never complete, and we have a deadlock.
  2. Worker tasks can be posted from any thread, not necessarily associated with the current isolate, and we can be blocking on a worker task that is associated with a completely unrelated isolate in the event loop. This is suboptimal.

However, not blocking on the worker tasks at all can lead to loss of some critical user-blocking worker tasks e.g. wasm async compilation tasks, which should block the main thread until they are completed, as the documentation suggets. As a compromise, we currently only block on user-blocking tasks to reduce the chance of deadlocks while making sure that criticl user-blocking tasks are not lost.

Refs: https://github.com/nodejs/node/pull/47452 Refs: https://github.com/nodejs/node/issues/54918

joyeecheung avatar Apr 27 '25 00:04 joyeecheung

This seems to fix https://github.com/nodejs/node/issues/54918 - I started two stress test jobs before opening this PR:

  • With just the test changes that reverted the workarounds: https://ci.nodejs.org/job/node-stress-single-test/565/ ❌
  • With the changes in this PR and the test workaround reverts: https://ci.nodejs.org/job/node-stress-single-test/566/ βœ…

And the CI seems to be happy about it: https://ci.nodejs.org/job/node-test-commit/79439/

I am being somewhat lazy here to achieve the "only block on user blocking worker tasks" but only counting them as outstanding tasks. Need another look to make sure I am not throwing other tasks into the queue in the delayed scheduler. Also I am just speculating that worker tasks that would post another foreground task shouldn't be kUserBlocking, or at least I haven't found any so far (the ones that are causing the deadlocks that I can reproduce are just kUserVisible GC tasks). May need to confirm this.

joyeecheung avatar Apr 27 '25 00:04 joyeecheung

CI: https://ci.nodejs.org/job/node-test-pull-request/66479/

nodejs-github-bot avatar Apr 27 '25 01:04 nodejs-github-bot

I confirm that this patch fixes the issue on my machine.

lpinca avatar Apr 27 '25 05:04 lpinca

I tested this with https://github.com/nodejs/node/pull/58070 which is also seeing a deadlock in the heapdump tests. The good news is that this is suffice to fix the deadlock by default. The bad news is that --concurrent-maglev-high-priority-threads can enqueue a background task that is kUserBlocking and block on foreground tasks, which can trigger the deadlock again (although, that is an opt-in flag, and likely only intended for testing, but that also means this kind of tasks are not off the table).

I think at the end of the day, the ideal solution should still avoid doing a blocking drain of any worker tasks on the main thread (but this requires some other tinkering to make sure that user blocking tasks block the event loop from exiting to prevent important tasks from being dropped), or re-architect the blocking mechanism to allow foreground task posted by worker tasks to wake the main thread up and run foreground tasks (maybe it's simple enough to implement, maybe it needs extra locks which can also incur other problems, I don't know for sure yet). But this fix might still be good enough for now since user blocking background tasks are rare, or at least it does make things better than how things currently are.

joyeecheung avatar Apr 30 '25 22:04 joyeecheung

Alternative fix which seems to also fix the user-blocking background tasks locally when built together with https://github.com/nodejs/node/pull/58070: https://github.com/joyeecheung/node/tree/fix-deadlock-2

Starting a stress test here: https://ci.nodejs.org/job/node-stress-single-test/569 CI with https://github.com/nodejs/node/pull/58070: https://ci.nodejs.org/job/node-test-commit/79519/

joyeecheung avatar May 01 '25 22:05 joyeecheung

It seems to introduce timeouts in other tests.

targos avatar May 02 '25 07:05 targos

CI: https://ci.nodejs.org/job/node-test-pull-request/66549/

nodejs-github-bot avatar May 02 '25 17:05 nodejs-github-bot

Trying to see if it still works after the V8 upgrade. On the other hand it seems I can no longer reproduce the deadlock on macOS using test-file-write-stream4.js. I do notice that the test now run much much faster than before (before the V8 upgrade it takes 52s to finish 1000 runs, now it takes 9s). I'll see if I can reproduce with the heapdump tests locally. EDIT: the heapdump test can reproduce the deadlock locally and has almost the same stacktrace as the one produced by test-file-write-stream4.js

joyeecheung avatar May 02 '25 18:05 joyeecheung

@joyeecheung due to the CI constantly failing due to the heapdumps now dying, do you think this could already land in it's current form knowing that it does not yet fully address the problem?

BridgeAR avatar May 03 '25 17:05 BridgeAR

Yeah I think this at least serves as a band-aid to the problem and can avoid a significant subset of the deadlocks for now. I am going to clean up the patch a bit before sending it for review.

joyeecheung avatar May 03 '25 19:05 joyeecheung

Cleaned up the commits a bit and added more comments. Split into multiple commits - the last one would make the deadlocks less common. I kept the previous ones in case the last one caused more problems than it solves and we need to revert it.

joyeecheung avatar May 04 '25 01:05 joyeecheung

Stress test: https://ci.nodejs.org/job/node-stress-single-test/572/

joyeecheung avatar May 04 '25 01:05 joyeecheung

Failed to start CI
   ⚠  Commits were pushed since the last approving review:
   ⚠  - test: remove deadlock workaround
   ⚠  - src: add more debug logs and comments in NodePlatform
   ⚠  - src: use priority queue to run worker tasks
   ⚠  - src: only block on user blocking worker tasks
   ✘  Refusing to run CI on potentially unsafe PR
https://github.com/nodejs/node/actions/runs/14816561459

github-actions[bot] avatar May 04 '25 01:05 github-actions[bot]

Codecov Report

Attention: Patch coverage is 51.89873% with 76 lines in your changes missing coverage. Please review.

Project coverage is 90.13%. Comparing base (723d7bb) to head (3fff8ae). Report is 567 commits behind head on main.

Files with missing lines Patch % Lines
src/node_platform.cc 49.66% 67 Missing and 9 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #58047      +/-   ##
==========================================
- Coverage   90.17%   90.13%   -0.04%     
==========================================
  Files         630      630              
  Lines      186473   186581     +108     
  Branches    36613    36625      +12     
==========================================
+ Hits       168160   168184      +24     
- Misses      11128    11196      +68     
- Partials     7185     7201      +16     
Files with missing lines Coverage Ξ”
src/debug_utils.h 80.00% <ΓΈ> (ΓΈ)
src/node_platform.h 83.33% <100.00%> (+23.33%) :arrow_up:
src/node_platform.cc 76.24% <49.66%> (-12.36%) :arrow_down:

... and 28 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar May 04 '25 02:05 codecov[bot]

CI: https://ci.nodejs.org/job/node-test-pull-request/66576/

nodejs-github-bot avatar May 04 '25 03:05 nodejs-github-bot

CI seems happy. Can I have some review please? @Qard @nodejs/cpp-reviewers

joyeecheung avatar May 04 '25 15:05 joyeecheung

Landed in 6de55f7ffb98d1870f376cda7b5bcb7f76566896...5fb879c4584cc98889f4424cb60e5f37eb1ea4e7

nodejs-github-bot avatar May 04 '25 16:05 nodejs-github-bot

Hmm realized that the commit queue starts counting from the date the PR is opened, not the date the review is open. Anyway landing it now to make the CI greener - if there are any issues we can follow up on it.

joyeecheung avatar May 04 '25 16:05 joyeecheung

@joyeecheung sorry, seems my review was too late, was interrupted a few times. no major findings anyway.

Flarna avatar May 05 '25 06:05 Flarna

The remaining commits are causing CI failures on v22.x-staging

aduh95 avatar Jul 24 '25 18:07 aduh95

The remaining commits

@aduh95 did you mean the commits in this PR or some of the commits in particular?

joyeecheung avatar Jul 25 '25 16:07 joyeecheung

@joyeecheung I meant 3ac8c686a3d5ce7c03592ada8e5c1f90b2aaf985...0d3c6bb09b5c71fa12f9787812300a469589e94d – https://github.com/nodejs/node/commit/eef37d00cb and https://github.com/nodejs/node/commit/df2a36bfcc already found their way into 22.16.0

aduh95 avatar Jul 25 '25 17:07 aduh95