fix: gracefully close unused workers
This patch proposes a fix for issue #30504: gracefully terminate the worker before exiting the process.
Test results for "tests 1"
1 failed
:x: [playwright-test] › runner.spec.ts:778:5 › wait for workers to finish before reporter.onEnd
1 flaky
:warning: [playwright-test] › ui-mode-test-watch.spec.ts:145:5 › should watch all
27434 passed, 672 skipped :heavy_check_mark::heavy_check_mark::heavy_check_mark:
Merge workflow run.
Test results for "tests 1"
27476 passed, 672 skipped :heavy_check_mark::heavy_check_mark::heavy_check_mark:
Merge workflow run.
Investigation notes:
- the dispatcher calls worker.stop() because _isWorkerRedundant(worker) evaluates as true.
- the worker is considered to be redundant because _isWorkerRedundant sees that the _queuedOrRunningHashCount is zero
- there are no queued or running jobs, as the worker teardown doesn't count as a job
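To make the notes above concrete, here is a minimal sketch of how a per-hash job counter could lead to a worker being marked redundant while its teardown is still pending. The `SketchDispatcher` class, the `SketchWorker` type, and the `addJob`/`finishJob` helpers are hypothetical names made up for illustration; this is not Playwright's actual dispatcher code, only a model of the behavior observed.

```typescript
// Hypothetical model of the behavior described in the notes; not Playwright's real code.
interface SketchWorker {
  hash: string; // identifies the worker configuration that jobs are matched against
}

class SketchDispatcher {
  // Number of queued or running jobs per worker hash.
  private queuedOrRunningHashCount = new Map<string, number>();

  addJob(hash: string): void {
    const count = this.queuedOrRunningHashCount.get(hash) ?? 0;
    this.queuedOrRunningHashCount.set(hash, count + 1);
  }

  finishJob(hash: string): void {
    const count = this.queuedOrRunningHashCount.get(hash) ?? 0;
    this.queuedOrRunningHashCount.set(hash, Math.max(0, count - 1));
  }

  // Worker teardown is not counted as a job in this model,
  // so the counter can reach zero while teardown is still pending.
  isWorkerRedundant(worker: SketchWorker): boolean {
    return (this.queuedOrRunningHashCount.get(worker.hash) ?? 0) === 0;
  }
}

const dispatcher = new SketchDispatcher();
const worker: SketchWorker = { hash: 'project-a' };

dispatcher.addJob(worker.hash);
const redundantWhileRunning = dispatcher.isWorkerRedundant(worker);

dispatcher.finishJob(worker.hash);
const redundantAfterLastJob = dispatcher.isWorkerRedundant(worker);
```

In this model the worker becomes redundant the moment its last job finishes, which matches the observation that teardown does not keep it alive.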
I'm a little stuck with this investigation, as I can't really figure out how exactly the worker-scoped fixtures are registered.
@NoamGaash I spent some time on the issue, and the fix is not that straightforward. Therefore, I went ahead and prepared a PR myself - https://github.com/microsoft/playwright/pull/30769. Thank you for the PR and investigation!
@dgozman Thank you so much! This was a real blocker, and I'm very glad for your help here. It's also a great learning opportunity. Do you mind if I ask a few questions to get a better understanding? I'm sure I'll have future opportunities to contribute, and it's inspiring to see the clean code and architecture.
As you said yourself, graceful termination may leave zombie processes, so I thought the right approach would be to investigate why termination is called in the first place. The root cause of the worker termination was the _isWorkerRedundant method, which iterates over all worker slots and checks whether any of them is occupied with the task assigned to the current worker.
It seems that slot.worker.didSendStop() inside the _isWorkerRedundant method evaluates to true, so I thought the real problem lies somewhere in the test runner architecture, where a stop command is being sent when the last test is executed. That's why I'm surprised to see that your solution makes the force exit conditional. Is it a temporary solution, or is that the "right thing to do"?
Thanks again, both for responding and solving this issue so quickly, and for Playwright as a whole :smiley_cat:
@NoamGaash There is a difference between a worker stop during normal operation and the case where something went wrong. My change assumes that during a normal worker stop, triggered by _isWorkerRedundant or dispatcher.stop(), the worker teardown behaves. However, if something went wrong, upon Ctrl+C we'll disconnect and the worker will force exit.
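The distinction could be sketched roughly as follows. `StopReason`, `StopPlan`, and `planWorkerStop` are hypothetical names used only for illustration, and skipping the wait for teardown on interrupt is an assumption of this sketch; the actual change is in the linked PR.

```typescript
// Hypothetical sketch of the two stop paths; names are illustrative, not Playwright's API.
type StopReason = 'redundant' | 'dispatcher-stop' | 'interrupted';

interface StopPlan {
  waitForTeardown: boolean; // let worker teardown run to completion
  forceExit: boolean;       // exit the worker process without waiting
}

function planWorkerStop(reason: StopReason): StopPlan {
  if (reason === 'interrupted') {
    // Something went wrong (for example Ctrl+C): disconnect and force exit.
    // Assumption: teardown is not awaited on this path.
    return { waitForTeardown: false, forceExit: true };
  }
  // Normal stop: trust the worker teardown to behave and let it finish.
  return { waitForTeardown: true, forceExit: false };
}

const normalStop = planWorkerStop('redundant');
const dispatcherStop = planWorkerStop('dispatcher-stop');
const interruptedStop = planWorkerStop('interrupted');
```

The design choice here is that force exit is reserved for the abnormal path, so a normal stop never cuts teardown short.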
I see. Thanks for this clarification!