playwright fix: gracefully close unused workers

This patch proposes a fix for issue #30504: gracefully terminate the worker before exiting the process.

Apr 24 '24 13:04 NoamGaash

Test results for "tests 1"

1 failed :x: [playwright-test] › runner.spec.ts:778:5 › wait for workers to finish before reporter.onEnd

1 flaky

:warning: [playwright-test] › ui-mode-test-watch.spec.ts:145:5 › should watch all

27434 passed, 672 skipped :heavy_check_mark::heavy_check_mark::heavy_check_mark:

Merge workflow run.

Apr 24 '24 14:04 github-actions[bot]

Test results for "tests 1"

27476 passed, 672 skipped :heavy_check_mark::heavy_check_mark::heavy_check_mark:

Merge workflow run.

Apr 25 '24 08:04 github-actions[bot]

Investigation notes:

The dispatcher calls worker.stop() because _isWorkerRedundant(worker) evaluates to true.
The worker is considered redundant because _isWorkerRedundant sees that the _queuedOrRunningHashCount is zero: there are no queued or running jobs, since the worker teardown doesn't count as a job.
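
To make the note above concrete, here is a minimal sketch of how a count-based redundancy check can behave. It is a simplified model, not Playwright's dispatcher code: `DispatcherSketch`, `WorkerSketch`, `addJob` and `finishJob` are hypothetical names, and only the idea of a per-hash count of queued or running jobs comes from the investigation.

```typescript
// Hypothetical model of the redundancy check; not Playwright's actual code.
type WorkerSketch = { hash: string };

class DispatcherSketch {
  // Number of queued or running jobs, keyed by worker hash (assumed shape).
  private queuedOrRunningHashCount = new Map<string, number>();

  addJob(hash: string): void {
    const current = this.queuedOrRunningHashCount.get(hash) ?? 0;
    this.queuedOrRunningHashCount.set(hash, current + 1);
  }

  finishJob(hash: string): void {
    const current = this.queuedOrRunningHashCount.get(hash) ?? 0;
    this.queuedOrRunningHashCount.set(hash, Math.max(0, current - 1));
  }

  // A worker is redundant once no queued or running job needs its hash.
  // Teardown is not tracked as a job, so it never keeps the count above zero.
  isWorkerRedundant(worker: WorkerSketch): boolean {
    return (this.queuedOrRunningHashCount.get(worker.hash) ?? 0) === 0;
  }
}
```

In this model, the worker becomes redundant as soon as its last job finishes, even if its teardown is still pending, which matches the behavior observed in this investigation.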

I'm a little stuck with this investigation, as I can't really figure out how exactly the worker-scoped fixtures are registered.

May 05 '24 07:05 NoamGaash

@NoamGaash I spent some time on the issue, and the fix is not that straightforward. Therefore, I went ahead and prepared a PR myself - https://github.com/microsoft/playwright/pull/30769. Thank you for the PR and investigation!

May 15 '24 17:05 dgozman

@dgozman Thank you so very much! It was a real blocker, and I'm so glad for your help here. Also - it's a great learning opportunity. Do you mind if I ask a few questions to get a better understanding? I'm sure I'll have future opportunities to contribute, and it's inspiring to see the clean code and architecture.

As you said yourself, graceful termination may leave zombie processes, so I thought the right approach to solving this issue would be to investigate why the termination is triggered in the first place. The root cause of the worker termination was the _isWorkerRedundant method, which iterates over all worker slots and checks whether any of them is occupied with the task assigned to the current worker.

It seems that slot.worker.didSendStop() in the _isWorkerRedundant method evaluates to true, so I thought the real problem lies somewhere inside the test runner architecture, with some stop command being sent when the last test is executed. That's why I'm surprised to see that your solution makes the force exit conditional. Is it a temporary solution, or is that the "right thing to do"?

Thanks again, both for responding and solving this issue so quickly, and for Playwright as a whole :smiley_cat:

May 16 '24 06:05 NoamGaash

@NoamGaash There is a difference between a worker stop during normal operation and the case where something went wrong. My change assumes that during a normal worker stop, triggered by _isWorkerRedundant or dispatcher.stop(), worker teardown behaves. However, if something went wrong, upon Ctrl+C we'll disconnect and the worker will force exit.
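
As a rough illustration of that distinction, the sketch below maps each stop trigger to an exit mode. It is only a model of the reasoning in this comment, not the code from the PR: `StopTrigger`, `ExitMode` and `exitModeFor` are hypothetical names introduced here.

```typescript
// Hypothetical model of the two stop paths; not the actual change in the PR.
type StopTrigger = 'redundant-worker' | 'dispatcher-stop' | 'disconnect';
type ExitMode = 'graceful' | 'force';

// Normal stops trust worker teardown to finish on its own.
// A disconnect (for example after Ctrl+C) means something went wrong,
// so the worker force exits instead of waiting.
function exitModeFor(trigger: StopTrigger): ExitMode {
  return trigger === 'disconnect' ? 'force' : 'graceful';
}
```

For example, `exitModeFor('redundant-worker')` yields `'graceful'`, while `exitModeFor('disconnect')` yields `'force'`.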

May 16 '24 15:05 dgozman

I see. Thanks for this clarification!

May 16 '24 21:05 NoamGaash