pytest-xdist doesn't recover from kernel killing a worker

Open stas00 opened this issue 4 years ago • 4 comments

When the test suite runs in an environment guarded by cgroups resource control (any cloud instance), the kernel sometimes kills one of the pytest workers when the total RSS usage exceeds the allowed quota. xdist notices this and starts a new worker, but the test suite never recovers: all previously existing workers stop working and sit idle, and nothing further happens.

xdist definitely recovers from a test crashing a worker, but that doesn't seem to be the case when the kernel does the killing.

The main issue here is that the whole setup hangs, which is a big problem on CI.
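Until the hang itself is fixed, a hard deadline around the whole run at least keeps a CI job from idling forever. The following is a minimal sketch of that idea, not something xdist provides: `run_with_deadline` is a hypothetical helper, it is POSIX-only, and it assumes that killing the whole process group of the test run is acceptable.

```python
import os
import signal
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed at the deadline.

    Hypothetical helper, POSIX only. The child is started in its own session
    so that it and any workers it spawned can be killed as one process group.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # start_new_session=True makes the child a group leader, so its
        # process group id equals its pid.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return None


if __name__ == "__main__":
    # Stand-in for a hung test run: a child that sleeps far past the deadline.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(hung, timeout_s=1.0))  # None: killed at the deadline
```

In a real job the command would be the pytest invocation itself, for example `[sys.executable, "-m", "pytest", "-n", "4"]`, with a deadline chosen well above the normal run time. This only bounds the hang; it does not make the run recover.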

Thank you!

May 07 '21 16:05 stas00

@stas00, how did you discover that the kernel was killing your pytest workers in your situation?

I am encountering a similar scenario. I am running pytest-xdist on Kubernetes via GitLab CI, and noticed that a small portion of pytest-xdist runs cause my pipelines to hang until they time out. I also noticed a correlation between this idling behavior and memory usage plateauing within my Kubernetes pods, so I suspect it has to do with memory usage.

Feb 07 '22 23:02 hughhan1

The process is documented here: https://github.com/huggingface/transformers/issues/11408
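Separately from the linked write-up, on a cgroup v2 system the kernel keeps an `oom_kill` counter in the cgroup's `memory.events` file, so comparing it before and after a run is one way to confirm an OOM kill. A minimal sketch follows; the helper names are hypothetical, and the default path is an assumption that holds inside many containers but not on every host (on cgroup v1, or outside a container, the file lives elsewhere).

```python
from pathlib import Path


def parse_memory_events(text):
    """Parse the 'name value' lines of a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kill_count(path="/sys/fs/cgroup/memory.events"):
    """Return the oom_kill counter, or None if it cannot be determined.

    The default path is an assumption; adjust it to where the cgroup of the
    test run is mounted on your system.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    return parse_memory_events(text).get("oom_kill")
```

Calling `oom_kill_count()` at the start and end of a CI job and comparing the two values shows whether the kernel killed a process in that cgroup in between. It does not say which process was killed; the kernel log has that detail.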

Feb 08 '22 00:02 stas00

I think we're running into this same problem when running the scipy test suite with dev.py ... --parallel N (with N > 1) in a Slurm job environment in which cgroups are used (see also https://github.com/easybuilders/easybuild-easyblocks/pull/2980). There's no problem when using --parallel 1, or not using --parallel at all, both of which imply not using pytest-xdist.

Aug 08 '23 06:08 boegel

It may be necessary to add some type of regular ping to execnet to account for OOM kills.

This ought to show up as a closed pipe, unless the fd isn't closed by the OOM kill as well.
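The closed-pipe expectation can be checked in isolation. The sketch below uses plain `subprocess` pipes rather than execnet, and a SIGKILL sent by the parent as a stand-in for the kernel's OOM killer, so it only shows the general POSIX behavior; whether execnet's channels surface it the same way is not established here. `kill_and_observe` is a hypothetical name.

```python
import signal
import subprocess
import sys


def kill_and_observe():
    """SIGKILL a child and report what the parent sees on the pipe."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        stdout=subprocess.PIPE,
    )
    child.send_signal(signal.SIGKILL)  # stand-in for an OOM kill
    # read() returns once every write end of the pipe is closed; the kernel
    # closes the killed child's descriptors as part of tearing it down.
    data = child.stdout.read()
    returncode = child.wait()
    child.stdout.close()
    return data, returncode
```

Here the read ends with EOF (empty bytes) and the return code is the negated signal number. If another process had inherited the write end of the pipe, the read would keep blocking after the kill, which is the case where a periodic ping would still be needed.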

Aug 08 '23 06:08 RonnyPfannschmidt
