pytest-xdist
pytest-xdist doesn't recover from kernel killing a worker
When the test suite runs in an environment guarded by cgroups resource control (as on any cloud instance), the kernel sometimes kills one of the pytest workers when total RSS usage exceeds the allowed quota. xdist notices this and starts a new worker, but the test suite never recovers: all previously existing workers stop working and sit idle, and nothing further happens.
xdist definitely recovers from a test crashing a worker, but that doesn't seem to be the case when the kernel does the killing.
The main issue here is that the whole setup hangs, which is a big problem on CI.
Thank you!
@stas00, how did you discover that the kernel was killing your pytest workers in your situation?
I am encountering a similar scenario. I am running pytest-xdist on Kubernetes via GitLab CI, and noticed that a small portion of pytest-xdist runs cause my pipelines to hang indefinitely until they time out. I also noticed a correlation between this idling behavior and plateaued memory usage within my Kubernetes pods, so I suspect it has to do with memory usage.
The process is documented here: https://github.com/huggingface/transformers/issues/11408
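For anyone who wants a quick check from inside the job itself, here is a minimal sketch that reads the kernel's OOM-kill counter. It assumes cgroup v2 and that the job's own cgroup is visible at /sys/fs/cgroup/memory.events, which is common inside containers but not guaranteed; on other setups the path will differ, and the helper names are made up for illustration. It is not necessarily the method described in the linked issue.

```python
from pathlib import Path


def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def oom_kill_count(path="/sys/fs/cgroup/memory.events"):
    """Return the oom_kill counter, or None if the file cannot be read.

    The default path is an assumption (cgroup v2, own cgroup mounted there).
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    return parse_memory_events(text).get("oom_kill")


if __name__ == "__main__":
    print("oom_kill events:", oom_kill_count())
```

If the counter goes up between the start and the end of a hung run, that is a strong hint that a process in the cgroup was OOM-killed.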
I think we're running into this same problem when running the scipy test suite with dev.py ... --parallel N (with N > 1) in a Slurm job environment in which cgroups are used (see also https://github.com/easybuilders/easybuild-easyblocks/pull/2980).
There's no problem when using --parallel 1, or not using --parallel at all, both of which imply not using pytest-xdist.
It may be necessary to add some type of regular ping to execnet to account for OOM kills.
This ought to show up as a closed pipe, unless the fd isn't closed by the OOM kill as well.
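To illustrate the closed-pipe expectation, here is a small standalone sketch (plain subprocess on POSIX, not execnet or xdist internals): the parent holds the read end of a pipe to a child, sends the child SIGKILL the way the OOM killer would, and then sees end-of-file on the pipe. The function name is made up for illustration.

```python
import os
import signal
import subprocess
import sys


def kill_and_observe():
    """SIGKILL a child and report what the parent sees on its pipe."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        stdout=subprocess.PIPE,
    )
    # Simulate the OOM killer: SIGKILL cannot be caught or handled.
    os.kill(child.pid, signal.SIGKILL)
    # read() blocks until every write end of the pipe is closed; the
    # kernel closes the child's descriptors when the process dies.
    data = child.stdout.read()
    returncode = child.wait()
    child.stdout.close()
    return data, returncode


if __name__ == "__main__":
    data, returncode = kill_and_observe()
    print("data:", data, "returncode:", returncode)
```

The caveat is that end-of-file only arrives once all holders of the write end are gone. If the killed worker had forked helpers that inherited the descriptor and survived, the parent would keep waiting, which is one situation where a periodic ping with a timeout would still be needed.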