
Parallel processing locks when the OOM killer comes for a worker

Open • mdekstrand opened this issue 3 years ago • 1 comment

When LensKit is working in parallel (e.g. batch.recommend), and the OOM killer takes out a worker, the parent LensKit process will (sometimes) hang instead of terminating.

We should detect this case and abort the entire evaluation if the pool breaks down.

mdekstrand • Dec 15 '21 00:12

I have tried to reproduce this with processes that invoke os.kill(os.getpid(), 9), and the parent process terminates correctly.

OOM-induced deadlocks in Python multiprocessing seem to be among the bugs fixed in concurrent.futures.ProcessPoolExecutor in Python 3.7 and newer, yet we saw this hang on Python 3.8.
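
For reference, a minimal sketch of that kind of reproduction attempt, assuming a plain concurrent.futures.ProcessPoolExecutor rather than LensKit's own pool wrapper. The names work and run_all are hypothetical, and the "fork" start method is pinned only to keep the sketch self-contained on Linux. It shows the case where the failure is detected: the pool raises BrokenProcessPool and the parent aborts. It does not reproduce the hang itself.

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def work(n):
    # Hypothetical worker: the task with n == 3 kills its own process
    # with SIGKILL (signal 9), standing in for the OOM killer.
    if n == 3:
        os.kill(os.getpid(), signal.SIGKILL)
    return n * n


def run_all(items, workers=2):
    # Assumption: Unix-only "fork" start method, chosen for this sketch.
    ctx = multiprocessing.get_context("fork")
    try:
        with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
            return list(pool.map(work, items))
    except BrokenProcessPool as err:
        # A worker died abruptly; abort the whole run instead of waiting.
        raise RuntimeError("worker process died; aborting evaluation") from err
```

Calling run_all([1, 2]) returns the squared values, while run_all([1, 2, 3, 4]) raises RuntimeError once the pool notices the dead worker. Whatever makes the real OOM case hang is evidently not captured by a self-inflicted SIGKILL like this one.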

mdekstrand • Dec 15 '21 18:12