webknossos-libs

Poll PIDs of child processes to detect OOM killing

Open philippotto opened this issue 3 years ago • 4 comments

When using multiprocessing (via clustertools), child processes can be killed due to OOM. If the OOM killer kills the children, the parent potentially waits forever (sometimes a BrokenProcessPool exception is raised, but we've seen cases where this does not happen). A workaround could be to regularly poll the child PIDs and emit a warning if they are no longer running. Maybe a hard timeout would also make sense (where the main process exits if the timeout is exceeded).

/edit: multiprocessing.active_children() might be helpful.
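
A minimal sketch of the polling idea, assuming the parent remembers the PIDs of the workers it started. The helper names `missing_child_pids` and `warn_if_children_vanished` are made up for illustration and are not part of clustertools:

```python
import multiprocessing
import warnings


def missing_child_pids(expected_pids):
    """Return the PIDs from expected_pids that are no longer live children."""
    # active_children() returns the live child Process objects of the current
    # process (as a side effect it also joins children that already finished).
    alive = {child.pid for child in multiprocessing.active_children()}
    return set(expected_pids) - alive


def warn_if_children_vanished(expected_pids):
    """Emit a warning if any expected child is gone; return the missing PIDs."""
    missing = missing_child_pids(expected_pids)
    if missing:
        warnings.warn(f"Child processes no longer running: {sorted(missing)}")
    return missing
```

Note that a child which exited normally also counts as missing here, so this check is only meaningful while the parent is still waiting for results.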

philippotto, Jan 07 '22 08:01

I agree that a hard timeout makes sense. That way we can get a job notification in Slack (I'm not sure how easy or reasonable it would be to also send the warning via Slack, without the hard kill).

fm3, Jan 07 '22 08:01

I think a hard timeout would be hard to define. There can be valid cases for jobs that go on for days.

normanrz, Apr 06 '22 15:04

> I think a hard timeout would be hard to define. There can be valid cases for jobs that go on for days.

The timeout would only kick in if a multiprocessing executor awaits its children and no child PIDs exist anymore. In that case, awaiting them is doomed to fail. Strictly speaking, a timeout wouldn't even be necessary in my opinion.
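
Roughly what I have in mind, as a sketch only: wait on the future in short slices and give up as soon as no child processes are left. The function name and the `has_workers` hook are hypothetical (the hook just makes the check swappable); nothing here is existing clustertools API:

```python
import concurrent.futures
import multiprocessing


def result_or_fail_if_workers_gone(future, poll_interval=1.0, has_workers=None):
    """Wait for future, but fail fast once no child processes remain."""
    if has_workers is None:
        # Assumption: the executor's workers are children of this process.
        def has_workers():
            return len(multiprocessing.active_children()) > 0

    while True:
        try:
            # Wait for one short slice instead of blocking indefinitely.
            return future.result(timeout=poll_interval)
        except concurrent.futures.TimeoutError:
            workers_alive = has_workers()
            if future.done():
                # Finished in the meantime, or the task itself raised a
                # timeout error; let result() return or re-raise it.
                return future.result()
            if not workers_alive:
                raise RuntimeError(
                    "No child processes left while a result is still pending"
                )
```

Long-running jobs are unaffected, because the loop only aborts when the workers are gone, not after a fixed duration.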

philippotto, Apr 06 '22 16:04

Update: When creating https://github.com/scalableminds/webknossos-libs/pull/739, I noticed that OOMs typically end in a BrokenProcessPool exception. That exception can be (and will be) handled by the resumable executor in the upcoming version of vx. I think older Python versions were sometimes unable to catch the broken process pool (ending up in a hanging state), but I haven't seen this for quite a while (the last time was https://github.com/scalableminds/webknossos-libs/issues/539, but I'm not sure which Python version was used there). So I'd defer this issue until we see cases where the BrokenProcessPool exception is not triggered.
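
For illustration, catching the exception with the standard library looks roughly like this. This is a simplified sketch and not the resumable executor from vx: it reruns all items in a fresh pool instead of resuming, and `run_with_fresh_pool_on_break` and `make_executor` are hypothetical names:

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def run_with_fresh_pool_on_break(fn, items, max_attempts=3,
                                 make_executor=ProcessPoolExecutor):
    """Map fn over items, retrying in a new pool if the pool breaks."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            with make_executor() as executor:
                return list(executor.map(fn, items))
        except BrokenProcessPool:
            # Raised when a worker died abruptly, e.g. after an OOM kill.
            if attempt == max_attempts:
                raise
```

When using the default `ProcessPoolExecutor`, `fn` has to be picklable (a module-level function), and on platforms that spawn workers the call should sit under an `if __name__ == "__main__":` guard.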

philippotto, May 31 '22 13:05