webknossos-libs
Poll PIDs of child processes to detect OOM killing
When using multiprocessing (via clustertools), child processes can be killed due to OOM. If the OOM killer kills the children, the parent may wait forever (sometimes a broken process pool exception is raised, but we've seen cases where this does not happen). A workaround could be to regularly poll for the child PIDs and emit a warning if they are no longer running. A hard timeout might also make sense (where the main process exits if the timeout is exceeded).
/edit: `multiprocessing.active_children()` might be helpful.
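A minimal sketch of the polling idea, assuming the parent keeps a set of worker PIDs that it expects to still be running. The helper name `find_vanished_children` and its `list_children` parameter are hypothetical and not part of clustertools; only `multiprocessing.active_children()` is a real standard-library call.

```python
import logging
import multiprocessing

logger = logging.getLogger(__name__)


def find_vanished_children(expected_pids, list_children=multiprocessing.active_children):
    """Return the expected PIDs that are no longer among the live children.

    `list_children` is injectable (hypothetical parameter) so the check can be
    exercised without spawning real processes.
    """
    # active_children() only lists children that are still alive.
    alive_pids = {child.pid for child in list_children()}
    vanished = set(expected_pids) - alive_pids
    for pid in sorted(vanished):
        logger.warning(
            "Child process %d is no longer running (possibly OOM-killed).", pid
        )
    return vanished
```

Note that a worker that finished normally also disappears from `multiprocessing.active_children()`, so the caller should only pass PIDs of workers that still have outstanding work.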
I agree that a hard timeout makes sense. This way we can get a job notification in Slack (I'm not sure how easy or reasonable it would be to also send the warning via Slack without a hard kill).
I think a hard timeout would be hard to define. There can be valid cases for jobs that go on for days.
The timeout would only kick in if a multiprocessing executor awaits its children and no child PIDs exist anymore. In that case, awaiting them is doomed to fail. Strictly speaking, a timeout wouldn't even be necessary in my opinion.
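This could look roughly like the following sketch: wait on a future in short intervals and give up only once no child processes are left. The function name `result_unless_children_gone` and its parameters are hypothetical; it assumes the awaited work runs in processes that `multiprocessing.active_children()` can see, and that the parent has no unrelated children (e.g. a manager process) that would mask the condition.

```python
import concurrent.futures
import multiprocessing


def result_unless_children_gone(
    future,
    poll_interval=5.0,
    list_children=multiprocessing.active_children,
):
    """Wait for `future`, but fail fast once no child process is alive.

    No overall deadline is imposed: long-running jobs keep waiting for as
    long as at least one child exists.
    """
    while True:
        try:
            return future.result(timeout=poll_interval)
        except concurrent.futures.TimeoutError:
            # The future is still pending. If nobody is left to complete it,
            # waiting any longer is pointless.
            if not list_children():
                raise RuntimeError(
                    "No child processes are alive, but the future is still pending."
                )
```

Because the check only fires when the list of children is empty, jobs that legitimately run for days are not affected.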
Update: When creating https://github.com/scalableminds/webknossos-libs/pull/739, I noticed that OOMs typically end in a `BrokenProcessPool` exception. That exception can be (and will be) handled by the resumable executor in the upcoming version of vx. I think older Python versions were sometimes unable to catch the broken process pool (ending up in a hanging state), but I haven't seen this for quite a while (the last time was https://github.com/scalableminds/webknossos-libs/issues/539, but I'm not sure which Python version was used there). So, I'd defer this issue for now until we see cases where the `BrokenProcessPool` exception is not triggered.
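For illustration only, here is a generic sketch of handling `BrokenProcessPool` by restarting the pool and resubmitting unfinished items. It is not the resumable executor from vx, whose implementation is not shown in this thread; `map_with_pool_restart`, `make_executor`, and `max_restarts` are hypothetical names.

```python
import logging
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

logger = logging.getLogger(__name__)


def map_with_pool_restart(fn, items, make_executor=ProcessPoolExecutor, max_restarts=2):
    """Apply `fn` to each item, recreating the pool if it breaks."""
    items = list(items)
    results = {}
    pending = list(range(len(items)))
    restarts = 0
    while pending:
        try:
            with make_executor() as executor:
                futures = {i: executor.submit(fn, items[i]) for i in pending}
                for i, future in futures.items():
                    results[i] = future.result()
        except BrokenProcessPool:
            # Typically raised when a worker dies abruptly, e.g. after SIGKILL.
            restarts += 1
            if restarts > max_restarts:
                raise
            logger.warning("Process pool broke, restarting (attempt %d).", restarts)
        # Keep results collected before the break; retry only the rest.
        pending = [i for i in pending if i not in results]
    return [results[i] for i in range(len(items))]
```

If the OOM is deterministic for a given item, restarting will not help, which is why the sketch re-raises after `max_restarts` attempts.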