ipyparallel
More user-friendly errors and automatic restarts in case of engines crashing due to OOM
The errors we report in case of OOM and segmentation faults are now much better, but I was wondering whether there is a way to make them more "user-friendly".
- Currently, at least for the MPI case, we report the mpiexec output, which is great. Could we also report a cleaner error alongside it that clearly identifies this as an OOM error (or a segfault, if possible)?
- Is there something that packages (like Bodo) could do to make this experience better/easier?
- What's the best way to automate restarting engines in this case? Ideally, if enabled, when the engines crash we would clean up the processes, display a message (e.g. "engines crashed due to OOM, restarting engines..."), and then restart the engines.
I think it's hard to do this in general such that it fits in the base class, but Launchers have two relevant methods:
- _log_output, which is called on stop. This is what logs the MPI errors. You can override this in your custom Launcher to do further processing/parsing of the output, changing what's logged by default instead of or in addition to the current MPI output.
- Launcher.on_stop, which allows registering arbitrary stop callbacks (see the example notebook).
If you already have a custom launcher, you can combine these by adding self.on_stop(self.custom_log_message) at the end of .start() to always register your own custom stop handlers.
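As a rough illustration of what such a stop handler could do, here is a minimal sketch that turns an exit code into a friendlier message. It assumes the stop callback receives a dict with an "exit_code" key, and that a SIGKILL (signal 9) on the engine usually points at the OOM killer; both are assumptions to verify against your launcher, and with mpiexec the reported code may belong to mpiexec rather than the engine. The names describe_exit and custom_log_message are hypothetical.

```python
def describe_exit(exit_code):
    """Heuristically map a process exit code to a short, readable cause."""
    if exit_code is None:
        return "engine stopped (no exit code reported)"
    if exit_code == 0:
        return "engine exited normally"
    # subprocess reports death-by-signal as -N; shells and some
    # launchers report it as 128 + N instead.
    if exit_code < 0:
        signum = -exit_code
    elif exit_code > 128:
        signum = exit_code - 128
    else:
        return f"engine exited with code {exit_code}"
    if signum == 9:
        # SIGKILL is what the Linux OOM killer sends, but other things
        # send it too, so this is only a hint.
        return "engine was killed (SIGKILL), possibly by the OOM killer"
    if signum == 11:
        return "engine crashed with a segmentation fault (SIGSEGV)"
    return f"engine terminated by signal {signum}"


def custom_log_message(stop_data):
    """Stop callback: assumes stop_data is a dict with an 'exit_code' key."""
    message = describe_exit(stop_data.get("exit_code"))
    print(message)
    return message
```

In a custom launcher this could be registered with self.on_stop(custom_log_message) at the end of .start(), as described above. Parsing the captured MPI output in an overridden _log_output would likely give a more reliable diagnosis than the exit code alone.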
Thanks @minrk! Will try this out.
@minrk Any feedback on the automatic restart setup?
Sorry, missed that part. Automatic restart could possibly also be achieved through the on_stop callback. The question becomes whether it makes sense to restart the same engine set vs starting a new one. Restarting in-place would probably feel cleaner, but likely would also make debugging more challenging (e.g. losing handles on the logs for the crashed engines). Starting a new engine set is simpler, because you only need to call cluster.start_engines(n).
I think it's reasonable for restart-on-fail to be a built-in feature for Engine[Set]Launcher, but it should be possible now via on_stop.
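A minimal sketch of the "start a new engine set" approach, assuming the stop callback receives a dict with an "exit_code" key. The helper make_restart_callback is hypothetical, and start_engines stands in for whatever callable launches n new engines in your setup (for example, a wrapper around cluster.start_engines(n)); whether that call is safe to make directly from inside a stop callback is something to check.

```python
def make_restart_callback(start_engines, n, max_restarts=3):
    """Build a stop callback that starts a fresh engine set after a crash.

    start_engines: callable taking the number of engines to start
                   (hypothetical stand-in for cluster.start_engines).
    n:             number of engines to start on each restart.
    max_restarts:  cap on restarts, to avoid an endless crash loop.
    """
    state = {"restarts": 0}

    def on_engines_stopped(stop_data):
        exit_code = stop_data.get("exit_code")
        if exit_code == 0:
            # Clean shutdown: nothing to restart.
            return False
        if state["restarts"] >= max_restarts:
            print(f"engines crashed (exit code {exit_code}); restart limit reached")
            return False
        state["restarts"] += 1
        print(f"engines crashed (exit code {exit_code}), restarting engines...")
        start_engines(n)
        return True

    return on_engines_stopped
```

The callback would be registered with on_stop on the engine launcher. Note that this treats every nonzero exit code as a crash, so a real implementation would also need to recognize deliberate stops (such as the user shutting the cluster down) and skip the restart in that case.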
Thanks @minrk! Will try out building restart in a custom launcher. Will also open a separate issue for built-in restart support.
UPDATE: Opened this issue: https://github.com/ipython/ipyparallel/issues/706