mpi-operator
Keep Launcher Alive to Retrieve Logs on Error
Hi, I'm a relatively new user. I'm using my own Dockerfile for a project that uses MPICH, and I have been able to deploy the project successfully. However, sometimes when I'm testing feature changes, I get an error that terminates the launcher.
I understand that this project is designed to automatically restart the launcher backoffLimit times, but even when I set backoffLimit to 1, the launcher terminates, so I am unable to retrieve the logs. What is the best way to retrieve the logs in this situation? Is there an option to avoid terminating the launcher once the number of failures has reached the backoff limit?