mpi-operator
Can't access launcher logs after MPIJob fails
The launcher Job deletes all the pods after the number of retries exceeds backoffLimit or the total time exceeds activeDeadlineSeconds. Users can't access the error logs after the MPIJob fails.
It's the Kubernetes batch Job's behavior to delete all the pods (and with them their logs) when the Job fails. Links to the source code:
https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/pkg/controller/job/job_controller.go#L520
https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/pkg/controller/job/job_controller.go#L602
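For reference, the same behavior can be reproduced with a standalone batch Job; here's a minimal sketch (the name, image, and failing command are just placeholders):

```yaml
# A Job that always fails. With restartPolicy: OnFailure, the container
# restarts in place; once backoffLimit is exceeded, the Job controller
# deletes the pod, and its logs disappear with it.
apiVersion: batch/v1
kind: Job
metadata:
  name: always-fails
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo 'some error output'; exit 1"]
```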
Any advice?
@czheng94 Do you mean "cannot access worker logs" instead of "launcher logs"? There will still be logs for workers if kube batch is not used, right?
@terrytangyuan I meant launcher logs. mpirun doesn't emit any logs in the workers. The key is that the failed launcher pod will be deleted by the Kubernetes batch Job after it's done retrying.
If that's how the Kubernetes batch Job works now, then I don't think there's a way to get around it, since the Job API doesn't expose this option. I would recommend looking into persisting your logs to disk using a logging sidecar.
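Something along these lines, as a rough sketch (the images, paths, and PVC name are placeholders, and the launcher command would need to tee its output to the shared file):

```yaml
# Launcher pod template with a logging sidecar: the launcher tees its
# output to a shared emptyDir, and the sidecar streams it to a
# PersistentVolumeClaim that survives pod deletion.
spec:
  containers:
  - name: launcher
    image: my-mpi-image   # placeholder
    command: ["sh", "-c", "mpirun ... 2>&1 | tee /logs/launcher.log"]
    volumeMounts:
    - name: logs
      mountPath: /logs
  - name: log-persister
    image: busybox        # placeholder
    command: ["sh", "-c", "touch /logs/launcher.log; tail -f /logs/launcher.log > /persist/launcher.log"]
    volumeMounts:
    - name: logs
      mountPath: /logs
    - name: persist
      mountPath: /persist
  volumes:
  - name: logs
    emptyDir: {}
  - name: persist
    persistentVolumeClaim:
      claimName: launcher-logs   # placeholder PVC
```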
Indeed this is expected behavior of the Kubernetes batch Job. If you set restartPolicy = "OnFailure" in the launcher pod template, all pods will be terminated and deleted once the backoff limit (set to 6 by default by mpi-operator) has been reached. https://github.com/kubernetes/website/pull/14709/files
A workaround is to set restartPolicy = "Never" in the launcher pod template. This will result in N+1 launcher pods remaining in the Error state after N backoff retries.
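Applied to an MPIJob, that would look roughly like this (a sketch against the kubeflow.org/v1 API; the field layout and images are assumptions and may differ across operator versions):

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: example
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: Never   # keep failed launcher pods around for inspection
          containers:
          - name: launcher
            image: my-mpi-image  # placeholder
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: my-mpi-image  # placeholder
```

The failed launcher pods can then be inspected with kubectl logs before being cleaned up manually.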
@terrytangyuan What was the motivation for using a batch Job as the launcher?
cc @rongou to chime in on the original motivation for using a batch Job here.
The launcher's behavior is similar to a batch Job's. It looks like the log issue is with the batch Job itself: if you run a standalone batch Job, you'd run into the same problem. Maybe file a bug with Kubernetes?