mpi-operator Can't access launcher logs after MpiJob fails

The launcher Job deletes all of its pods once the number of retries exceeds backoffLimit or the total run time exceeds activeDeadlineSeconds. As a result, users can't access the error logs after the MPIJob has failed.

This is the Kubernetes batch Job's behavior: when the Job fails, the controller deletes the Job's active pods, and their logs go with them. Links to the source code: https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/pkg/controller/job/job_controller.go#L520 https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/pkg/controller/job/job_controller.go#L602

Any advice?

Nov 25 '19 00:11 czheng94

@czheng94 Do you mean "cannot access worker logs" instead of "launcher logs"? There will still be logs for workers if kube batch is not used, right?

Nov 25 '19 18:11 terrytangyuan

@terrytangyuan I meant launcher logs. mpirun doesn't emit any logs in the workers. The key point is that the failed launcher pod is deleted by the Kubernetes batch Job once it is done retrying.

Nov 25 '19 19:11 czheng94

If that's how kube batch works now, then I don't think there's a way around it, since kube batch doesn't expose this option. I would recommend persisting your logs to disk using a logging sidecar service.

Nov 25 '19 19:11 terrytangyuan

Indeed, this is the expected behavior of a Kubernetes batch Job. If you set restartPolicy = "OnFailure" in the launcher pod template, all pods are terminated and deleted once the backoff limit (set to 6 by default by mpi-operator) has been reached. https://github.com/kubernetes/website/pull/14709/files

A workaround is to set restartPolicy = "Never" in the launcher pod template. This leaves N+1 launcher pods in the Error state after N backoff retries, so their logs stay accessible.
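
For anyone who wants to try the workaround outside of mpi-operator, here is a minimal sketch that builds a standalone batch/v1 Job manifest with restartPolicy set to "Never" and prints it as JSON. It is not the MPIJob spec itself. The Job name, image, and failing command are made up for illustration, and the backoff limit of 6 just mirrors the default mentioned above.

```python
import json

# Hypothetical values for illustration only; none of these come from mpi-operator.
JOB_NAME = "launcher-demo"
IMAGE = "busybox"


def build_job_manifest(restart_policy="Never", backoff_limit=6):
    """Return a plain batch/v1 Job manifest as a dict.

    With restartPolicy "Never", each retry runs in a new pod and the failed
    pods are left behind, so their logs can still be read afterwards.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": JOB_NAME},
        "spec": {
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    "restartPolicy": restart_policy,
                    "containers": [
                        {
                            "name": "launcher",
                            "image": IMAGE,
                            # Stand-in for a failing mpirun: write to stderr, exit non-zero.
                            "command": [
                                "sh",
                                "-c",
                                "echo 'simulated launcher failure' >&2; exit 1",
                            ],
                        }
                    ],
                }
            },
        },
    }


manifest = build_job_manifest()
print(json.dumps(manifest, indent=2))
```

Piping the output into `kubectl apply -f -` should create the Job. Once it has failed, the leftover pods can be listed with `kubectl get pods --selector=job-name=launcher-demo` and inspected with `kubectl logs <pod-name>`.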

@terrytangyuan What was the motivation of using a batch job as the launcher?

Dec 02 '19 15:12 czheng94

cc @rongou to chime in on the original motivation for using a batch job here.

Dec 10 '19 15:12 terrytangyuan

The launcher's behavior is similar to a batch job. It looks like the log issue is with the batch job itself. If you run a standalone batch job, you'd run into the same problem. Maybe file a bug with Kubernetes?

Dec 13 '19 18:12 rongou
