mpi-operator

Can't access launcher logs after MpiJob fails

Open czheng94 opened this issue 5 years ago • 6 comments

The launcher Job deletes all of its pods once the number of retries exceeds backoffLimit or the total running time exceeds activeDeadlineSeconds. As a result, users can't access the error logs after the MPIJob fails.

This is the Kubernetes batch Job's own behavior: when the Job fails, its pods are deleted, and their logs go with them. Links to the source code: https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/pkg/controller/job/job_controller.go#L520 https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/pkg/controller/job/job_controller.go#L602

Any advice?

czheng94 avatar Nov 25 '19 00:11 czheng94

@czheng94 Do you mean "cannot access worker logs" instead of "launcher logs"? There will still be logs for workers if kube batch is not used, right?

terrytangyuan avatar Nov 25 '19 18:11 terrytangyuan

@terrytangyuan I meant launcher logs. mpirun doesn't emit any logs in the workers. The key point is that the failed launcher pod is deleted by the kube batch job once it is done retrying.

czheng94 avatar Nov 25 '19 19:11 czheng94

If that's how kube batch works now then I don't think there's a way to get around it since kube batch doesn't expose this option. I would recommend looking into persisting your logs to disk using a logging sidecar service.
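Short of a full logging sidecar, one lightweight way to persist launcher output is to stream it to a file while the pod still exists. The following is only a sketch under assumptions: it shells out to `kubectl logs --follow`, and the pod name, namespace, and output path are hypothetical placeholders, not values from this issue.

```python
import subprocess


def kubectl_logs_command(pod, namespace="default"):
    """Build the argv for following a pod's logs with kubectl."""
    return ["kubectl", "logs", "--follow", pod, "--namespace", namespace]


def save_logs(pod, namespace, path):
    """Stream the pod's logs into `path` until the pod exits or is deleted.

    Returns kubectl's exit code. Assumes kubectl is installed and
    configured for the target cluster.
    """
    with open(path, "wb") as out:
        result = subprocess.run(
            kubectl_logs_command(pod, namespace),
            stdout=out,
            stderr=subprocess.STDOUT,
        )
    return result.returncode


# Example call (hypothetical names); requires access to a cluster:
# save_logs("my-mpijob-launcher-abcde", "training", "launcher.log")
```

This has to be running before the Job controller removes the pod, so it is a stopgap rather than a replacement for a proper log collection setup.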

terrytangyuan avatar Nov 25 '19 19:11 terrytangyuan

Indeed, this is the expected behavior of a kube batch job. If you set restartPolicy = "OnFailure" in the launcher pod template, all pods are terminated and deleted once the backoff limit (set to 6 by default by mpi-operator) has been reached. https://github.com/kubernetes/website/pull/14709/files

A workaround is to set restartPolicy="Never" in the launcher pod template. This results in N+1 launcher pods remaining in the Error state at the end, after N backoff retries.
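As a rough illustration of that workaround, the sketch below patches a pod template represented as a plain dict. The template contents (container name, image) are hypothetical, and only the standard `spec.restartPolicy` pod field is assumed; where the template sits inside an MPIJob spec depends on the mpi-operator API version and is not shown.

```python
import copy


def with_restart_policy_never(pod_template):
    """Return a copy of the pod template with spec.restartPolicy set to "Never"."""
    patched = copy.deepcopy(pod_template)
    patched.setdefault("spec", {})["restartPolicy"] = "Never"
    return patched


def failed_pods_left(backoff_limit):
    """Launcher pods left in Error state: the first attempt plus each retry."""
    return backoff_limit + 1


# Hypothetical launcher pod template, for illustration only.
launcher_template = {
    "spec": {
        "restartPolicy": "OnFailure",
        "containers": [{"name": "launcher", "image": "example/mpi-launcher:latest"}],
    }
}

patched = with_restart_policy_never(launcher_template)
```

With the default backoff limit of 6 mentioned above, `failed_pods_left(6)` gives 7 failed launcher pods whose logs stay readable until the pods are cleaned up.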

@terrytangyuan What was the motivation of using a batch job as the launcher?

czheng94 avatar Dec 02 '19 15:12 czheng94

cc @rongou to chime in on the original motivation for using a batch job here.

terrytangyuan avatar Dec 10 '19 15:12 terrytangyuan

The launcher's behavior is similar to a batch job. Looks like the log issue is with batch job itself. If you run a standalone batch job, you'd run into the same problems. Maybe file a bug with kubernetes?

rongou avatar Dec 13 '19 18:12 rongou