mpi-operator

Support exitCode restartPolicy

Syulin7 opened this issue 1 year ago • 11 comments

MPIJob doesn't support restartPolicy=ExitCode

ref: https://github.com/kubeflow/training-operator/issues/1768

Syulin7 avatar Mar 03 '23 06:03 Syulin7

/kind feature

tenzen-y avatar Mar 03 '23 06:03 tenzen-y

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

alculquicondor avatar Mar 03 '23 13:03 alculquicondor

If we add exitCode, we might be further from this goal https://github.com/kubeflow/training-operator/issues/1718

alculquicondor avatar Mar 03 '23 13:03 alculquicondor

Can you clarify what is the expected behavior?

I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

Probably, he expects a similar behavior to other training controllers (e.g., tfjob).

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173

If we add exitCode, we might be further from this goal https://github.com/kubeflow/training-operator/issues/1718

In the future, it would be good to support PodFailurePolicy once we introduce batch/v1 Job to workers.

However, it might be worth supporting logic similar to the other CustomJobs in the short term. These changes wouldn't break the current logic or API.

WDYT?

tenzen-y avatar Mar 03 '23 14:03 tenzen-y

I guess we'll need a new API version anyways.

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.
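As a rough sketch of that translation, the ExitCode convention (1-127 permanent, 128-255 retryable) could be expressed as two PodFailurePolicy rules: permanent codes fail the Job immediately, retryable ones only count toward backoffLimit so the launcher pod is recreated. The types below are minimal local stand-ins for illustration; a real implementation would use the `PodFailurePolicy` types from `k8s.io/api/batch/v1` (including `onExitCodes` with a container name) instead.

```go
package main

import "fmt"

// Simplified stand-ins for the batch/v1 PodFailurePolicy types,
// defined locally just for this sketch.
type PodFailurePolicyRule struct {
	Action   string  // "FailJob", "Ignore", or "Count"
	Operator string  // "In" or "NotIn"
	Values   []int32 // exit codes the rule matches
}

type PodFailurePolicy struct {
	Rules []PodFailurePolicyRule
}

// exitCodePolicy maps the CustomJob ExitCode semantics onto a
// PodFailurePolicy: exit codes 1-127 fail the Job outright, and
// any other nonzero code (128-255) is merely counted, so the Job
// controller restarts the launcher up to backoffLimit.
func exitCodePolicy() PodFailurePolicy {
	permanent := make([]int32, 0, 127)
	for c := int32(1); c <= 127; c++ {
		permanent = append(permanent, c)
	}
	return PodFailurePolicy{
		Rules: []PodFailurePolicyRule{
			{Action: "FailJob", Operator: "In", Values: permanent},
			// Everything else that fails (128-255) is retryable.
			{Action: "Count", Operator: "NotIn", Values: permanent},
		},
	}
}

func main() {
	p := exitCodePolicy()
	fmt.Println(len(p.Rules), p.Rules[0].Action, p.Rules[1].Action)
}
```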

alculquicondor avatar Mar 03 '23 14:03 alculquicondor

I guess we'll need a new API version anyways.

I don't think we need a new API version since we can support restartPolicy=ExitCode, the same as other CustomJobs, using the following logic:

	// Get the exit code of the container.
	var exitCode int32 = 0xbeef // magic number
	for _, status := range pod.Status.ContainerStatuses {
		state := status.State
		if status.Name == r.GetDefaultContainerName() && state.Terminated != nil {
			exitCode = state.Terminated.ExitCode
			logger.Infof("Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
			r.GetRecorder().Eventf(job, corev1.EventTypeNormal, "ExitedWithCode", "Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
		}
	}
	// Check if the pod is retryable.
	if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
		if pod.Status.Phase == corev1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
			failedPodsCount.Inc()
			logger.Infof("Need to restart the pod: %v.%v", pod.Namespace, pod.Name)
			if err = r.DeletePod(ctx, pod.Namespace, pod.Name); err != nil {
				return err
			}
		}
	}

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/reconciler.v1/common/pod.go#L172-L191

If we support the same logic as the other CustomJobs, we won't provide configurable exit codes the way PodFailurePolicy does; the retryable range is fixed:

	// RestartPolicyExitCode policy means that user should add exit code by themselves,
	// The job operator will check these exit codes to
	// determine the behavior when an error occurs:
	// - 1-127: permanent error, do not restart.
	// - 128-255: retryable error, will restart the pod.
	RestartPolicyExitCode RestartPolicy = "ExitCode"

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173
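The `trainutil.IsRetryableExitCode` helper used in the reconciler snippet above classifies codes according to this convention. A minimal standalone sketch of that check (the name and exact implementation in kubeflow/common may differ):

```go
package main

import "fmt"

// isRetryableExitCode follows the RestartPolicyExitCode convention:
// exit codes 1-127 are permanent errors (do not restart), while
// 128-255, which typically indicate termination by a signal
// (e.g., 137 = 128 + SIGKILL), are retryable.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}

func main() {
	for _, c := range []int32{1, 127, 128, 137, 255} {
		fmt.Printf("exit %d retryable: %v\n", c, isRetryableExitCode(c))
	}
}
```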

tenzen-y avatar Mar 03 '23 14:03 tenzen-y

Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.

I agree.

tenzen-y avatar Mar 03 '23 14:03 tenzen-y

Sorry, what I meant is that we might need a new API version once we plan to migrate everything to batch/Job.

alculquicondor avatar Mar 03 '23 14:03 alculquicondor

Sorry, what I meant is that we might need a new API version once we plan to migrate everything to batch/Job.

Ah, I see. I agree. We need a new API version if we migrate workers to batch/Jobs.

In the short term, I would like to support restartPolicy=ExitCode in MPIJob v2beta1, similar to the other CustomJobs. If we migrate everything to batch/Job, we might need to use an elastic Indexed Job for elastic training (e.g., elastic Horovod).

WDYT?

tenzen-y avatar Mar 03 '23 14:03 tenzen-y

sounds good

alculquicondor avatar Mar 03 '23 15:03 alculquicondor

/assign

tenzen-y avatar Mar 03 '23 17:03 tenzen-y