mpi-operator
Support exitCode restartPolicy
MPIJob doesn't support restartPolicy=ExitCode
ref: https://github.com/kubeflow/training-operator/issues/1768
/kind feature
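For context, restartPolicy is a per-replica field in the kubeflow/common API, so the request is essentially to honor ExitCode there for MPIJob workers, as the other training operators already do. A minimal sketch of the field in question (pod template omitted; mpi-operator does not honor this value at the time of this issue):

import common "github.com/kubeflow/common/pkg/apis/common/v1"

// What the issue asks for: a worker ReplicaSpec whose RestartPolicy is
// ExitCode, so failed workers are restarted (or not) based on exit code.
var workerSpec = common.ReplicaSpec{
    RestartPolicy: common.RestartPolicyExitCode,
    // Template: corev1.PodTemplateSpec{...} // pod template omitted
}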
Can you clarify what is the expected behavior?
I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
If we add exitCode, we might be further from this goal https://github.com/kubeflow/training-operator/issues/1718
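For reference, a rough sketch of what batch/v1 Job's podFailurePolicy can express, using the upstream Go types (Kubernetes 1.25+). Besides exit codes, it can also match pod conditions such as disruptions, which a plain ExitCode restart policy cannot:

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

var examplePolicy = batchv1.PodFailurePolicy{
    Rules: []batchv1.PodFailurePolicyRule{
        {
            // Fail the Job immediately on a specific "permanent" exit code.
            Action: batchv1.PodFailurePolicyActionFailJob,
            OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
                Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
                Values:   []int32{42},
            },
        },
        {
            // Don't count pods killed by disruptions (preemption, node drain)
            // against the backoff limit.
            Action: batchv1.PodFailurePolicyActionIgnore,
            OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
                {Type: corev1.DisruptionTarget, Status: corev1.ConditionTrue},
            },
        },
    },
}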
Can you clarify what is the expected behavior?
I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
Probably, they expect behavior similar to the other training controllers (e.g., TFJob).
https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173
If we add exitCode, we might be further from this goal https://github.com/kubeflow/training-operator/issues/1718
In the future, it would be good to support PodFailurePolicy once we introduce batch/v1 Job to workers.
However, it might be worth supporting logic similar to the other CustomJobs in the short term. These changes wouldn't break the current logic or API.
WDYT?
I guess we'll need a new API version anyways.
Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.
I guess we'll need a new API version anyways.
I don't think we need a new API version, since we can support restartPolicy=ExitCode, the same as the other CustomJobs, using the following logic:
// Get the exit code of the container.
var exitCode int32 = 0xbeef // magic number
for _, status := range pod.Status.ContainerStatuses {
    state := status.State
    if status.Name == r.GetDefaultContainerName() && state.Terminated != nil {
        exitCode = state.Terminated.ExitCode
        logger.Infof("Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
        r.GetRecorder().Eventf(job, corev1.EventTypeNormal, "ExitedWithCode", "Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
    }
}

// Check if the pod is retryable.
if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
    if pod.Status.Phase == corev1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
        failedPodsCount.Inc()
        logger.Infof("Need to restart the pod: %v.%v", pod.Namespace, pod.Name)
        if err = r.DeletePod(ctx, pod.Namespace, pod.Name); err != nil {
            return err
        }
    }
}
https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/reconciler.v1/common/pod.go#L172-L191
If we support logic similar to the other CustomJobs, we don't provide configurable exit codes the way PodFailurePolicy does.
// RestartPolicyExitCode policy means that user should add exit code by themselves,
// The job operator will check these exit codes to
// determine the behavior when an error occurs:
// - 1-127: permanent error, do not restart.
// - 128-255: retryable error, will restart the pod.
RestartPolicyExitCode RestartPolicy = "ExitCode"
https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/apis/common/v1/types.go#L167-L173
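For reference, the IsRetryableExitCode check used in the reconciler snippet above just encodes this convention; roughly (the actual helper lives in kubeflow/common's train util package):

// Exit codes 128-255 (e.g. 137 = SIGKILL) are treated as retryable;
// 1-127 are treated as permanent errors and the pod is not restarted.
func IsRetryableExitCode(exitCode int32) bool {
    return exitCode >= 128
}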
Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods.
I agree.
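A rough sketch of that translation for the launcher Job, under the convention quoted above: exit codes 1-127 fail the Job as permanent errors, while 128-255 are left to the Job's normal backoff/retry handling. Note that podFailurePolicy requires the pod template's restartPolicy to be Never; the helper names below are illustrative, not mpi-operator code:

import batchv1 "k8s.io/api/batch/v1"

// permanentExitCodes lists exit codes 1-127, i.e. permanent errors under
// the ExitCode convention quoted above.
func permanentExitCodes() []int32 {
    codes := make([]int32, 0, 127)
    for c := int32(1); c <= 127; c++ {
        codes = append(codes, c)
    }
    return codes
}

// withExitCodePolicy is an illustrative helper that adds the translated
// policy to the launcher Job spec.
func withExitCodePolicy(job *batchv1.Job) {
    job.Spec.PodFailurePolicy = &batchv1.PodFailurePolicy{
        Rules: []batchv1.PodFailurePolicyRule{{
            // Permanent error: fail the launcher Job, do not retry.
            Action: batchv1.PodFailurePolicyActionFailJob,
            OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
                Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
                Values:   permanentExitCodes(),
            },
        }},
        // Exit codes 128-255 match no rule, so they are counted against
        // backoffLimit and retried as usual.
    }
}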
Sorry, what I meant is that we might need a new API version once we plan to migrate everything to batch/Job.
Sorry, what I meant is that we might need a new API version once we plan to migrate everything to batch/Job.
Ah, I see. I agree. We need a new API version if we migrate workers to batch/Jobs.
In the short term, I would like to support restartPolicy=ExitCode in MPIJob v2beta1, similar to the other CustomJobs, since if we migrate everything to batch/Job we might need Elastic Indexed Jobs for the elastic training (e.g., elastic Horovod).
WDYT?
sounds good
/assign