
Training job restart enhancement

Open emeraldbay opened this issue 1 year ago • 13 comments

What you would like to be added?

Description

We are proposing changes to enhance training job restarts that can help avoid restart failures and delays in the case of GPU instance or Kubernetes node failures:

  1. Better job restart trigger: add a Kubernetes node watcher that watches for node condition and label changes, so that the Kubeflow training operator can trigger a training job restart based on a NodeCondition or NodeLabel change (see the sketch after this list).

  2. Add training job max retry count support: if a training job exceeds the max retry count, the job will be deleted.

  3. Force-delete Pods during restart.
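A minimal sketch, using standard client-go informers, of what the node watcher in item 1 could look like; restartJobsOnNode and the "GpuUnhealthy" condition type are hypothetical placeholders for the operator's restart hook and for whatever condition a node fault detector (e.g. Node Problem Detector) would set:

```go
package nodewatcher

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchNodes reacts to Node updates: when the node stops being Ready, or a
// custom condition set by a fault detector turns True, it calls the
// (hypothetical) restartJobsOnNode hook to restart affected training jobs.
func watchNodes(clientset kubernetes.Interface, restartJobsOnNode func(nodeName string), stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			node := newObj.(*corev1.Node)
			for _, cond := range node.Status.Conditions {
				notReady := cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue
				gpuBad := cond.Type == "GpuUnhealthy" && cond.Status == corev1.ConditionTrue
				if notReady || gpuBad {
					restartJobsOnNode(node.Name)
					return
				}
			}
		},
	})

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```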

Why is this needed?

  1. As mentioned in issue 2072, failed Kubernetes nodes currently leave jobs hanging indefinitely. The planned solution is to add Pod Failure Policy and Pod Disruption Condition support. But when a training job hits a GPU failure, the job might hang and the Pod may not exit with a failure status. We need better integration with Kubernetes node fault detection or NVIDIA GPU fault detection mechanisms; for example, Node Problem Detector uses NodeCondition to report problems to the API server. We want to add a Kubernetes node watcher that keeps monitoring NodeCondition and NodeLabel changes and triggers a training job restart (e.g. deletes all the Pods belonging to the training job).

  2. The current training operator restart policy does not support a max retry count. If we set restartPolicy to restart on failure, the job enters an infinite retry loop, which means a failed training job occupies resources indefinitely. We want to add a max retry count option.

  3. The current DeletePod implementation does not force-delete Pods. We want to add a force option for pod deletion that overrides the default 30s grace period, which causes unnecessary delays to restarts and can leave some Pods stuck in the Terminating state. A minimal sketch of what force deletion could look like follows this list.
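For item 3, a minimal sketch of force deletion with client-go: setting GracePeriodSeconds to 0 skips the default 30s grace period and also removes Pods that the kubelet can no longer confirm as terminated (e.g. on an unreachable node). Function and argument names here are illustrative:

```go
package restart

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod deletes a Pod immediately, bypassing the default 30s
// termination grace period so a restart is not blocked by a Pod stuck
// in the Terminating state on an unhealthy node.
func forceDeletePod(ctx context.Context, clientset kubernetes.Interface, namespace, name string) error {
	zero := int64(0)
	return clientset.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```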

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

emeraldbay avatar Jul 26 '24 21:07 emeraldbay

Thank you for creating this @emeraldbay!

Better job restart trigger:

Since the pod failure policy might not work and we require an additional node watcher to detect GPU issues, do we want to implement this feature at the TrainJob or JobSet level @tenzen-y ?

Add training job max retry count support: if a training job exceeds the max retry count, the job will be deleted.

This will be supported in V2 APIs: https://github.com/kubeflow/training-operator/pull/2171
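For context, TrainJob builds on JobSet, which in turn creates batch/v1 Jobs, where the existing retry cap is the Job's backoffLimit field. A minimal sketch of that field with illustrative values only (this is not the V2 API itself):

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// maxRetryJob shows the batch/v1 mechanism that caps retries: once the number
// of Pod failures exceeds backoffLimit, the Job is marked Failed instead of
// retrying forever. Name and image are illustrative only.
func maxRetryJob() *batchv1.Job {
	backoffLimit := int32(3)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "train-worker"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "trainer", Image: "example.com/trainer:latest"},
					},
				},
			},
		},
	}
}
```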

The current DeletePod implementation does not force-delete Pods. Add a force option for pod deletion that overrides the default 30s grace period. The default 30s grace period causes unnecessary delays to restarts, and some Pods can get stuck in the Terminating state.

@tenzen-y Is there a way to configure the Pod deletion grace period on a batch Job?

andreyvelich avatar Jul 31 '24 15:07 andreyvelich

Thanks. Could you please provide some context about the difference between Kubeflow training operator v2 and JobSet? Is JobSet expected to eventually replace the Kubeflow training operator in terms of training job submission?

emeraldbay avatar Aug 03 '24 19:08 emeraldbay

Thanks. Could you please provide some context about the difference between Kubeflow training operator v2 and JobSet? Is JobSet expected to eventually replace the Kubeflow training operator in terms of training job submission?

The training job submission will still be via the Training Operator: https://github.com/kubeflow/training-operator/blob/1f336d01af2c1e305bd6e660e079ffea107a51a9/docs/proposals/2170-kubeflow-training-v2/README.md#user-roles-diagram. TrainJob will just create an appropriate JobSet and additional resources (e.g. a hostfile for MPI) to orchestrate resources for model training.

andreyvelich avatar Aug 05 '24 17:08 andreyvelich

Thanks. @tenzen-y Could you please help comment on the questions above?

emeraldbay avatar Aug 05 '24 20:08 emeraldbay

Any update on this?

emeraldbay avatar Sep 02 '24 22:09 emeraldbay

Since the pod failure policy might not work and we require additional node watcher to detect GPU issues, do we want to implement this feature on TrainJob or JobSet level @tenzen-y ?

Sorry, I could not understand the reason for this. Why can the pod failure policy not detect Node problems?

tenzen-y avatar Sep 03 '24 08:09 tenzen-y

When a GPU failure happens, the training job might just hang and the Pod does not exit with a failure status.

emeraldbay avatar Sep 03 '24 21:09 emeraldbay

When a GPU failure happens, the training job might just hang and the Pod does not exit with a failure status.

As far as I know, that should not happen. If you are facing this problem, it is a bug in the kubelet or the device plugin. I would recommend reporting it to SIG Node.

tenzen-y avatar Sep 03 '24 21:09 tenzen-y

One example is NCCL communication getting stuck due to a GPU failure; eventually it will time out and the kubelet will react to that, but that can be a long wait depending on the NCCL timeout config. We have better signals at the node level and we want the capability to act on them.

The NVIDIA device plugin might report a missing GPU in some cases, but in general we have seen that it does not cover all failure patterns.

emeraldbay avatar Sep 03 '24 21:09 emeraldbay

@emeraldbay For the NCCL communication error, don't you want to integrate custom error handlers in your PyTorch code in case of a timeout?

The NVIDIA device plugin might report a missing GPU in some cases, but in general we have seen that it does not cover all failure patterns.

Do you know how the device plugin detects such missing GPUs and how it reports the results?

andreyvelich avatar Sep 04 '24 12:09 andreyvelich

@andreyvelich In short, we want fast recovery before the NCCL timeout, instead of using the NCCL timeout to trigger our recovery/error handler. This is mainly because we see other signals that can tell us there are GPU failures.

The device plugin checks a subset of driver error logs and will change the available GPU device count if it detects a failure, e.g. "Updated allocatable device="nvidia.com/gpu" allocatable=X". Overall, the kubelet and the NVIDIA device plugin do not offer what we need right now.
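For context, the allocatable count the device plugin updates is visible on the Node object, so a watcher could also react to a drop there. A minimal sketch, where the expected count is an assumption the caller has to supply:

```go
package gpucheck

import (
	corev1 "k8s.io/api/core/v1"
)

// gpuAllocatableDropped reports whether the node currently advertises fewer
// allocatable nvidia.com/gpu devices than expected, which is one of the
// signals the device plugin updates after it detects a failure.
func gpuAllocatableDropped(node *corev1.Node, expected int64) bool {
	qty, ok := node.Status.Allocatable["nvidia.com/gpu"]
	if !ok {
		// The resource is no longer advertised at all.
		return expected > 0
	}
	return qty.Value() < expected
}
```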

For this issue, I am mainly trying to understand whether you are open to enhancing the job restart logic to consider node/GPU failures. If you think the kubelet and the NVIDIA device plugin are responsible for detection, and Pod failure should be the only trigger of a KTO job restart on node/GPU failures, please let us know. Thanks.

emeraldbay avatar Sep 05 '24 14:09 emeraldbay

This is mainly because we see other signals that can tell us there are GPU failures.

What kind of signals do you monitor to detect such failures? Do you track GPU utilization via the NVIDIA DCGM exporter or something else?

If you think the kubelet and the NVIDIA device plugin are responsible for detection, and Pod failure should be the only trigger of a KTO job restart on node/GPU failures, please let us know. Thanks.

@kubeflow/wg-training-leads Any thoughts on this? Should we detect such use cases in the Training Operator orchestration logic?

andreyvelich avatar Sep 09 '24 12:09 andreyvelich

What kind of signals do you monitor to detect such failures? Do you track GPU utilization via the NVIDIA DCGM exporter or something else?

We have our own fault detection mechanism: we run a DaemonSet that does continuous monitoring.

emeraldbay avatar Sep 09 '24 12:09 emeraldbay

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 08 '24 15:12 github-actions[bot]

Any further input on this issue?

emeraldbay avatar Dec 08 '24 16:12 emeraldbay

Sorry, no updates here since we've been focusing on the Kubeflow Training V2 work @emeraldbay. Do you want to propose a KEP on what we can improve in V2 to address this feature?

andreyvelich avatar Dec 09 '24 12:12 andreyvelich

Sure. Besides https://github.com/kubeflow/training-operator/blob/1f336d01af2c1e305bd6e660e079ffea107a51a9/docs/proposals/2170-kubeflow-training-v2/README.md#user-roles-diagram, are there any other Google docs I could read to catch up on the latest Kubeflow Training V2 discussion?

emeraldbay avatar Dec 09 '24 17:12 emeraldbay

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 09 '25 20:03 github-actions[bot]

Sorry for the late reply @emeraldbay! We've added the initial docs for Kubeflow Trainer to the Kubeflow website: https://www.kubeflow.org/docs/components/trainer/overview/, feel free to check them out! Given that Kubeflow TrainJob currently depends on the Kubernetes Job API, you can get fine-grained control over Pod restarts with the Pod Failure Policy API: https://kubernetes.io/docs/tasks/job/pod-failure-policy/
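For anyone catching up on this thread, a minimal sketch of the kind of Pod Failure Policy the linked docs describe, using the batch/v1 Go types; the exit code and rule choices are illustrative, not a recommendation:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// examplePodFailurePolicy fails the Job immediately on a non-retriable exit
// code and ignores Pods terminated due to a node disruption, so those do not
// count against backoffLimit. Exit code 42 is illustrative only.
func examplePodFailurePolicy() *batchv1.PodFailurePolicy {
	return &batchv1.PodFailurePolicy{
		Rules: []batchv1.PodFailurePolicyRule{
			{
				Action: batchv1.PodFailurePolicyActionFailJob,
				OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
					Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
					Values:   []int32{42},
				},
			},
			{
				Action: batchv1.PodFailurePolicyActionIgnore,
				OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
					{Type: corev1.DisruptionTarget, Status: corev1.ConditionTrue},
				},
			},
		},
	}
}
```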

Feel free to reach out to us on Slack (#kubeflow-training) or open an issue if you have more questions!

andreyvelich avatar Mar 10 '25 01:03 andreyvelich