yu lin

Results 5 issues of yu lin

MPIJob doesn't support restartPolicy=ExitCode. Ref: https://github.com/kubeflow/training-operator/issues/1768

kind/feature

Based on the Arena 2024 roadmap, I think we need to complete the following tasks to improve code quality and stability. - [x] Run `go fmt`, `go vet` on the entire...

Add MaxConcurrentReconciles to JobControllerConfiguration. Fixes: https://github.com/kubeflow/training-operator/pull/1707#discussion_r1057075176

size/S

/kind bug **What steps did you take and what happened:** When updating the inferenceService, if the deployment's replicas are not set, the deployment replicas will initially be set to 1, and then...

kind/bug

**What this PR does / why we need it**: Add a DeepSpeed example with the PyTorch Operator. The script used is HelloDeepSpeed from [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md). **Which issue(s) this PR fixes** _(optional, in `Fixes...

size/XL