
fix(runtimes): propagate Trainer.NumNodes into TemplateSpec (Parallelism/Completions)

Open NarayanaSabari opened this issue 2 months ago • 2 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes the RuntimeInfo helpers so that PodSets.Count values (set from numNodes) are synchronized into TemplateSpec.ObjApply (the JobSet specification).

Previously, when external consumers such as Kueue called RuntimeInfo(), the ML policy plugins (Torch, PlainML, MPI) correctly updated info.TemplateSpec.PodSets[*].Count, but the underlying JobSet spec (info.TemplateSpec.ObjApply) kept its original Parallelism and Completions values. Consumers therefore had to reconcile the two with post-processing workarounds.

Changes:

  1. Added a SyncPodSetsToTemplateSpec() method in runtime.go that syncs PodSets.Count to the JobSet's Parallelism/Completions
  2. Called this method in RuntimeInfo() after RunEnforceMLPolicyPlugins() runs
  3. Added unit tests for the new sync function
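
For readers who want a picture of what the sync step does, here is a minimal sketch. It is not the code from this PR: `PodSet`, `ReplicatedJob`, `syncPodSets`, and `int32Ptr` are hypothetical, simplified stand-ins for the real RuntimeInfo and JobSet apply-configuration types, and matching pod sets to replicated jobs by name is an assumption made for illustration.

```go
package main

import "fmt"

// PodSet is a hypothetical, simplified stand-in for an entry in
// info.TemplateSpec.PodSets. Count is what the ML policy plugins update.
type PodSet struct {
	Name  string
	Count *int32
}

// ReplicatedJob is a hypothetical, simplified stand-in for a replicated job
// inside the JobSet spec held in info.TemplateSpec.ObjApply.
type ReplicatedJob struct {
	Name        string
	Parallelism *int32
	Completions *int32
}

// syncPodSets copies each PodSet's Count into the Parallelism and Completions
// of the replicated job with the same name (name matching is an assumption).
// Jobs without a matching PodSet, or whose PodSet has a nil Count, are left
// unchanged.
func syncPodSets(podSets []PodSet, jobs []ReplicatedJob) {
	counts := make(map[string]int32, len(podSets))
	for _, ps := range podSets {
		if ps.Count != nil {
			counts[ps.Name] = *ps.Count
		}
	}
	for i := range jobs {
		count, ok := counts[jobs[i].Name]
		if !ok {
			continue
		}
		// Use separate variables so the two fields do not share a pointer.
		p, c := count, count
		jobs[i].Parallelism = &p
		jobs[i].Completions = &c
	}
}

// int32Ptr returns a pointer to a copy of v.
func int32Ptr(v int32) *int32 { return &v }

func main() {
	// The plugins have raised Count to 4, but the job still says 1/1.
	podSets := []PodSet{{Name: "node", Count: int32Ptr(4)}}
	jobs := []ReplicatedJob{{Name: "node", Parallelism: int32Ptr(1), Completions: int32Ptr(1)}}

	syncPodSets(podSets, jobs)
	fmt.Println(*jobs[0].Parallelism, *jobs[0].Completions) // prints "4 4"
}
```

In this sketch the sync runs once, after the counts are final, which mirrors the ordering described in item 2 (after RunEnforceMLPolicyPlugins()).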

Which issue(s) this PR fixes:

Fixes https://github.com/kubeflow/trainer/issues/3042

Related Kueue issue: https://github.com/kubernetes-sigs/kueue/issues/8287

Special notes for your reviewer:

  • The existing comment in runtime.go line 45 already mentioned: "ObjApply podSpecs and this PodSets should be kept in sync by info.SyncPodSetsToTemplateSpec()" - this PR implements that function.
  • Once this is merged, the workaround in Kueue's trainjob_controller.go can be removed.
  • All existing tests pass.

NarayanaSabari, Dec 24 '25 11:12

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot], Dec 24 '25 11:12

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification! Thanks again for contributing to Kubeflow! 🙏

github-actions[bot], Dec 24 '25 11:12

/ok-to-test

andreyvelich, Jan 18 '26 23:01