Don't require the Trainjob to be suspended to update podTemplateOverrides
What you would like to be added?
The Trainjob admission webhook currently requires the Trainjob to be suspended in order to update the podTemplateOverrides field. This forces external controllers (such as Kueue) to perform multiple API requests for a single logical update.
Why is this needed?
Removing this requirement would simplify controller workflows and reduce unnecessary API server traffic.
Love this feature?
Give it a 👍 We prioritize the features with most 👍
cc @mimowo
+1, thank you for reporting, I think this webhook is too strong.
It should allow to update the podTemplateOverrides when transitioning from suspended -> unsuspended. This is safe, because the job was suspnded before. This is also how k8s core validation works, only checking oldJob, ptal here
There is some subtlety here. I think scheduling directives (taints/tolerations/nodeselectors) can be updated.
But anything other than that in PodTemplae cannot not be updated.
There is a feature gate to relax pod resources but it's alpha in 1.35.
TrainJob: Single API Request for Unsuspend + PodTemplateOverrides Update
Related Issues
- Trainer: kubeflow/trainer#3043
- Kueue: kubernetes-sigs/kueue#8296
Problem Statement
When Kueue starts a TrainJob, it currently requires two API requests:
- Update
podTemplateOverrides(inject NodeSelectors) - Set
suspend=false(unsuspend the job)
This is because the TrainJob admission webhook requires the job to be suspended before allowing podTemplateOverrides updates.
Impact
- Race conditions between the two API calls
- Unnecessary API server traffic
- Complex controller logic
Proposed Solution
Relax the Trainer webhook validation to allow podTemplateOverrides updates when transitioning from suspended=true → suspended=false in a single request.
Validation Logic Change
// Current: Requires job.Spec.Suspend == true to update podTemplateOverrides
// Proposed: Allow if oldJob.Spec.Suspend == true (job was suspended before this update)
This follows the same pattern as Kubernetes core Job validation.
Scope of Updates
Only scheduling directives should be updatable:
- NodeSelectors
- Tolerations
- Taints
Other PodTemplate fields remain immutable.
Implementation Plan
| Step | Repo | Task |
|---|---|---|
| 1 | kubeflow/trainer | Update webhook in pkg/runtime/framework/plugins/jobset/jobset.go to check oldJob.Spec.Suspend instead of newJob.Spec.Suspend |
| 2 | kubeflow/trainer | Add unit tests for the new validation logic |
| 3 | kubernetes-sigs/kueue | Update TrainJob controller to combine unsuspend + podTemplateOverrides in single API call |
| 4 | kubernetes-sigs/kueue | Remove workaround for two-step update |
Open Questions for Reviewers
-
Should we limit
podTemplateOverridesupdates to only scheduling-related fields, or allow the full spec during suspend→unsuspend transition? -
Is there any concern with allowing this update pattern from a security/audit perspective?
-
Are there other consumers of TrainJob that might be affected by this change?
References
There is some subtlety here. I think scheduling directives (taints/tolerations/nodeselectors) can be updated.
But anything other than that in PodTemplae cannot not be updated.
There is a feature gate to relax pod resources but it's alpha in 1.35.
This is restriction at Trainer level not at k8s job level :S
It is enforced here at jobset validation time. It is initiated by trainjob webhook, who ends up calling the plugins (such as Jobset) validation functions through the TrainingRuntime
cc @NarayanaSabari
Thanks for the detailed breakdown @kaisoz! This makes the path forward very clear.
To summarize the fix needed in pkg/runtime/framework/plugins/jobset/jobset.go at checkPodTemplateOverridesImmutability():
Current logic (line 167):
suspended := ptr.Equal(newObj.Spec.Suspend, ptr.To(true))
Proposed fix:
oldSuspended := ptr.Equal(oldObj.Spec.Suspend, ptr.To(true))
Then change the condition to check oldSuspended instead of suspended. This would allow updating podTemplateOverrides + setting suspend=false in a single API request, as long as the job was previously suspended.
This aligns with the Kubernetes core Job validation pattern that @mimowo referenced.
The existing safety check for JobSet.Status.ReplicatedJobsStatus[].Active > 0 (lines 174-182) still provides protection against modifying while pods are running.
I can work on a PR for this if there's agreement on the approach. Would also need a follow-up change in Kueue to combine the two API calls once this lands.
Thanks for the detailed breakdown @kaisoz! This makes the path forward very clear.
To summarize the fix needed in
pkg/runtime/framework/plugins/jobset/jobset.goatcheckPodTemplateOverridesImmutability():Current logic (line 167):
suspended := ptr.Equal(newObj.Spec.Suspend, ptr.To(true)) Proposed fix:
oldSuspended := ptr.Equal(oldObj.Spec.Suspend, ptr.To(true)) Then change the condition to check
oldSuspendedinstead ofsuspended. This would allow updatingpodTemplateOverrides+ settingsuspend=falsein a single API request, as long as the job was previously suspended.This aligns with the Kubernetes core Job validation pattern that @mimowo referenced.
The existing safety check for
JobSet.Status.ReplicatedJobsStatus[].Active > 0(lines 174-182) still provides protection against modifying while pods are running.I can work on a PR for this if there's agreement on the approach. Would also need a follow-up change in Kueue to combine the two API calls once this lands.
I think this should work. WDYT @astefanutti? (since you wrote the original implementation 😊 )
/assign
@kaisoz @astefanutti @mimowo @andreyvelich I have raised a PR for this issue, let me know if there is any changes need to make