training-operator icon indicating copy to clipboard operation
training-operator copied to clipboard

Don't require the Trainjob to be suspended to update podTemplateOverrides

Open kaisoz opened this issue 2 months ago • 9 comments

What you would like to be added?

The Trainjob admission webhook currently requires the Trainjob to be suspended in order to update the podTemplateOverrides field. This forces external controllers (such as Kueue) to perform multiple API requests for a single logical update.

Why is this needed?

Removing this requirement would simplify controller workflows and reduce unnecessary API server traffic.

Love this feature?

Give it a 👍 We prioritize the features with most 👍

kaisoz avatar Dec 17 '25 14:12 kaisoz

cc @mimowo

kaisoz avatar Dec 17 '25 14:12 kaisoz

+1, thank you for reporting, I think this webhook is too strong.

It should allow to update the podTemplateOverrides when transitioning from suspended -> unsuspended. This is safe, because the job was suspnded before. This is also how k8s core validation works, only checking oldJob, ptal here

mimowo avatar Dec 19 '25 08:12 mimowo

There is some subtlety here. I think scheduling directives (taints/tolerations/nodeselectors) can be updated.

But anything other than that in PodTemplae cannot not be updated.

There is a feature gate to relax pod resources but it's alpha in 1.35.

kannon92 avatar Dec 20 '25 14:12 kannon92

TrainJob: Single API Request for Unsuspend + PodTemplateOverrides Update

Related Issues


Problem Statement

When Kueue starts a TrainJob, it currently requires two API requests:

  1. Update podTemplateOverrides (inject NodeSelectors)
  2. Set suspend=false (unsuspend the job)

This is because the TrainJob admission webhook requires the job to be suspended before allowing podTemplateOverrides updates.

Impact

  • Race conditions between the two API calls
  • Unnecessary API server traffic
  • Complex controller logic

Proposed Solution

Relax the Trainer webhook validation to allow podTemplateOverrides updates when transitioning from suspended=truesuspended=false in a single request.

Validation Logic Change

// Current: Requires job.Spec.Suspend == true to update podTemplateOverrides
// Proposed: Allow if oldJob.Spec.Suspend == true (job was suspended before this update)

This follows the same pattern as Kubernetes core Job validation.

Scope of Updates

Only scheduling directives should be updatable:

  • NodeSelectors
  • Tolerations
  • Taints

Other PodTemplate fields remain immutable.


Implementation Plan

Step Repo Task
1 kubeflow/trainer Update webhook in pkg/runtime/framework/plugins/jobset/jobset.go to check oldJob.Spec.Suspend instead of newJob.Spec.Suspend
2 kubeflow/trainer Add unit tests for the new validation logic
3 kubernetes-sigs/kueue Update TrainJob controller to combine unsuspend + podTemplateOverrides in single API call
4 kubernetes-sigs/kueue Remove workaround for two-step update

Open Questions for Reviewers

  1. Should we limit podTemplateOverrides updates to only scheduling-related fields, or allow the full spec during suspend→unsuspend transition?

  2. Is there any concern with allowing this update pattern from a security/audit perspective?

  3. Are there other consumers of TrainJob that might be affected by this change?


References

NarayanaSabari avatar Dec 24 '25 11:12 NarayanaSabari

There is some subtlety here. I think scheduling directives (taints/tolerations/nodeselectors) can be updated.

But anything other than that in PodTemplae cannot not be updated.

There is a feature gate to relax pod resources but it's alpha in 1.35.

This is restriction at Trainer level not at k8s job level :S

It is enforced here at jobset validation time. It is initiated by trainjob webhook, who ends up calling the plugins (such as Jobset) validation functions through the TrainingRuntime

cc @NarayanaSabari

kaisoz avatar Dec 28 '25 22:12 kaisoz

Thanks for the detailed breakdown @kaisoz! This makes the path forward very clear.

To summarize the fix needed in pkg/runtime/framework/plugins/jobset/jobset.go at checkPodTemplateOverridesImmutability():

Current logic (line 167):

suspended := ptr.Equal(newObj.Spec.Suspend, ptr.To(true))

Proposed fix:

oldSuspended := ptr.Equal(oldObj.Spec.Suspend, ptr.To(true))

Then change the condition to check oldSuspended instead of suspended. This would allow updating podTemplateOverrides + setting suspend=false in a single API request, as long as the job was previously suspended.

This aligns with the Kubernetes core Job validation pattern that @mimowo referenced.

The existing safety check for JobSet.Status.ReplicatedJobsStatus[].Active > 0 (lines 174-182) still provides protection against modifying while pods are running.

I can work on a PR for this if there's agreement on the approach. Would also need a follow-up change in Kueue to combine the two API calls once this lands.

NarayanaSabari avatar Dec 29 '25 06:12 NarayanaSabari

Thanks for the detailed breakdown @kaisoz! This makes the path forward very clear.

To summarize the fix needed in pkg/runtime/framework/plugins/jobset/jobset.go at checkPodTemplateOverridesImmutability():

Current logic (line 167):

suspended := ptr.Equal(newObj.Spec.Suspend, ptr.To(true)) Proposed fix:

oldSuspended := ptr.Equal(oldObj.Spec.Suspend, ptr.To(true)) Then change the condition to check oldSuspended instead of suspended. This would allow updating podTemplateOverrides + setting suspend=false in a single API request, as long as the job was previously suspended.

This aligns with the Kubernetes core Job validation pattern that @mimowo referenced.

The existing safety check for JobSet.Status.ReplicatedJobsStatus[].Active > 0 (lines 174-182) still provides protection against modifying while pods are running.

I can work on a PR for this if there's agreement on the approach. Would also need a follow-up change in Kueue to combine the two API calls once this lands.

I think this should work. WDYT @astefanutti? (since you wrote the original implementation 😊 )

kaisoz avatar Jan 08 '26 00:01 kaisoz

/assign

NarayanaSabari avatar Jan 11 '26 05:01 NarayanaSabari

@kaisoz @astefanutti @mimowo @andreyvelich I have raised a PR for this issue, let me know if there is any changes need to make

NarayanaSabari avatar Jan 24 '26 05:01 NarayanaSabari