
TrainJob suspend/resume fails with JobSet webhook validation error

Open abhijeet-dhumal opened this issue 3 months ago • 4 comments

What happened?

TrainJob suspend/resume functionality is broken when using the JobSet runtime. Attempting to suspend or resume a TrainJob fails with a webhook validation error, preventing the suspend state from propagating to the underlying JobSet.

Error message:

admission webhook "vjobset.kb.io" denied the request:
spec.replicatedJobs: Invalid value: []v1alpha2.ReplicatedJob(nil): field is immutable

Root Cause

The controller uses Server-Side Apply (SSA) to update the JobSet. SSA sends an ApplyConfiguration containing all fields, including immutable ones like spec.replicatedJobs. The JobSet webhook validates that immutable fields haven't changed, but it can't distinguish "unchanged" from "changed" in ApplyConfigurations, so it rejects the update.
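To make the distinction concrete, here is a minimal stdlib-only sketch (the struct is a hypothetical subset of the JobSet spec, not the real API type) contrasting what the two request bodies look like on the wire: an SSA ApplyConfiguration carries the full desired state, immutable fields included, while a merge patch carries only the changed field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobSetSpec is a simplified stand-in for the JobSet spec,
// reduced to the two fields relevant to this issue.
type jobSetSpec struct {
	ReplicatedJobs []string `json:"replicatedJobs,omitempty"`
	Suspend        *bool    `json:"suspend,omitempty"`
}

// ssaBody mimics what an SSA ApplyConfiguration serializes: the full
// desired state, so the webhook re-validates replicatedJobs on every apply.
func ssaBody() string {
	t := true
	full := jobSetSpec{ReplicatedJobs: []string{"node"}, Suspend: &t}
	b, _ := json.Marshal(full)
	return string(b)
}

// mergePatchBody mimics what client.MergeFrom produces: a diff against
// the old object, containing only the field that actually changed.
func mergePatchBody() string {
	t := true
	b, _ := json.Marshal(jobSetSpec{Suspend: &t})
	return string(b)
}

func main() {
	fmt.Println("SSA apply body:  ", ssaBody())
	fmt.Println("merge patch body:", mergePatchBody())
}
```

Since `replicatedJobs` never appears in the merge patch body, the webhook has nothing to flag as an immutable-field change.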

Proposed Solution

When only the suspend field is changing, use a targeted merge patch (`client.MergeFrom`) instead of SSA:

// If only suspend changed, send a merge patch so the request body
// carries just that field and skips immutable-field validation.
if suspendChanged {
    patch := client.MergeFrom(oldJobSet.DeepCopy())
    oldJobSet.Spec.Suspend = ptr.To(newSuspend)
    if err := j.client.Patch(ctx, oldJobSet, patch); err != nil {
        return nil, fmt.Errorf("failed to patch JobSet suspend field: %w", err)
    }
    return nil, nil
}

This sends only the suspend field, bypassing immutable field validation.

What did you expect to happen?

  • Setting spec.suspend: true on a running TrainJob should suspend the JobSet and terminate pods
  • Setting spec.suspend: false on a suspended TrainJob should resume the JobSet and recreate pods
  • Multiple suspend/resume cycles should work reliably

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a TrainJob with JobSet runtime
  2. Wait for pods to start running
  3. Update TrainJob: kubectl patch trainjob my-job --type=merge -p '{"spec":{"suspend":true}}'
  4. Observe error in controller logs and JobSet remains unsuspended

Environment

  • Kubernetes version: v1.29
  • Trainer version: v2.1
  • JobSet version: v0.9+

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

abhijeet-dhumal avatar Nov 27 '25 09:11 abhijeet-dhumal

@abhijeet-dhumal Can you please assign it to me?

Delta17920 avatar Nov 28 '25 04:11 Delta17920

Hi @Delta17920 You can use the /assign label to assign this issue to yourself 👍 but first I would recommend checking whether it's reliably reproducible.

abhijeet-dhumal avatar Nov 28 '25 07:11 abhijeet-dhumal

Okk

Delta17920 avatar Nov 28 '25 07:11 Delta17920

/assign

JEETDESAI25 avatar Dec 16 '25 23:12 JEETDESAI25

/remove-label lifecycle/needs-triage
/area controller

@abhijeet-dhumal Thanks for creating this! Did you try to modify any other fields in TrainJob while suspending it? We need to update our API to ensure only necessary fields are mutable.

I was able to suspend the following TrainJob in my local Kind cluster.

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: b541fa741e37
spec:
  runtimeRef:
    name: torch-distributed
  suspend: true
  trainer:
    command:
    - bash
    - -c
    - |2-

      read -r -d '' SCRIPT << EOM

      def test_function():
          import time

          for i in range(100):
              print(i)
              time.sleep(1)

      test_function()

      EOM
      printf "%s" "$SCRIPT" > "test-job.py"
      torchrun "test-job.py"

@tenzen-y @astefanutti @kaisoz @mimowo Do we have any Kueue E2Es test that verifies TrainJob preemption and usage of suspend API?

andreyvelich avatar Jan 24 '26 02:01 andreyvelich

cc @JEETDESAI25 Did you also get the same error as @abhijeet-dhumal mentioned above?

andreyvelich avatar Jan 25 '26 19:01 andreyvelich

Yeah, I got the same error and it only shows up when you patch a running TrainJob to suspend it, since that triggers the SSA update path. Creating with suspend already set works fine. @andreyvelich

JEETDESAI25 avatar Jan 25 '26 23:01 JEETDESAI25

@tenzen-y @astefanutti @kaisoz @mimowo Do we have any Kueue E2Es test that verifies TrainJob preemption and usage of suspend API?

Unfortunately, all Kueue E2E tests start from an already suspended TrainJob. However, I just gave it a try and I'm unable to reproduce the issue...

I've cloned the Trainer code and checked out the v2.1.0 tag. Then I ran make test-e2e-setup-cluster to create a local kind cluster with both Kubeflow Trainer and JobSet v0.10.1.

In that cluster I created the test-wh-issue namespace and applied @andreyvelich's TrainJob with suspend: false

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: b541fa741e37
  namespace: test-wh-issue
spec:
  runtimeRef:
    name: torch-distributed
  suspend: false
  trainer:
    command:
    - bash
    - -c
    - |2-

      read -r -d '' SCRIPT << EOM

      def test_function():
          import time

          for i in range(100):
              print(i)
              time.sleep(1)

      test_function()

      EOM
      printf "%s" "$SCRIPT" > "test-job.py"
      torchrun "test-job.py"

This created the corresponding resources

$> kubectl apply -f test-trainjob.yaml -n test-wh-issue
trainjob.trainer.kubeflow.org/b541fa741e37 created

$> kubectl get trainjobs -n test-wh-issue
NAME           STATE   AGE
b541fa741e37           7s

$> kubectl get jobsets -n test-wh-issue
NAME           TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
b541fa741e37                   0                      false       15s

$> kubectl get pods -n test-wh-issue
NAME                          READY   STATUS    RESTARTS   AGE
b541fa741e37-node-0-0-sz7vk   1/1     Running   0          23s

Now if I suspend the TrainJob, the JobSet gets suspended too with no issue

$>  kubectl patch trainjob b541fa741e37 --type=merge -p '{"spec":{"suspend":true}}' -n test-wh-issue
trainjob.trainer.kubeflow.org/b541fa741e37 patched

$> kubectl get trainjobs -n test-wh-issue
NAME           STATE       AGE
b541fa741e37   Suspended   44s

$> kubectl get jobsets -n test-wh-issue
NAME           TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
b541fa741e37                   0                      true        46s

$> kubectl get pods -n test-wh-issue
No resources found in test-wh-issue namespace.

and I get no admission error message from the jobset controller

$> kubectl logs jobset-controller-manager-56f67769cc-wscf9 -n kubeflow-system  | grep "denied the request"

Now if I resume the TrainJob, everything keeps working

# Now THIS is a bug: it says that the trainjob is suspended when it is not
$> k get trainjobs -n test-wh-issue
NAME           STATE       AGE
b541fa741e37   Suspended   4m48s

$> kubectl get trainjob b541fa741e37 -o json -n test-wh-issue | jq '.spec.suspend'
false

$>  kubectl get jobsets -n test-wh-issue
NAME           TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
b541fa741e37                   0                      false       5m10s

$> kubectl get pods -n test-wh-issue
NAME                          READY   STATUS    RESTARTS   AGE
b541fa741e37-node-0-0-rxx7q   1/1     Running   0          37s

@abhijeet-dhumal @JEETDESAI25 what am I missing? 🤷🏻

kaisoz avatar Jan 26 '26 15:01 kaisoz

Yes, I also didn't get any errors when suspending the running TrainJob.

@astefanutti @tenzen-y Do you know if we did any changes for v2.1.0 to fix the suspend error?

andreyvelich avatar Jan 26 '26 15:01 andreyvelich

Not sure, I tested about a month ago with an older version. Since it can't be reproduced on v2.1.0 with JobSet v0.10.1, it might have been fixed upstream in a newer JobSet version. @kaisoz

JEETDESAI25 avatar Jan 26 '26 17:01 JEETDESAI25

@kaisoz @andreyvelich @JEETDESAI25 @Delta17920 Thank you so much, guys 🙌 If this issue is not reliably reproducible anymore, then I'm good with closing it now. Thanks!

abhijeet-dhumal avatar Jan 27 '26 15:01 abhijeet-dhumal

Great, let's close it for now. /close

andreyvelich avatar Jan 27 '26 15:01 andreyvelich

@andreyvelich: Closing this issue.

In response to this:

Great, let's close it for now. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Jan 27 '26 15:01 google-oss-prow[bot]