TrainJob suspend/resume fails with JobSet webhook validation error
### What happened?

TrainJob suspend/resume is broken when using the JobSet runtime. Attempting to suspend or resume a TrainJob fails with a webhook validation error, preventing the suspend state from propagating to the underlying JobSet.

Error message:

```
admission webhook "vjobset.kb.io" denied the request:
spec.replicatedJobs: Invalid value: []v1alpha2.ReplicatedJob(nil): field is immutable
```
### Root Cause
The controller uses Server-Side Apply (SSA) to update the JobSet. SSA sends an ApplyConfiguration containing all fields, including immutable ones like spec.replicatedJobs. The JobSet webhook validates that immutable fields haven't changed, but it can't distinguish "unchanged" from "changed" in ApplyConfigurations, so it rejects the update.
### Proposed Solution

When only the suspend field is changing, use a merge patch (via `client.MergeFrom`) instead of SSA:
```go
// If only the suspend field changed, fall back to a merge patch so the
// JobSet webhook never sees the immutable spec.replicatedJobs field.
if suspendChanged {
	patch := client.MergeFrom(oldJobSet.DeepCopy())
	oldJobSet.Spec.Suspend = ptr.To(newSuspend)
	if err := j.client.Patch(ctx, oldJobSet, patch); err != nil {
		return nil, fmt.Errorf("failed to patch JobSet suspend field: %w", err)
	}
	return nil, nil
}
```
This sends only the suspend field, bypassing immutable field validation.
### What did you expect to happen?

- Setting `spec.suspend: true` on a running TrainJob should suspend the JobSet and terminate its pods
- Setting `spec.suspend: false` on a suspended TrainJob should resume the JobSet and recreate the pods
- Multiple suspend/resume cycles should work reliably
### How can we reproduce it (as minimally and precisely as possible)?

- Create a TrainJob with the JobSet runtime
- Wait for the pods to start running
- Suspend the TrainJob:

  ```shell
  kubectl patch trainjob my-job --type=merge -p '{"spec":{"suspend":true}}'
  ```

- Observe the error in the controller logs; the JobSet remains unsuspended
### Environment

- Kubernetes version: v1.29
- Trainer version: v2.1
- JobSet version: v0.9+
### Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.
@abhijeet-dhumal Can you please assign it to me?
Hi @Delta17920
You can use the `/assign` command to assign this issue to yourself. 👍
But first I would recommend checking whether it's reliably reproducible.
Okk
/assign
/remove-label lifecycle/needs-triage
/area controller
@abhijeet-dhumal Thanks for creating this! Did you try to modify any other fields in TrainJob while suspending it? We need to update our API to ensure only necessary fields are mutable.
I was able to suspend the following TrainJob in my local Kind cluster.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: b541fa741e37
spec:
  runtimeRef:
    name: torch-distributed
  suspend: true
  trainer:
    command:
      - bash
      - -c
      - |2-
        read -r -d '' SCRIPT << EOM
        def test_function():
            import time
            for i in range(100):
                print(i)
                time.sleep(1)
        test_function()
        EOM
        printf "%s" "$SCRIPT" > "test-job.py"
        torchrun "test-job.py"
```
@tenzen-y @astefanutti @kaisoz @mimowo Do we have any Kueue E2Es test that verifies TrainJob preemption and usage of suspend API?
cc @JEETDESAI25 Did you also get the same error as @abhijeet-dhumal mentioned above?
Yeah, I got the same error and it only shows up when you patch a running TrainJob to suspend it, since that triggers the SSA update path. Creating with suspend already set works fine. @andreyvelich
> @tenzen-y @astefanutti @kaisoz @mimowo Do we have any Kueue E2Es test that verifies TrainJob preemption and usage of `suspend` API?
Unfortunately, all Kueue E2E tests start from an already suspended TrainJob. However, I just gave it a try and I'm unable to reproduce the issue...

I've cloned the Trainer code and checked out the v2.1.0 tag. Then I ran `make test-e2e-setup-cluster` to create a local Kind cluster with both Kubeflow Trainer and JobSet v0.10.1.

In that cluster I created the test-wh-issue namespace and applied @andreyvelich's TrainJob with suspend=false:
```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: b541fa741e37
  namespace: test-wh-issue
spec:
  runtimeRef:
    name: torch-distributed
  suspend: false
  trainer:
    command:
      - bash
      - -c
      - |2-
        read -r -d '' SCRIPT << EOM
        def test_function():
            import time
            for i in range(100):
                print(i)
                time.sleep(1)
        test_function()
        EOM
        printf "%s" "$SCRIPT" > "test-job.py"
        torchrun "test-job.py"
```
This created the corresponding resources:

```shell
$> kubectl apply -f test-trainjob.yaml -n test-wh-issue
trainjob.trainer.kubeflow.org/b541fa741e37 created

$> kubectl get trainjobs -n test-wh-issue
NAME           STATE   AGE
b541fa741e37           7s

$> kubectl get jobsets -n test-wh-issue
NAME           TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
b541fa741e37                   0                      false       15s

$> kubectl get pods -n test-wh-issue
NAME                          READY   STATUS    RESTARTS   AGE
b541fa741e37-node-0-0-sz7vk   1/1     Running   0          23s
```
Now if I suspend the TrainJob, the JobSet gets suspended too with no issue:

```shell
$> kubectl patch trainjob b541fa741e37 --type=merge -p '{"spec":{"suspend":true}}' -n test-wh-issue
trainjob.trainer.kubeflow.org/b541fa741e37 patched

$> kubectl get trainjobs -n test-wh-issue
NAME           STATE       AGE
b541fa741e37   Suspended   44s

$> kubectl get jobsets -n test-wh-issue
NAME           TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
b541fa741e37                   0                      true        46s

$> kubectl get pods -n test-wh-issue
No resources found in test-wh-issue namespace.
```

and I get no admission error message from the JobSet controller:

```shell
$> kubectl logs jobset-controller-manager-56f67769cc-wscf9 -n kubeflow-system | grep "denied the request"
```
Now if I resume the TrainJob, everything keeps on working:

```shell
# Now THIS is a bug: it says the TrainJob is suspended when it is not
$> k get trainjobs -n test-wh-issue
NAME           STATE       AGE
b541fa741e37   Suspended   4m48s

$> kubectl get trainjob b541fa741e37 -o json -n test-wh-issue | jq '.spec.suspend'
false

$> kubectl get jobsets -n test-wh-issue
NAME           TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
b541fa741e37                   0                      false       5m10s

$> kubectl get pods -n test-wh-issue
NAME                          READY   STATUS    RESTARTS   AGE
b541fa741e37-node-0-0-rxx7q   1/1     Running   0          37s
```
@abhijeet-dhumal @JEETDESAI25 what am I missing? 🤷🏻
Yes, I also didn't get any errors when suspending the running TrainJob.
@astefanutti @tenzen-y Do you know if we did any changes for v2.1.0 to fix the suspend error?
Not sure, I tested about a month ago with an older version. Since it can't be reproduced on v2.1.0 with JobSet v0.10.1, it might have been fixed upstream in a newer JobSet version. @kaisoz
@kaisoz @andreyvelich @JEETDESAI25 @Delta17920 Thank you so much, guys 🙌 If this issue is not reliably reproducible anymore, then I'm good with closing it now. Thanks!
Great, let's close it for now. /close
@andreyvelich: Closing this issue.
In response to this:

> Great, let's close it for now. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.