Job validation bug occurs in JobSet but not standalone Job: podFailurePolicy's "onPodConditions" field not treated as optional
A while ago we fixed a bug that caused validation to fail when the podFailurePolicy onPodConditions field is unset, even though it is an optional field: https://github.com/kubernetes/kubernetes/pull/120204. That change was cherry-picked into 1.27, and my understanding is that running 1.27.7+ should give me the fix.
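A quick way to confirm the apiserver-side fix is active on a cluster (a sketch; job.yaml is a placeholder for a standalone Job manifest that omits onPodConditions) is a server-side dry-run, which exercises the built-in Job validation without persisting anything:

$ kubectl version                              # check the server is on 1.27.7+
$ kubectl apply --dry-run=server -f job.yaml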
Despite using 1.27.9 for my GKE cluster master version, and v0.28.5 for k8s Go packages in JobSet, I get the following error when onPodConditions is unset:
~/go/src/sigs.k8s.io/jobset$ k apply -f examples/simple/configurable-failure-policy.yaml
The JobSet "configurable-failure-policy" is invalid:
* spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onPodConditions: Invalid value: "null": spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onPodConditions in body must be of type array: "null"
* spec.replicatedJobs[1].template.spec.podFailurePolicy.rules[0].onPodConditions: Invalid value: "null": spec.replicatedJobs[1].template.spec.podFailurePolicy.rules[0].onPodConditions in body must be of type array: "null"
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
Here is the JobSet yaml:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: configurable-failure-policy
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
spec:
  failurePolicy:
    rules:
    - action: FailJobSet
      onJobFailureReasons:
      - PodFailurePolicy
    maxRestarts: 3
  replicatedJobs:
  - name: buggy-job
    replicas: 1 # set to number of node pools
    template:
      spec:
        parallelism: 1 # number of nodes per pool
        completions: 1 # number of nodes per pool
        backoffLimit: 0
        # fail job immediately if job was not killed by SIGTERM (graceful node shutdown for maintenance events)
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: NotIn
              values: [143] # SIGTERM = exit code 143
            onPodConditions: [] # Note this field must be specified as an empty list, even if unused
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: docker.io/library/bash:5
              command: ["bash"]
              args:
              - -c
              - echo "Hello world! I'm going to exit with exit code 1 to simulate a software bug." && sleep 20 && exit 1
  - name: normal-job
    replicas: 1 # set to number of node pools
    template:
      spec:
        parallelism: 1 # number of nodes per pool
        completions: 1 # number of nodes per pool
        backoffLimit: 0
        # fail job immediately if job was not killed by SIGTERM (graceful node shutdown for maintenance events)
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: NotIn
              values: [143] # SIGTERM = exit code 143
            onPodConditions: [] # Note this field must be specified as an empty list, even if unused
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: docker.io/library/bash:5
              command: ["bash"]
              args:
              - -c
              - echo "Hello world! I'm going to exit with SIGTERM (exit code 143) to simulate host maintenance." && sleep 5 && exit 143
Note that before that change, validation failed even when I set onPodConditions: []. Now, if I set onPodConditions: [], validation passes and things work as expected; leaving it unset still fails, as shown above.
To investigate this further, I tried creating a standalone Job with the same podFailurePolicy, which worked:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: pi
        operator: NotIn
        values: [143] # SIGTERM = exit code 143
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
This leads me to believe the issue is specific to JobSet / the versions of the k8s Go packages we are using, so I am opening this bug.
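Another way to narrow it down (a sketch; the CRD name jobsets.jobset.x-k8s.io and the grep context sizes are assumptions): the JobSet CRD's OpenAPI schema is generated from the k8s.io/api Go types pinned in JobSet's go.mod, so it can lag behind the cluster's built-in Job validation. Inspecting the installed CRD shows whether the generated schema is what rejects the unset field:

$ kubectl get crd jobsets.jobset.x-k8s.io -o yaml | grep -n -B 2 -A 6 'onPodConditions:'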
/kind bug
I don't think this is correct.
Despite using 1.27.9 for my GKE cluster master version, and v0.27.7 for k8s Go packages in JobSet, I get the following error when onPodConditions is unset:
The go.mod/go.sum seems to point to 0.28.4 rather than this one.
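For reference, a quick way to check what the repo actually resolves to (a sketch, run from the jobset checkout):

$ go list -m k8s.io/api k8s.io/apimachinery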
You're right, I was looking at https://github.com/kubernetes-sigs/jobset/pull/301 when writing this. I'll update the description accordingly.
@mimowo we still see this problem. It looks fixed when using standalone Jobs, but it still seems present for JobSet.
Bump: this is being addressed by bumping to the fixed k8s API from https://github.com/kubernetes/kubernetes/pull/126046 (1.29.8 and 1.30.4 will include the fix).
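Roughly, the bump looks like this once those patch releases are out (a sketch; the exact module versions and make targets are assumptions based on the usual kubebuilder workflow):

$ go get k8s.io/api@v0.30.4 k8s.io/apimachinery@v0.30.4 && go mod tidy
$ make generate manifests   # regenerate the CRD so the schema picks up the optional onPodConditions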
This has now been fixed since we bumped our k8s dependencies to v1.31.
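With the bumped dependencies and the regenerated CRD installed, the original example should apply cleanly even with the onPodConditions: [] workaround removed (a sketch):

$ kubectl apply -f examples/simple/configurable-failure-policy.yaml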
Confirmed the fix works in #600, which was previously blocked by this bug.
Thank you for checking and the feedback!