
Job validation bug occurs in JobSet but not standalone Job: podFailurePolicy's "onPodConditions" field not treated as optional

Open • danielvegamyhre opened this issue 1 year ago • 5 comments

A while ago we fixed a bug that caused validation to fail when the podFailurePolicy onPodConditions field was unset, even though it is an optional field: https://github.com/kubernetes/kubernetes/pull/120204. That change was cherry-picked into 1.27, and my understanding is that 1.27.7+ should include it.

Despite using 1.27.9 as my GKE cluster master version and v0.28.5 for the k8s Go packages in JobSet, I get the following error when onPodConditions is unset:

~/go/src/sigs.k8s.io/jobset$ k apply -f examples/simple/configurable-failure-policy.yaml 
The JobSet "configurable-failure-policy" is invalid: 
* spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onPodConditions: Invalid value: "null": spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onPodConditions in body must be of type array: "null"
* spec.replicatedJobs[1].template.spec.podFailurePolicy.rules[0].onPodConditions: Invalid value: "null": spec.replicatedJobs[1].template.spec.podFailurePolicy.rules[0].onPodConditions in body must be of type array: "null"
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
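
For reference, the error is hit when a rule leaves onPodConditions unset; the field then surfaces as "null" (as in the message above) instead of being dropped as an optional field. Here is a minimal sketch of the failing fragment, reusing the container name and exit code from the manifest below (which already has the empty-list workaround applied):

podFailurePolicy:
  rules:
  - action: FailJob
    onExitCodes:
      containerName: main
      operator: NotIn
      values: [143] # SIGTERM = exit code 143
    # onPodConditions left unset here: this is what produces the "null" error above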

Here is the JobSet yaml:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: configurable-failure-policy
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
spec:
  failurePolicy:
    rules:
    - action: FailJobSet
      onJobFailureReasons: 
      - PodFailurePolicy
    maxRestarts: 3
  replicatedJobs:
  - name: buggy-job 
    replicas: 1 # set to number of node pools
    template:
      spec:
        parallelism: 1 # number of nodes per pool
        completions: 1 # number of nodes per pool
        backoffLimit: 0
        # fail job immediately if job was not killed by SIGTERM (graceful node shutdown for maintenance events)
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: NotIn
              values: [143] # SIGTERM = exit code 143
            onPodConditions: [] # Note this field must be specified as an empty list, even if unused
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: docker.io/library/bash:5
              command: ["bash"]
              args:
              - -c
              - echo "Hello world! I'm going to exit with exit code 1 to simulate a software bug." && sleep 20 && exit 1
  - name: normal-job
    replicas: 1 # set to number of node pools
    template:
      spec:
        parallelism: 1 # number of nodes per pool
        completions: 1 # number of nodes per pool
        backoffLimit: 0
        # fail job immediately if job was not killed by SIGTERM (graceful node shutdown for maintenance events)
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: NotIn
              values: [143] # SIGTERM = exit code 143
            onPodConditions: [] # Note this field must be specified as an empty list, even if unused
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: docker.io/library/bash:5
              command: ["bash"]
              args:
              - -c
              - echo "Hello world! I'm going to exit with SIGTERM (exit code) to simulate host maintenace." && sleep 5 && exit 143

However, before the change, validation failed even if I set onPodConditions: []. Now, if I set onPodConditions: [], validation passes and things work as expected.
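
In other words, on the current versions the only delta between a failing and a passing rule is the explicit empty list (a minimal sketch mirroring the manifests above):

podFailurePolicy:
  rules:
  - action: FailJob
    onExitCodes:
      containerName: main
      operator: NotIn
      values: [143]
    onPodConditions: [] # explicit empty list passes validation; omitting the field fails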

To investigate this further, I tried creating a standalone Job with the same podFailurePolicy, which worked:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: pi
        operator: NotIn
        values: [143] # SIGTERM = exit code 143
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
 

This leads me to believe the issue is specific to JobSet / the versions of the k8s Go packages we are using, so I am opening this bug.

/kind bug

danielvegamyhre avatar Jan 17 '24 19:01 danielvegamyhre

I don't think this is correct.

Despite using 1.27.9 for my GKE cluster master version, and v0.27.7 for k8s Go packages in JobSet, I get the following error when onPodConditions is unset:

The go.mod/go.sum seem to point to 0.28.4 rather than this one.

kannon92 avatar Jan 17 '24 22:01 kannon92

I don't think this is correct.

Despite using 1.27.9 for my GKE cluster master version, and v0.27.7 for k8s Go packages in JobSet, I get the following error when onPodConditions is unset:

The go.mod/go.sum seem to point to 0.28.4 rather than this one.

You're right, I was looking at https://github.com/kubernetes-sigs/jobset/pull/301 when writing this. I'll update the description accordingly.

danielvegamyhre avatar Jan 17 '24 22:01 danielvegamyhre

@mimowo we still see that this is a problem. It looks fixed when using normal Jobs, but it still seems present for JobSet.

kannon92 avatar Jan 23 '24 12:01 kannon92

Bump: this is being addressed by updating to the fixed k8s API in https://github.com/kubernetes/kubernetes/pull/126046 (1.29.8 and 1.30.4 will include the fix).

mimowo avatar Jul 31 '24 07:07 mimowo

This has now been fixed since we bumped our k8s dependencies to v1.31.

Confirmed the fix works in #600, which was previously blocked by this bug.

danielvegamyhre avatar Oct 05 '24 18:10 danielvegamyhre

Thank you for checking and the feedback!

mimowo avatar Oct 07 '24 06:10 mimowo