
`kapp deploy` fails when job sets `ttlSecondsAfterFinished: 0`

Open nak3 opened this issue 3 years ago • 13 comments

What steps did you take:

Here are simple steps to reproduce the issue.

1. Create job.yaml

cat <<EOF > job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  ttlSecondsAfterFinished: 0
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
EOF

2. Run kapp deploy

$ kapp deploy --yes --app serving --file a.yaml 

What happened:

The kapp command above fails with `jobs.batch "pi" not found`, as follows:

$ kapp deploy --yes --app serving --file a.yaml 
Target cluster 'https://127.0.0.1:45759' (nodes: kind-control-plane, 1+)

Changes

Namespace  Name  Kind  Conds.  Age  Op      Op st.  Wait to    Rs  Ri  
default    pi    Job   -       -    create  -       reconcile  -   -  

Op:      1 create, 0 delete, 0 update, 0 noop
Wait to: 1 reconcile, 0 delete, 0 noop

8:51:56PM: ---- applying 1 changes [0/1 done] ----
8:51:56PM: create job/pi (batch/v1) namespace: default
8:51:56PM: ---- waiting on 1 changes [0/1 done] ----
8:51:56PM: ongoing: reconcile job/pi (batch/v1) namespace: default
8:51:56PM:  ^ Waiting to complete (0 active, 0 failed, 0 succeeded)
8:51:56PM:  L ongoing: waiting on pod/pi-6q462 (v1) namespace: default
8:51:56PM:     ^ Pending
8:51:57PM: ongoing: reconcile job/pi (batch/v1) namespace: default
8:51:57PM:  ^ Waiting to complete (1 active, 0 failed, 0 succeeded)
8:51:57PM:  L ongoing: waiting on pod/pi-6q462 (v1) namespace: default
8:51:57PM:     ^ Pending: ContainerCreating
8:52:00PM: ongoing: reconcile job/pi (batch/v1) namespace: default
8:52:00PM:  ^ Waiting to complete (1 active, 0 failed, 0 succeeded)
8:52:00PM:  L ok: waiting on pod/pi-6q462 (v1) namespace: default
8:52:04PM: error: reconcile job/pi (batch/v1) namespace: default

kapp: Error: waiting on reconcile job/pi (batch/v1) namespace: default:
  Errored:
    Getting resource job/pi (batch/v1) namespace: default: jobs.batch "pi" not found (reason: NotFound)

What did you expect:

  • I know that the root cause is ttlSecondsAfterFinished: 0, which cleans up the job as soon as it finishes.
  • So the issue can be worked around by setting ttlSecondsAfterFinished to a non-zero value.
  • But some released manifests set it to 0, e.g. contour https://github.com/projectcontour/contour/blob/552498a85a294a7080acdb8e20a5b70f67c4fd6b/examples/contour/02-job-certgen.yaml#L42, so I would like kapp to handle it via some option. (Otherwise, we need to modify contour's release YAML.)
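
For illustration, the workaround in the second bullet amounts to a one-line change to the manifest from the reproduction steps. This is only a sketch: the value 100 is an arbitrary assumption, chosen to give kapp time to observe the finished Job before the TTL controller deletes it.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  # Assumed value: any non-zero TTL long enough for kapp to see the Job finish.
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
```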

nak3 avatar Oct 05 '21 11:10 nak3

The version info:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-19T11:52:07Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

$ kapp version
kapp version develop

Succeeded

(I installed kapp by go get -u github.com/k14s/kapp/cmd/kapp)

nak3 avatar Oct 05 '21 12:10 nak3

@nak3 I tried reproducing the issue using the steps you mentioned, but I was not able to. The deployment took a while (7m), but it was successful. Also, I am curious which version of kapp you were using before updating to the latest one.

Version info:

$ kubectl version                                                                                                                                     
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T21:04:39Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

$ kapp version                                                                                                                                         
kapp version 0.41.0

Succeeded

praveenrewar avatar Oct 05 '21 12:10 praveenrewar

@praveenrewar I wonder whether your cluster has TTLAfterFinished enabled https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/ ? It is a feature gate. I tested kapp built from https://github.com/vmware-tanzu/carvel-kapp/tree/8e1d1f706da29d9f31e003dcf7a7a413f54de75e.

nak3 avatar Oct 05 '21 12:10 nak3

I wonder whether your cluster has TTLAfterFinished enabled https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/ ?

That could be the case, but the doc says that it is enabled by default. I will check this.

Your explanation seems correct to me: the job is cleaned up instantly because the TTL is 0. But then I am really curious about how it was working before.

praveenrewar avatar Oct 05 '21 12:10 praveenrewar

That could be the case, but the doc says that it is enabled by default. I will check this.

Update: I can't find the job in the app that I had created, so it seems that it was removed.

praveenrewar avatar Oct 05 '21 12:10 praveenrewar

Note: it reproduces very consistently in our CI environment (GKE 1.21).


dprotaso avatar Oct 05 '21 13:10 dprotaso

Based on the discussion on Slack: the tests were working before on an earlier version of k8s, where the TTL feature wasn't enabled. Starting with 1.21, it's enabled by default, and the tests started failing.
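
For anyone trying to reproduce this on a cluster older than 1.21, the TTL controller would have to be switched on explicitly through the TTLAfterFinished feature gate. A minimal sketch of a kind cluster config that does this, assuming kind's v1alpha4 config format and a node image where the gate still exists (on 1.21, as reported above, it is already on by default, so this is not needed there):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# Assumption: only relevant on Kubernetes versions where the gate is off by default.
featureGates:
  TTLAfterFinished: true
```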

praveenrewar avatar Oct 05 '21 13:10 praveenrewar

We are seeking input from the community on the behaviour you expect from kapp for ttlSecondsAfterFinished: 0. Currently, kapp waits for the Job to reconcile, but regardless of whether the Job succeeds or fails, it is immediately deleted by the TTL controller, and hence kapp throws a NotFound error. These are the scenarios on which we need your input:


  • Job deployed successfully: should kapp consider the operation as succeeded?
  • Job deployment failed: should kapp consider the operation a failure and terminate immediately, or continue with the rest of the resources?

cc: @nak3

sethiyash avatar Oct 22 '21 06:10 sethiyash

Job deployed successfully: should kapp consider the operation as succeeded?

Yes, I think so.

Job deployment failed: should kapp consider the operation a failure and terminate immediately, or continue with the rest of the resources?

I would like kapp to consider the operation a failure and terminate immediately. It might be better to add an option to ignore the failure and continue with the rest of the resources, but I personally don't need it so far.

nak3 avatar Oct 22 '21 07:10 nak3

@nak3 One challenge here is that the Job is (or will be) gone before we can take a look at it. That would make the failure behaviour non-deterministic.

cppforlife avatar Oct 22 '21 17:10 cppforlife

I think what makes sense is that when ttlSecondsAfterFinished is set to 0, kapp should take over ownership of the object's lifecycle.

That is: remove the 0 TTL from the spec when applying, wait for the job to succeed or fail, capture logs or whatever else is needed, and then delete it. (I'm unsure whether you can set the TTL after the job has finished.)

dprotaso avatar Oct 26 '21 14:10 dprotaso

After discussing this among ourselves, we are uncomfortable changing the default behaviour when ttlSecondsAfterFinished is set to 0, as this can lead to other issues in the long run. These are the workarounds we suggest for handling this in kapp:

  • Modify the YAML given to kapp using ytt.
  • Add a Config option to opt in to this special behaviour in kapp.
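
As a rough sketch of the first option, a ytt overlay along these lines could strip the TTL from every Job before the YAML reaches kapp. The file name overlay.yaml and the choice to remove the field (rather than set it to a non-zero value) are assumptions made for illustration, not something decided in this thread.

```yaml
#@ load("@ytt:overlay", "overlay")

#! Match every batch/v1 Job in the input; fail if there are none.
#@overlay/match by=overlay.subset({"apiVersion": "batch/v1", "kind": "Job"}), expects="1+"
---
spec:
  #! Drop the field if present, so the TTL controller leaves the Job alone.
  #@overlay/match missing_ok=True
  #@overlay/remove
  ttlSecondsAfterFinished:
```

It could then be applied with something like `ytt -f job.yaml -f overlay.yaml | kapp deploy --yes --app serving --file -`. Note that with the field removed, the finished Job stays in the cluster until it is deleted by other means.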

sethiyash avatar Oct 27 '21 16:10 sethiyash

We currently use the first option (modifying the YAML given to kapp using ytt).

The second option (adding Config to opt in to this special behaviour in kapp) would be nice.

dprotaso avatar Oct 27 '21 16:10 dprotaso