kapp
`kapp deploy` fails when job sets `ttlSecondsAfterFinished: 0`
What steps did you take:
Here are the simple steps to reproduce the issue.
1. Create job.yaml
cat <<EOF > job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  ttlSecondsAfterFinished: 0
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
EOF
2. Run kapp deploy
$ kapp deploy --yes --app serving --file job.yaml
What happened:
The kapp command above fails with `jobs.batch "pi" not found`, as follows:
$ kapp deploy --yes --app serving --file job.yaml
Target cluster 'https://127.0.0.1:45759' (nodes: kind-control-plane, 1+)
Changes
Namespace Name Kind Conds. Age Op Op st. Wait to Rs Ri
default pi Job - - create - reconcile - -
Op: 1 create, 0 delete, 0 update, 0 noop
Wait to: 1 reconcile, 0 delete, 0 noop
8:51:56PM: ---- applying 1 changes [0/1 done] ----
8:51:56PM: create job/pi (batch/v1) namespace: default
8:51:56PM: ---- waiting on 1 changes [0/1 done] ----
8:51:56PM: ongoing: reconcile job/pi (batch/v1) namespace: default
8:51:56PM: ^ Waiting to complete (0 active, 0 failed, 0 succeeded)
8:51:56PM: L ongoing: waiting on pod/pi-6q462 (v1) namespace: default
8:51:56PM: ^ Pending
8:51:57PM: ongoing: reconcile job/pi (batch/v1) namespace: default
8:51:57PM: ^ Waiting to complete (1 active, 0 failed, 0 succeeded)
8:51:57PM: L ongoing: waiting on pod/pi-6q462 (v1) namespace: default
8:51:57PM: ^ Pending: ContainerCreating
8:52:00PM: ongoing: reconcile job/pi (batch/v1) namespace: default
8:52:00PM: ^ Waiting to complete (1 active, 0 failed, 0 succeeded)
8:52:00PM: L ok: waiting on pod/pi-6q462 (v1) namespace: default
8:52:04PM: error: reconcile job/pi (batch/v1) namespace: default
kapp: Error: waiting on reconcile job/pi (batch/v1) namespace: default:
Errored:
Getting resource job/pi (batch/v1) namespace: default: jobs.batch "pi" not found (reason: NotFound)
What did you expect:
- I know that the root cause is `ttlSecondsAfterFinished: 0`, which cleans up the job.
- So the issue has a workaround: set `ttlSecondsAfterFinished` to a non-zero value.
- But released manifests use `0`, e.g. contour https://github.com/projectcontour/contour/blob/552498a85a294a7080acdb8e20a5b70f67c4fd6b/examples/contour/02-job-certgen.yaml#L42, so I would like `kapp` to handle it via some option. (Otherwise, we need to modify contour's release YAML.)
The version info:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-19T11:52:07Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
$ kapp version
kapp version develop
Succeeded
(I installed kapp with `go get -u github.com/k14s/kapp/cmd/kapp`.)
@nak3 I tried reproducing the issue using the steps you mentioned, but I was not able to. The deployment took a while (7m), but it was successful. Also, I am curious which version of kapp you were using before updating to the latest one.
Version info:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T21:04:39Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
$ kapp version
kapp version 0.41.0
Succeeded
@praveenrewar I wonder whether TTLAfterFinished (https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) is not enabled on your cluster. It is a feature gate. I tested kapp built from https://github.com/vmware-tanzu/carvel-kapp/tree/8e1d1f706da29d9f31e003dcf7a7a413f54de75e.
> I wonder whether TTLAfterFinished (https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) is not enabled on your cluster.
That could be the case, but the doc says that it is enabled by default. I will check this.
Your explanation seems correct to me: the job is cleaned up instantly because the TTL is 0. But then I am really curious how it was working before.
> That could be the case, but the doc says that it is enabled by default. I will check this.
Update: I can't find the job in the app that I had created, so it seems that it was removed.
Note: the failure is very consistent in our CI environment (GKE 1.21).
[Screenshot: Screen Shot 2021-10-05 at 9 13 14 AM]
Based on the discussion on Slack: the tests were passing before on an earlier version of Kubernetes, where the TTL feature wasn't enabled. Starting with 1.21, it is enabled by default, and the tests started failing.
Here we are seeking input from the community on the behaviour you expect from `kapp` for `ttlSecondsAfterFinished: 0`. Currently, `kapp` waits for the `Job` to reconcile, but regardless of whether the `Job` succeeds or fails, it is immediately deleted by the TTL controller, and hence `kapp` throws a NotFound error. These are the scenarios on which we need your input:
- Job deployed successfully: should kapp consider the operation as succeeded?
- Job deployment failed: should kapp consider the operation a failure and terminate immediately, or continue with the rest of the resources?
cc: @nak3
> Job deployed successfully: should kapp consider the operation as succeeded?
Yes, I think so.
> Job deployment failed: should kapp consider the operation a failure and terminate immediately, or continue with the rest of the resources?
I would like kapp to consider the operation a failure and terminate immediately. It might be better to add an option to ignore the failure and continue with the rest of the resources, but I personally don't need it so far.
@nak3 One challenge here is that the Job is (or will be) gone before we can take a look at it. That would make the failure behaviour non-deterministic.
I think what makes sense to me is that when `ttlSecondsAfterFinished` is set to `0`, kapp should take over ownership of the object's lifecycle. Thus: remove the `0` TTL from the spec when applying, wait for the job to succeed or fail, capture logs or whatever else is needed, and then delete it. (I'm unsure whether you can set the TTL after the job is finished.)
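To make the proposed lifecycle concrete, here is a rough manual equivalent using plain `kubectl`. It is only a sketch of the idea, not how kapp works or would implement it. The function name `run_job_once`, the log file name, the 2-second polling interval and the attempt limit are all illustrative assumptions, and the manifest passed in is assumed to already have `ttlSecondsAfterFinished` removed.

```shell
# Hypothetical helper, not part of kapp: apply a Job manifest that has had
# ttlSecondsAfterFinished removed, wait for it to finish, save its logs,
# then delete the Job ourselves.
run_job_once() {
  manifest="$1"        # path to a Job manifest WITHOUT ttlSecondsAfterFinished
  name="$2"            # Job name, e.g. pi
  ns="${3:-default}"   # namespace (defaults to "default")

  kubectl apply -n "$ns" -f "$manifest" || return 1

  # Poll until the Job reports a Complete or Failed condition,
  # giving up after 150 attempts (an arbitrary limit for this sketch).
  result=""
  attempts=0
  while [ -z "$result" ] && [ "$attempts" -lt 150 ]; do
    for cond in Complete Failed; do
      status=$(kubectl get job "$name" -n "$ns" \
        -o jsonpath="{.status.conditions[?(@.type==\"$cond\")].status}")
      if [ "$status" = "True" ]; then
        result="$cond"
      fi
    done
    if [ -z "$result" ]; then
      sleep 2
      attempts=$((attempts + 1))
    fi
  done

  # The Job still exists at this point, so its logs can be captured
  # before we delete it explicitly.
  kubectl logs -n "$ns" "job/$name" > "$name.log" 2>&1
  kubectl delete job "$name" -n "$ns"

  # Succeed only if the Job completed.
  [ "$result" = "Complete" ]
}
```

Calling `run_job_once job-no-ttl.yaml pi default` would then return a deterministic success or failure, because the Job is only deleted after its final state has been read.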
After discussing this among ourselves, we are uncomfortable changing the default behaviour when `ttlSecondsAfterFinished` is set to `0`, as this can lead to other issues in the long run.
These are the workarounds we suggest for handling this in kapp:
- Modify the YAML given to kapp using ytt. (We do this currently.)
- Add a Config option to opt kapp in to this special behaviour. (This option would be nice.)
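As an illustration of the ytt approach, the following sketch writes an overlay that strips `ttlSecondsAfterFinished` from every `batch/v1` Job before the manifests reach kapp. The file name `remove-ttl-overlay.yaml` is an assumption for this example, and the overlay is a minimal sketch rather than the exact one used by the kapp team.

```shell
# Write a ytt overlay that removes ttlSecondsAfterFinished from all Jobs.
# The heredoc delimiter is quoted so the shell leaves the content untouched.
cat <<'EOF' > remove-ttl-overlay.yaml
#@ load("@ytt:overlay", "overlay")

#@overlay/match by=overlay.subset({"apiVersion": "batch/v1", "kind": "Job"}), expects="1+"
---
spec:
  #! missing_ok lets the overlay pass for Jobs that do not set the field.
  #@overlay/match missing_ok=True
  #@overlay/remove
  ttlSecondsAfterFinished:
EOF

# Then render the manifests with the overlay and pipe the result to kapp:
#   ytt -f job.yaml -f remove-ttl-overlay.yaml | kapp deploy --yes --app serving --file -
```

To keep automatic cleanup instead of removing it, the overlay could set `ttlSecondsAfterFinished` to a positive value rather than using `#@overlay/remove`.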