
kubectl wait hangs even after job is deleted

Open bj8sk opened this issue 10 months ago • 5 comments

What happened: Started a job and want to wait for its end result, either Complete or Failed, but waiting for Failed hangs even after the job is deleted by Kubernetes. Following the suggestion in the SO post https://stackoverflow.com/a/60286538, I start two kubectl waits. The job has ttlSecondsAfterFinished: 300 and backoffLimit: 0.

The wait for the Complete condition works and returns, but even though the job is deleted after about 5 minutes, the wait for the Failed condition still hangs on for the full 30 minutes: kubectl wait job/my-job --for=condition=Failed --timeout=1800s
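The "two parallel waits" pattern from the linked SO answer boils down to racing both conditions and acting on whichever finishes first. A minimal, portable sketch of that pattern, with `sleep` commands standing in for the two `kubectl wait` calls so the snippet runs without a cluster (the sleeps and the temp-file signalling are illustration only, not part of the real workflow):

```shell
#!/bin/sh
# Stand-ins for the two waits; in the real workflow these would be:
#   kubectl wait job/my-job --for=condition=Complete --timeout=1800s
#   kubectl wait job/my-job --for=condition=Failed   --timeout=1800s
tmp=$(mktemp)
(sleep 1;  echo "Complete" >"$tmp") &
p1=$!
(sleep 30; echo "Failed" >"$tmp") &
p2=$!

# Poll until the first "wait" reports its condition.
while [ ! -s "$tmp" ]; do sleep 1; done

first=$(cat "$tmp")
echo "first condition: $first"

# Cancel the losing wait so it doesn't hang around for its full timeout.
kill "$p1" "$p2" 2>/dev/null || true
rm -f "$tmp"
```

The issue reported here is exactly the losing half of this race: the Failed wait has to be cancelled externally, because it never returns on its own once the job completes and is garbage-collected.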

What you expected to happen: If the job has completed, the wait for Failed should not keep running; it should return an error code to indicate that waiting for the Failed state did not succeed.

How to reproduce it (as minimally and precisely as possible):

apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  • Apply the Job above.
  • Wait for the Failed condition; it keeps waiting even after the job is deleted: kubectl wait job/pi-with-ttl --for=condition=Failed --timeout=300s
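As a workaround, a single polling loop that checks for either terminal condition avoids racing two kubectl wait processes. A sketch with `get_conditions` stubbed so the snippet runs without a cluster; against a real cluster it could be something like `kubectl get job/pi-with-ttl -o jsonpath='{.status.conditions[?(@.status=="True")].type}'` (the exact jsonpath filter is an assumption about the job's condition layout):

```shell
#!/bin/sh
# Stub standing in for a kubectl query that prints the job's True conditions
# (e.g. "Complete", "Failed", or nothing while the job is still running).
get_conditions() { echo "Failed"; }

while true; do
  case "$(get_conditions)" in
    *Complete*) echo "job complete"; break ;;
    *Failed*)   echo "job failed";   break ;;
    *)          sleep 5 ;;
  esac
done
```

Unlike the two-wait approach, a loop like this also notices when the job object disappears entirely (the stub would print nothing and the loop could bail out after a retry limit).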

Environment:

  • Kubernetes client and server versions (use kubectl version): 1.26.2 client, 1.25.6 server
  • Cloud provider or hardware configuration: Azure AKS

bj8sk avatar Sep 11 '23 07:09 bj8sk

Thanks for raising this issue. @bj8sk, is it possible to try the steps above with the 1.27 client version? We made some improvements to the wait command in 1.27.

ardaguclu avatar Sep 11 '23 07:09 ardaguclu

Thank you, I tried that but got the same result. I enabled trace-level logging and can see that after a few seconds we get the JSON response with the Completed state, but the client still waits, re-issuing a request like this every five minutes or so (when I set --timeout=1800): https://server:443/apis/batch/v1/namespaces/my-namespace/jobs?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmy-job&resourceVersion=190601200&timeoutSeconds=552&watch=true
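For readability, splitting that traced URL's query string into one parameter per line makes the watch setup easier to see: the API server closes each watch after the requested timeoutSeconds, and the client re-establishes it, which matches the roughly-five-minute cadence observed above. A quick decode (plain POSIX shell; %3D is the URL-encoded '='):

```shell
#!/bin/sh
# The traced watch URL from above.
url='https://server:443/apis/batch/v1/namespaces/my-namespace/jobs?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmy-job&resourceVersion=190601200&timeoutSeconds=552&watch=true'

# Strip everything up to '?', split on '&', and decode %3D.
decoded=$(printf '%s\n' "${url#*\?}" | tr '&' '\n' | sed 's/%3D/=/g')
printf '%s\n' "$decoded"
```

This prints, among others, `fieldSelector=metadata.name=my-job` and `timeoutSeconds=552`, confirming the watch is scoped to the single (already deleted) job.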

bj8sk avatar Sep 11 '23 09:09 bj8sk

/triage accepted

brianpursley avatar Sep 13 '23 16:09 brianpursley

/assign @sreeram-venkitesh

brianpursley avatar Sep 13 '23 16:09 brianpursley

@bj8sk I tried reproducing the issue in the following manner and kubectl wait didn't hang for me. My kubectl client version is 1.29. Can you try reproducing the issue with the latest version and check if the issue still happens? I had initially tried using your pi-with-ttl Job, in which case the Job was getting Completed instead of Failed. Please let me know if the issue persists. Here are the details of how I tried reproducing the issue.

kubectl version

❯ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.4-eks-8cb36c9

The YAML for the Job I used to meet the Failed condition

apiVersion: batch/v1
kind: Job
metadata:
  name: sreerams-failing-job
  namespace: sreeram-dev
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: fail-container
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["exit 1"]
      restartPolicy: Never
  backoffLimit: 0

Running kubectl get to check on the Job and the Pod:

❯ k get jobs -n sreeram-dev
NAME                   COMPLETIONS   DURATION   AGE
sreerams-failing-job   0/1           81s        81s

❯ k get pods -n sreeram-dev
NAME                         READY   STATUS   RESTARTS   AGE
sreerams-failing-job-wj57m   0/1     Error    0          87s

Here's what I used to wait for the job's failure

❯ k wait --for=condition=Failed job/sreerams-failing-job -n sreeram-dev --timeout=300s
job.batch/sreerams-failing-job condition met

sreeram-venkitesh avatar Jan 03 '24 15:01 sreeram-venkitesh