k8s-wait-for

Wait for job does not work as expected

Open fdutton opened this issue 3 years ago • 6 comments

I was expecting this app to wait until a job completed successfully, but it only waited for the job to be ready. Am I misunderstanding something?

This is a portion of my deployment resource and I have verified that my job runs to completion and exits with a status code of 0.

initContainers:
  - name: data-migration-init
    image: 'groundnuty/k8s-wait-for:v1.7'
    args:
      - job
      - my-data-migration-job
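
For reference, one hedged way to double-check that the job really finished, reusing the job name from the snippet above: a completed Job carries a Complete condition and a non-zero .status.succeeded count.

# Prints the number of successfully completed pods (expected: 1 here)
kubectl get job my-data-migration-job -o jsonpath='{.status.succeeded}'

# Blocks until the Job reports the Complete condition, or times out
kubectl wait --for=condition=complete job/my-data-migration-job --timeout=60s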

fdutton avatar Sep 27 '22 19:09 fdutton

I most definitely use it to wait for a job to be completed. For example:

        - name: wait-for-onezone
          image: {{ .Values.wait_for.image }}
          imagePullPolicy: {{ template "imagePullPolicy" dict "root" . "context" .Values.wait_for }}
          args:
            - "job"
            - "{{ template "onezone_name" . }}-ready-check"

Please try image groundnuty/k8s-wait-for:v1.5.1. I have not upgraded my production envs to the newest image. Maybe some bug got into it...

groundnuty avatar Sep 27 '22 19:09 groundnuty

Will do. Thanks for the quick response.

fdutton avatar Sep 27 '22 20:09 fdutton

Version 1.5.1 works as expected.

I'm not in production yet so I'm willing to help isolate the issue. I'll try a 1.6 version tomorrow and let you know the results.

fdutton avatar Sep 27 '22 22:09 fdutton

Had a hunch and it was right. Here's a diff of the kubectl describe job <> output between kubectl v1.24.0 and v1.25.2:

< Start Time:     Wed, 21 Sep 2022 11:03:23 +0200
< Pods Statuses:  1 Active / 0 Succeeded / 0 Failed
---
> Start Time:     Wed, 21 Sep 2022 09:03:23 +0000
> Pods Statuses:  1 Running / 0 Succeeded / 0 Failed

They changed Running to Active... not sure how it could break the code yet, since it uses regexps that should be OK with that...
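
Just to illustrate the kind of thing that could go wrong, here is a hypothetical sketch (not the actual wait_for.sh code) of how a pattern written against one wording can silently fail on the other; the two sample lines are the variants from the diff above:

#!/bin/sh
# Hypothetical example only, not the real wait_for.sh logic.
running_style='Pods Statuses:  1 Running / 0 Succeeded / 0 Failed'
active_style='Pods Statuses:  1 Active / 0 Succeeded / 0 Failed'

for line in "$running_style" "$active_style"; do
  # A pattern anchored to the word "Running" matches nothing on the "Active"
  # wording, so the pod count comes back empty and a naive
  # "zero or empty means done" check would stop waiting immediately.
  n=$(printf '%s\n' "$line" | sed -n 's/.*: *\([0-9]*\) Running.*/\1/p')
  echo "pods still running (parsed): '${n:-<empty>}'"
done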

groundnuty avatar Sep 28 '22 03:09 groundnuty

Version 1.6 does not work.

I diff'd wait_for.sh and don't see anything that would change its behavior.

v1.5.1 uses kubectl 1.21.0 and v1.6 uses kubectl 1.24.0, so the change is probably there.
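
If it helps to confirm which client each image ships, one can ask the bundled kubectl directly (this assumes the binary is named kubectl and is on PATH inside the image, which wait_for.sh itself relies on):

docker run --rm --entrypoint kubectl groundnuty/k8s-wait-for:v1.5.1 version --client
docker run --rm --entrypoint kubectl groundnuty/k8s-wait-for:v1.6 version --client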

fdutton avatar Sep 28 '22 12:09 fdutton

noroot-v1.7 running on K8s 1.25 has the same issue and doesn't wait for the job to be successful.

Switched to v1.5.1 and it works as expected. Would be nice to be running the noroot version :)

stephenpope avatar Oct 14 '22 16:10 stephenpope

Got hit by this as well; switched to v1.5.1 and it works as expected now.

anleib avatar Nov 04 '22 15:11 anleib

Also got hit by this in v1.7. Is someone working on a fix?

DARB-CCM-S-20 avatar Nov 07 '22 10:11 DARB-CCM-S-20

I found the problem. The regexp was indeed not working after k8s changed this:

Pods Statuses:    0 Running / 1 Succeeded / 0 Failed
Pods Statuses:    1 Active (0 Ready) / 0 Succeeded / 0 Failed

The change is connected with the JobReadyPods feature gate, which, as far as I can tell, was introduced in k8s v1.23. It adds Ready info to JobStatus.

As far as I understand, Ready should always be <= Active, as Active still counts pods that are scheduled but not yet Succeeded/Failed, and Ready just gives extra info on which of them are actually running right now.

Furthermore, it seems that v1.7 should work with k8s clusters < v1.23.
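
For anyone who wants to look at those fields directly rather than parse kubectl describe output, a rough sketch (job name reused from the original report; .status.ready is only populated on clusters where JobReadyPods is available):

#!/bin/sh
# Illustrative only: read the structured JobStatus fields.
job=my-data-migration-job
active=$(kubectl get job "$job" -o jsonpath='{.status.active}')
ready=$(kubectl get job "$job" -o jsonpath='{.status.ready}')   # empty without JobReadyPods
succeeded=$(kubectl get job "$job" -o jsonpath='{.status.succeeded}')
echo "active=${active:-0} ready=${ready:-0} succeeded=${succeeded:-0}"
# Ready <= Active: Active counts scheduled-but-unfinished pods,
# Ready only those that are actually up right now.
[ "${succeeded:-0}" -ge 1 ] && echo "job completed"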

@fdutton, @anleib, @DARB-CCM-S-20, @stephenpope could you possibly share which k8s version you experienced your problems on, so that we can be sure my conclusions here are correct?

groundnuty avatar Nov 14 '22 10:11 groundnuty

@groundnuty Great work! 1.24 for me. I've internalized v1.7 for now and changed to v1.21, which is working fine.

DARB-CCM-S-20 avatar Nov 14 '22 11:11 DARB-CCM-S-20

I am on 1.24 K8s as well

anleib avatar Nov 22 '22 15:11 anleib

> I am on 1.24 K8s as well

Running v1.24.14 and ended up having to use v1.5.1; newer versions just completed immediately.

one-adam-nolan avatar Jul 05 '23 12:07 one-adam-nolan