
Checking batch job status fails

Open · sokil opened this issue 2 years ago · 6 comments

I have a batch job that performs a one-time, short-running task. A successful deployment looks like this:

2022-06-29T16:00:17Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/deploy: evaluation e9d76b4c-8f4b-68e5-05e3-eee20a82d225 finished successfully job_id=some_nomad_job_name
2022-06-29T16:00:18Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: job has status running job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in pending state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in running state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: all allocations in deployment of job are running job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/deploy: job deployment successful job_id=some_nomad_job_name

Today I got this error:

2022-07-06T14:57:01Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-07-06T14:57:03Z |INFO| levant/deploy: evaluation ffa905f9-e937-e178-2e1a-d2b3d18ed8a8 finished successfully job_id=some_nomad_job_name
2022-07-06T14:57:03Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/job_status_checker: job has status dead job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/deploy: job deployment failed job_id=some_nomad_job_name

In the successful deployment, the time between "levant/job_status_checker: running job status checker for job" and the first status line is 0 seconds. In the failed one it is 4 seconds. During that time my job finished successfully and reached the 'dead' status, but levant treats that as a dead task, so it exits with a non-zero code and fails the CI pipeline.

As far as I can see, levant has some problem communicating with nomad, and it takes too long to get the job status. Is it possible to disable the job check? Asynchronous checking of short-lived tasks may fail unexpectedly.
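
For reference, a finished batch job reports the status dead in Nomad regardless of whether its tasks exited successfully; a quick way to see this (a sketch assuming NOMAD_ADDR is exported, the default namespace, no ACL token, and jq installed):

      curl -s "$NOMAD_ADDR/v1/job/some_nomad_job_name" | jq -r .Status
      # prints "dead" once the batch job has finished, for both successful and failed runs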

sokil · Jul 06 '22 15:07

I have the same problem: levant marks the deployment as failed because it checks the job status, which can only be pending, running, or dead. That status cannot tell us whether the container (or anything else) exited successfully or not.
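
The allocation's client status does make that distinction, though; a minimal check (alloc ID taken from the logs above, default namespace assumed):

      nomad alloc status -t '{{ .ClientStatus }}' 124b605d-518e-6292-5cd3-8decc4d033ec
      # "complete" if the batch task exited 0, "failed" otherwise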

DevKhaverko · Nov 19 '22 06:11

hi,

same issue .. I have a one-shot container which creates files and then exits 0 .. but the pipeline is marked as failed:

2023-01-18T14:55:03Z |INFO| levant/job_status_checker: task django-collectstatic in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in dead state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |INFO| levant/job_status_checker: task django in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in running state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |INFO| levant/job_status_checker: task nginx in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in running state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |ERRO| levant/deploy: job deployment failed job_id=backoffice_gunicorn
Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit code 1

cu denny

linuxmail · Jan 18 '23 15:01

You can check the status of the allocation via the CLI. That works as a workaround until this is fixed.

DevKhaverko · Jan 18 '23 15:01

> You can check the status of the allocation via the CLI. That works as a workaround until this is fixed.

Via Levant or via the Nomad CLI? Can you give me an example? It sounds to me like I would then need to add an exit 0 and check the state in a separate task.

linuxmail · Jan 18 '23 18:01

      # List the allocation IDs for the job ("ns_name" and "job_name" are placeholders).
      IDs=($(nomad job allocs -namespace "ns_name" -t '{{ range . }}{{ .ID }} {{ end }}' "job_name"))
      # Take the first allocation in the list.
      lastID="${IDs[0]}"
      # Read the allocation's client status; "complete" means the batch task exited 0.
      status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")
      if [[ "$status" != "complete" ]]; then
         echo "Job failed, check the error in the logs: $NOMAD_ADDR/ui/allocations/$lastID/job_name-task/logs"
         exit 1
      else
         echo "Job successfully finished"
      fi

DevKhaverko · Jan 19 '23 06:01

Also, I missed the case where the job is still running; just add a while loop before checking for the "complete" status.
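
A minimal sketch of that wait loop, reusing ns_name and lastID from the snippet above (the 5-second poll interval is an assumption):

      # Poll the allocation until it leaves the pending/running states,
      # then fall through to the "complete" check above.
      status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")
      while [[ "$status" == "pending" || "$status" == "running" ]]; do
         sleep 5
         status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")
      done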

DevKhaverko · Feb 05 '23 08:02