swarm-cronjob Add retry system

Open ggtools opened this issue 4 years ago • 1 comment

Behaviour

I noticed today that my nightly job didn't run. The task was rejected with the following message:

 Rejected 5 hours ago    "No such image: whatever-image:latest@sha256:[...]"

Steps to reproduce this issue

Hard to reproduce reliably. The cause might be a network issue, a Docker bug, etc.

Expected behaviour

As the job didn't run correctly, it should be retried. swarm-cronjob should probably keep track of failed runs and retry a couple of times before giving up.

Actual behaviour

The job is not restarted until the next scheduled slot.

Configuration

Target Docker version (the host/cluster you manage) : 19.03.4
Platform (windows/linux) : Linux
System info (type uname -a) : Linux xxxxx 4.19.75-v7+ #1270 SMP Tue Sep 24 18:45:11 BST 2019 armv7l GNU/Linux
Target Swarm version : 1.6.0

Docker info

Output of command docker info

Logs

swarm-cronjob service logs (set LOG_LEVEL to debug) and cron based service logs if useful

May 15 '20 06:05 ggtools

Hello,

I faced the same issue recently. As a workaround, I tried the on-failure condition of the restart_policy provided by Docker Swarm (see https://docs.docker.com/compose/compose-file/compose-file-v3/#restart_policy).

It seems to work with the following minimal example:

  test-exit1:
    image: alpine:3.12.5
    deploy:
      replicas: 0
      restart_policy:
        condition: on-failure
        max_attempts: 3
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=*/5 * * * *"
        - "swarm.cronjob.skip-running=true"
    entrypoint: /bin/sh -c "echo 'test exit 1' && exit 1"

  test-exit0:
    image: alpine:3.12.5
    deploy:
      replicas: 0
      restart_policy:
        condition: on-failure
        max_attempts: 3
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=*/5 * * * *"
        - "swarm.cronjob.skip-running=true"
    entrypoint: /bin/sh -c "echo 'test exit 0' && exit 0"

Results

tcdmgu79hgfb   swarm-cronjob-jobs_test-exit1.1       alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
a1xypld4onk9    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
u02ret3246sv    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
vv79doiga2ej    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
l5lkanbekc4z   swarm-cronjob-jobs_test-exit0.1       alpine:3.12.5   w1.lab.lan   Shutdown   Complete about a minute ago
znh3msh857qe    \_ swarm-cronjob-jobs_test-exit0.1   alpine:3.12.5   w1.lab.lan   Shutdown   Complete 6 minutes ago
kybup9t116o1    \_ swarm-cronjob-jobs_test-exit0.1   alpine:3.12.5   w1.lab.lan   Shutdown   Complete 11 minutes ago

This shows two things:

When the job exits with 0, it is not restarted.
When the job exits with 1, Swarm restarts it as expected, up to the configured max_attempts. Swarm-Cronjob also still starts the job every 5 minutes, as scheduled.

@ggtools if your job exits 0 on success, I guess using on-failure as a restart condition for the service should work.
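
Applied to a nightly job like the one in the original report, this could look roughly like the sketch below. It is not a tested configuration: the service name, the image, and the 02:00 schedule are placeholders, and delay is an optional restart_policy field that was not part of the test above. Nothing in this thread confirms whether on-failure also covers a task that is rejected before it starts (the "No such image" case), so that part remains an open assumption.

```yaml
# Sketch only: service name, image and schedule are placeholders.
version: "3.8"

services:
  nightly-job:
    image: whatever-image:latest
    deploy:
      replicas: 0                # stay idle; swarm-cronjob starts the task on schedule
      restart_policy:
        condition: on-failure    # restart only when the task exits non-zero
        delay: 30s               # wait between restart attempts (optional)
        max_attempts: 3          # stop restarting after three attempts
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=0 2 * * *"
        - "swarm.cronjob.skip-running=true"
```

As in the minimal example, the job has to exit 0 on success, otherwise Swarm will keep restarting it until max_attempts is reached.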

@crazy-max Do you see any issue with the behavior I showed above? If not, I would suggest updating the documentation to specify that on-failure can be used if the job exits 0 on success. Currently, only the none condition is documented. I can make a PR if you don't have time for that :)

Jun 22 '21 10:06 camo-f
