swarm-cronjob
Add retry system
Behaviour
I noticed today that my nightly job hasn't run with the following message:
Rejected 5 hours ago "No such image: whatever-image:latest@sha256:[...]"
Steps to reproduce this issue
A bit complicated but might be a network issue, a Docker bug, etc.
Expected behaviour
Since the job didn't run correctly, it should be retried. swarm-cronjob should probably keep track of failed runs and retry a couple of times before giving up.
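To make the request concrete, here is a rough sketch of the kind of bounded retry I have in mind. It is only an illustration in Python, not how swarm-cronjob works or should be implemented: run_with_retries, max_retries and the fake job are all hypothetical names, and a real version would have to inspect the Swarm task state instead of calling a function.

```python
from typing import Callable, List


def run_with_retries(job: Callable[[], int], max_retries: int = 2) -> List[int]:
    """Run `job` until it exits 0 or the retry budget is spent.

    Returns the exit code of every attempt, so failed runs are tracked.
    """
    exit_codes = []
    for _ in range(1 + max_retries):  # the first run plus the retries
        code = job()
        exit_codes.append(code)
        if code == 0:
            break  # success: no retry needed
    return exit_codes


# Hypothetical job that fails twice (e.g. an image pull error), then succeeds.
outcomes = iter([1, 1, 0])
history = run_with_retries(lambda: next(outcomes), max_retries=2)
print(history)
```

If the job never succeeds, the loop gives up after 1 + max_retries attempts and the next run happens at the next scheduled slot, as it does today.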
Actual behaviour
Job is not restarted until the next slot
Configuration
- Target Docker version (the host/cluster you manage) : 19.03.4
- Platform (windows/linux) : Linux
- System info (type uname -a) : Linux xxxxx 4.19.75-v7+ #1270 SMP Tue Sep 24 18:45:11 BST 2019 armv7l GNU/Linux
- Target Swarm version : 1.6.0
Docker info
Output of command docker info
Logs
swarm-cronjob service logs (set LOG_LEVEL to debug) and cron based service logs if useful
Hello,
I faced the same issue recently. As a workaround, I tried using the on-failure condition for restart_policy provided by Docker Swarm (see https://docs.docker.com/compose/compose-file/compose-file-v3/#restart_policy).
It seems to work with the following minimal example :
test-exit1:
  image: alpine:3.12.5
  deploy:
    replicas: 0
    restart_policy:
      condition: on-failure
      max_attempts: 3
    labels:
      - "swarm.cronjob.enable=true"
      - "swarm.cronjob.schedule=*/5 * * * *"
      - "swarm.cronjob.skip-running=true"
  entrypoint: /bin/sh -c "echo 'test exit 1' && exit 1"
test-exit0:
  image: alpine:3.12.5
  deploy:
    replicas: 0
    restart_policy:
      condition: on-failure
      max_attempts: 3
    labels:
      - "swarm.cronjob.enable=true"
      - "swarm.cronjob.schedule=*/5 * * * *"
      - "swarm.cronjob.skip-running=true"
  entrypoint: /bin/sh -c "echo 'test exit 0' && exit 0"
Results
tcdmgu79hgfb swarm-cronjob-jobs_test-exit1.1 alpine:3.12.5 w1.lab.lan Shutdown Failed about a minute ago "task: non-zero exit (1)"
a1xypld4onk9 \_ swarm-cronjob-jobs_test-exit1.1 alpine:3.12.5 w1.lab.lan Shutdown Failed about a minute ago "task: non-zero exit (1)"
u02ret3246sv \_ swarm-cronjob-jobs_test-exit1.1 alpine:3.12.5 w1.lab.lan Shutdown Failed about a minute ago "task: non-zero exit (1)"
vv79doiga2ej \_ swarm-cronjob-jobs_test-exit1.1 alpine:3.12.5 w1.lab.lan Shutdown Failed about a minute ago "task: non-zero exit (1)"
l5lkanbekc4z swarm-cronjob-jobs_test-exit0.1 alpine:3.12.5 w1.lab.lan Shutdown Complete about a minute ago
znh3msh857qe \_ swarm-cronjob-jobs_test-exit0.1 alpine:3.12.5 w1.lab.lan Shutdown Complete 6 minutes ago
kybup9t116o1 \_ swarm-cronjob-jobs_test-exit0.1 alpine:3.12.5 w1.lab.lan Shutdown Complete 11 minutes ago
It shows two things:
- When the job exits with 0, it does not restart.
- When the job exits with 1, it restarts as expected, up to the number of max_attempts configured. The job is also restarted every 5 minutes by Swarm-Cronjob.
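To spell out how I read the results above, here is a small Python model of one scheduled run under condition: on-failure. It is a simplification under my own assumptions, not Docker's actual scheduler: simulate_trigger is a hypothetical helper, the job is assumed to always exit with the same code, and options such as delay or window are ignored.

```python
from typing import List


def simulate_trigger(exit_code: int, max_attempts: int) -> List[str]:
    """Task states produced by one scheduled run with condition: on-failure."""
    states = []
    restarts = 0
    while True:
        if exit_code == 0:
            states.append("Complete")  # clean exit: no restart
            break
        states.append("Failed")
        if restarts == max_attempts:
            break  # restart budget spent, wait for the next schedule
        restarts += 1
    return states


failed_run = simulate_trigger(exit_code=1, max_attempts=3)
clean_run = simulate_trigger(exit_code=0, max_attempts=3)
print(len(failed_run), failed_run)
print(len(clean_run), clean_run)
```

With max_attempts: 3 this gives four Failed tasks per trigger (the initial run plus three restarts) and a single Complete task for the exit 0 job, which matches the task list above.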
@ggtools if your job exits 0 on success, I guess using on-failure as a restart condition for the service should work.
@crazy-max Do you see any issue with the behavior I showed above? If not, I would suggest updating the documentation to specify that on-failure can be used if the job exits 0 on success. Currently, only the none condition is documented. I can make a PR if you don't have time for that :)