`tilt ci` does not recreate failed jobs
Expected Behavior
When tilt ci runs and finds that a Job already exists (with a matching pod-template-hash) but has never completed (all of its pods exited with an error), I would expect it to recreate the Job so it runs again.
I'm not sure if this is a bug or a feature request, because I can't find any docs that say it should work this way. Maybe I'm assuming something based on how tilt up behaves?
Current Behavior
Tilt seems to attach to an arbitrary pod that has already terminated (it does not appear to be either the most recent or the oldest pod). The log output is:
```
Attaching to existing pod (db-init-cnbxf). Only new logs will be streamed.
```

Then `tilt ci` exits immediately with an error:

```
Error: Pod "db-init-cnbxf" failed.
```
Steps to Reproduce
- Given these files:

  `script.sh`:

  ```sh
  #!/usr/bin/env sh
  date
  echo some output
  exit 1
  ```

  `Tiltfile`:

  ```python
  load('ext://deployment', 'job_create')

  docker_build('db-init', '.', dockerfile_contents="""
  FROM busybox
  COPY script.sh script.sh
  ENTRYPOINT /script.sh
  """)

  job_create('db-init')
  ```

- Run `tilt ci` once; the output from this job, including the date, is printed.

- Run `tilt ci` again, many times. The pod never runs again (the job controller will occasionally recreate the pod if the spec allows it); `tilt ci` says it's attaching to the terminated pod, then exits.
Context
About Your Use Case
We create environments for CI ahead of time using tilt ci. When one of those runs fails due to a flaky test or an infrastructure problem, we attempt to retry with tilt ci. We've noticed that most of those retries don't work because of this behaviour.
oof, this is tricky. The short version is that this is currently working as designed.
re: "I can't find any docs that say it should work this way" - here's a good doc on tilt's execution model - https://docs.tilt.dev/controlloop. Basically, you can think of it as docker build && kubectl apply && kubectl wait. tilt ci mainly adds exit conditions.
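Spelled out as commands, that mental model is roughly the following (a sketch only: `job.yaml`, the image tag, and the timeout are illustrative, and real Tilt talks to the Kubernetes API rather than shelling out):

```shell
# Rough mental model of `tilt ci` for a single Job resource:
docker build -t db-init .                  # build the image
kubectl apply -f job.yaml                  # no-op if the Job spec is unchanged
kubectl wait --for=condition=complete \
    job/db-init --timeout=300s             # the "exit condition" tilt ci adds
```

The middle step is where the behaviour in this issue comes from: if the apply is a no-op, nothing new ever runs for `wait` to observe.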
The fundamental problem is that if you `kubectl apply` a Job and its spec hasn't changed, then (from Kubernetes' perspective) there's no reason to re-run it. From the apiserver's perspective, the whole contract of apply is that if an object's spec hasn't changed, the system should do nothing.
Tilt inherits this behavior -- if the Job hasn't changed, then the Job shouldn't be re-run.
There have been discussions of this over the years (e.g., https://github.com/kubernetes/kubernetes/issues/77396), but lots of stuff relies on this behavior.
I guess the simple workaround right now is to add something like this to your Tiltfile:

```python
if config.tilt_subcommand == 'ci':
    local('./clean-up-old-jobs.sh')
```
though i agree that's unsatisfying :(
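For reference, clean-up-old-jobs.sh in that snippet is user-supplied; a minimal sketch (using the `db-init` Job name from this issue, which is an assumption about your setup) might be:

```sh
#!/usr/bin/env sh
# clean-up-old-jobs.sh (sketch): delete the leftover Job so the next
# `kubectl apply` issued by tilt ci creates it fresh instead of
# attaching to an already-terminated pod.
set -eu

# --ignore-not-found makes this safe to run against a clean cluster.
kubectl delete job db-init --ignore-not-found
```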