`tilt ci` does not recreate failed jobs
Expected Behavior
When tilt ci runs and finds that a Job already exists (with a matching pod-template-hash) but has never completed (all of its pods exited with an error), I would expect it to recreate the Job so it runs again.
I'm not sure if this is a bug or a feature request, because I can't find any docs that say it should work this way. Maybe I'm assuming something based on how tilt up behaves?
Current Behavior
Tilt seems to attach to an arbitrary pod that has already terminated (it does not appear to be either the most recent or the oldest pod). The log output is:
```
Attaching to existing pod (db-init-cnbxf). Only new logs will be streamed.
```

Then `tilt ci` exits immediately with an error:

```
Error: Pod "db-init-cnbxf" failed.
```
Steps to Reproduce
- Given these files:

  `script.sh`:

  ```sh
  #!/usr/bin/env sh
  date
  echo some output
  exit 1
  ```

  `Tiltfile`:

  ```python
  load('ext://deployment', 'job_create')

  docker_build('db-init', '.', dockerfile_contents="""
  FROM busybox
  COPY script.sh script.sh
  ENTRYPOINT /script.sh
  """)

  job_create('db-init')
  ```

- Run `tilt ci` once; the output from this job, including the date, is printed.

- Run `tilt ci` again, many times. The pod never runs again (the job controller will occasionally recreate the pod if the spec allows it); `tilt ci` says it's attaching to the terminated pod, then exits.
Context
About Your Use Case
We create environments for CI ahead of time using tilt ci. When one of those runs fails due to a flaky test or an infrastructure problem, we attempt to retry with tilt ci. We've noticed that most of those retries don't work because of this behaviour.
oof, this is tricky. The short version is that this is currently working as designed.
re: "I can't find any docs that say it should work this way" - here's a good doc on tilt's execution model - https://docs.tilt.dev/controlloop. Basically, you can think of it as docker build && kubectl apply && kubectl wait. tilt ci mainly adds exit conditions.
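Spelled out as commands, that mental model is roughly the following (a sketch only: `job.yaml`, the image tag, and the timeout are illustrative, and real Tilt talks to the Kubernetes API rather than shelling out):

```shell
# Rough mental model of `tilt ci` for a single Job resource:
docker build -t db-init .                  # build the image
kubectl apply -f job.yaml                  # no-op if the Job spec is unchanged
kubectl wait --for=condition=complete \
    job/db-init --timeout=300s             # the "exit condition" tilt ci adds
```

The middle step is where the behaviour in this issue comes from: if the apply is a no-op, nothing new ever runs for `wait` to observe.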
The fundamental problem is that if you `kubectl apply` a Job and its spec hasn't changed, then (from Kubernetes' perspective) there's no reason to re-run it. From the apiserver's perspective, the whole contract of apply is that if an object's spec hasn't changed, the system should do nothing.
Tilt inherits this behavior -- if the Job hasn't changed, then the Job shouldn't be re-run.
There have been discussions of this over the years (e.g., https://github.com/kubernetes/kubernetes/issues/77396), but lots of stuff relies on this behavior.
I guess the simple workaround right now is to add something like this to your Tiltfile:

```python
if config.tilt_subcommand == 'ci':
    local('./clean-up-old-jobs.sh')
```
though i agree that's unsatisfying :(
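For reference, clean-up-old-jobs.sh in that snippet is user-supplied; a minimal sketch (using the `db-init` Job name from this issue, which is an assumption about your setup) might be:

```sh
#!/usr/bin/env sh
# clean-up-old-jobs.sh (sketch): delete the leftover Job so the next
# `kubectl apply` issued by tilt ci creates it fresh instead of
# attaching to an already-terminated pod.
set -eu

# --ignore-not-found makes this safe to run against a clean cluster.
kubectl delete job db-init --ignore-not-found
```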