tilt ci --output-snapshot-on-exit can race and capture the snapshot before all state is reconciled
Expected Behavior
I expect the snapshot saved by --output-snapshot-on-exit to contain all the relevant logs and status for the failure.
Current Behavior
In roughly 1 in 20 failures, we've noticed that the command fails but the snapshot doesn't reflect that failure. Two concrete examples:
Example 1 - The tilt ci run fails with Error: Custom build "custom-build-cmd" failed: exit status 1. When I open the snapshot, I don't see any tiles on the left marked as failed. If I look through every one, I eventually find the build failure. Its status is:
"runtimeStatus": "pending",
"updateStatus": "in_progress",
The logs do show the failure.
Example 2 - The tilt ci run fails with Error: exceeded grace period: Pod "some-test-gpv77" failed. This time the runtimeStatus is correctly "error", but the logs are incomplete. They aren't truncated by the log buffer: the earlier logs are present, and what's missing is the final lines that contain the error message.
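For anyone trying to spot the Example 1 mismatch without clicking through every tile, here is a minimal sketch of a checker. It assumes the snapshot file is JSON and that each resource entry is an object carrying both the runtimeStatus and updateStatus keys shown above. It does not assume anything about the surrounding layout and just searches recursively. The "name" key used when printing is an assumption and falls back to a placeholder if absent.

```python
import json
import sys


def find_status_entries(node, path="$"):
    """Recursively yield (path, entry) for every object that has both status keys."""
    if isinstance(node, dict):
        if "runtimeStatus" in node and "updateStatus" in node:
            yield path, node
        for key, value in node.items():
            yield from find_status_entries(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_status_entries(value, f"{path}[{index}]")


def unsettled_entries(snapshot):
    """Return entries whose update was still in progress when the snapshot was written."""
    return [
        (path, entry)
        for path, entry in find_status_entries(snapshot)
        if entry["updateStatus"] == "in_progress"
    ]


# Only runs when a snapshot path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        snapshot = json.load(f)
    for path, entry in unsettled_entries(snapshot):
        # "name" is an assumed key; a placeholder is printed when it is missing.
        print(path, entry.get("name", "<unnamed>"),
              entry["runtimeStatus"], entry["updateStatus"])
```

A failed tilt ci run whose snapshot still has an entry reported by this script would be a candidate instance of the race.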
Steps to Reproduce
Other than running a very large number of tilt ci runs on a CI worker, I'm not sure how to reliably reproduce this. I assume it's a race condition where the shutdown happens too early, before all the necessary events have been reconciled.
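The brute-force approach could be scripted roughly like this. It is only a sketch: it assumes tilt is on the PATH, that --output-snapshot-on-exit takes the output path as its argument, and that a non-zero exit code means the run failed. The function name, the snapshot file naming, and the injectable run parameter are all made up for illustration.

```python
import subprocess
from pathlib import Path


def collect_failed_snapshots(max_runs, out_dir, run=subprocess.run):
    """Run tilt ci repeatedly and keep only the snapshots from failed runs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for i in range(max_runs):
        snapshot = out_dir / f"snapshot-{i}.json"
        # Assumption: the flag accepts the snapshot path as its value.
        result = run(["tilt", "ci", "--output-snapshot-on-exit", str(snapshot)])
        if result.returncode != 0:
            failed.append(snapshot)
        else:
            # Successful runs are not interesting here; drop their snapshots.
            snapshot.unlink(missing_ok=True)
    return failed
```

Each snapshot it keeps can then be compared against the error that tilt ci printed for that run.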
Context
Observed on v0.33.21, not sure when it started. I'll be upgrading to the latest version now, but I assume it hasn't changed since.
About Your Use Case
We use tilt ci in CI to run an environment for end-to-end testing.
I'd be happy to submit a patch for this if you can point me in the right direction (specific files or packages to look at). I'm also happy to run a pre-release build to see if we can reproduce the issue with a patch applied.