tilt-extensions
tilt-extensions copied to clipboard
`helm_resource` did not wait for `resource_deps` on `helm_repo`
Hi all, I bumped into a race condition recently in a CI run where helm_resource was supposed to wait for helm_repo to finish but did not. Here are some logs from what happened:
vm-charts │ Initial Build
grafana-char… │
grafana-char… │ Initial Build
grafana-char… │ Running cmd: helm repo add grafana-charts https://grafana.github.io/helm-charts --force-update
vm-charts │ Running cmd: helm repo add vm-charts https://victoriametrics.github.io/helm-charts --force-update
ingress-char… │
ingress-char… │ Initial Build
ingress-char… │ Running cmd: helm repo add ingress-charts https://kubernetes.github.io/ingress-nginx --force-update
vm-charts │ "vm-charts" has been added to your repositories
vmoperator │
vmoperator │ Initial Build
vmoperator │ STEP 1/1 — Deploying
vmoperator │ Running cmd: python3 /home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py
vmoperator │ Running cmd: ['helm', 'upgrade', '--install', 'vmoperator', 'vm-charts/victoria-metrics-operator']
vmoperator │ Release "vmoperator" does not exist. Installing it now.
ingress-char… │ "ingress-charts" has been added to your repositories
grafana │
grafana │ Initial Build
grafana │ STEP 1/1 — Deploying
grafana │ Running cmd: python3 /home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py --values grafana/values.yaml
grafana │ Running cmd: ['helm', 'upgrade', '--install', '--values', 'grafana/values.yaml', 'grafana', 'grafana-charts/grafana']
grafana │ Release "grafana" does not exist. Installing it now.
grafana │ Error: repo grafana-charts not found
grafana │ Traceback (most recent call last):
grafana │ File "/home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py", line 66, in <module>
grafana │ subprocess.check_call(install_cmd, stdout=sys.stderr)
grafana │ File "/usr/lib/python3.10/subprocess.py", line [36](https://github.com/lacework/agent/actions/runs/5081096334/jobs/9137856036#step:7:37)9, in check_call
grafana │ raise CalledProcessError(retcode, cmd)
grafana │ subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--values', 'grafana/values.yaml', 'grafana', 'grafana-charts/grafana']' returned non-zero exit status 1.
grafana │
grafana │ ERROR: Build Failed: apply command exited with status 1
The Tiltfile section that caused that is
# Install Grafana Helm Charts
helm_repo(name='grafana-charts',
url='https://grafana.github.io/helm-charts')
helm_resource(name='grafana',
chart='grafana-charts/grafana',
flags=['--values', './dev/grafana/values.yaml'],
port_forwards=[3000])
watch_file('./dev/grafana/values.yaml')
Here, grafana has a dependency on grafana-charts, but in the logs above, the chart was not found!
I went ahead and did some triaging in the hopes that it would be useful. My suspicion for what happened:
helm_repouses the Tiltfilelocal_resourceAPI (https://github.com/tilt-dev/tilt-extensions/blob/13280048812f95896150b6fcf23462bb501d250d/helm_resource/Tiltfile#L158)- However,
local_resourcereturns "ready" to Tilt only when the process is started (and not when it is finished!), ctx: https://docs.tilt.dev/local_resource.html#readiness_probe. - That means there is a potential race where:
helm_repois called and starts thehelm repo addhelm_resource, which is dutifully waiting onhelm_repoto signal "ready", gets the all-clear from Tilt as the process was startedhelm_resourceattempts to install the charthelm_resourcecrashes when it does not find the charthelm_repofinishes installing the chart, alas too late
I propose that helm_repo's call to local_resource adds a new readiness_probe which uses an exec_action of helm repo list -o json | jq -e 'map(select(.name == <name-of-chart>)) != []' to determine success of chart installation.
Let me know if this all makes sense please!
I think you're misreading the docs.
local_resource readiness checks apply to the serve_cmd, not the build cmd. The helm_repo function uses the build cmd. I didn't see anything in your logs to suggest grafana deployed before grafana-charts finished.
Gotcha, thanks for the clarification -- could there be another reason why this would be flaky / racy?
You say you expected it to wait on resource_deps, but I don't see any resource_deps in the code snippet you posted?
My bad, I pasted the wrong snippet from the Tiltfile 😅 -- the failed CI run happened after I committed code with the resource_deps added:
diff --git a/monitoring/Tiltfile b/monitoring/Tiltfile
index 30cb8f76e4..e0a3696cd7 100644
--- a/monitoring/Tiltfile
+++ b/monitoring/Tiltfile
@@ -56,6 +56,7 @@ helm_repo(name='grafana-charts',
helm_resource(name='grafana',
chart='grafana-charts/grafana',
flags=['--values', './dev/grafana/values.yaml'],
+ resource_deps=['grafana-charts'],
port_forwards=[3000])
watch_file('./dev/grafana/values.yaml')
So the state of the Tiltfile when this happened was actually
# Install Grafana Helm Charts
helm_repo(name='grafana-charts',
url='https://grafana.github.io/helm-charts')
helm_resource(name='grafana',
chart='grafana-charts/grafana',
flags=['--values', './dev/grafana/values.yaml'],
resource_deps=['grafana-charts'],
port_forwards=[3000])
watch_file('./dev/grafana/values.yaml')
Sorry again for the confusion -- any ideas still?
I'm not 100% confident 'helm repos add' is concurrency safe. From your logs, it looks like two repo-adds (repos-add?) ran at the same time, but only one printed, which is a bit suspicious. Maybe we could try removing the allow_parallel param from helm_repo and see if that helps?
I'm not sure how to force the race, but I added the problematic code to the test Tiltfile in helm_resource in the hopes of reproducing. Let me know how that PR looks!