tilt-extensions icon indicating copy to clipboard operation
tilt-extensions copied to clipboard

`helm_resource` did not wait for `resource_deps` on `helm_repo`

Open nschmeller opened this issue 2 years ago • 6 comments

Hi all, I bumped into a race condition recently in a CI run where helm_resource was supposed to wait for helm_repo to finish but did not. Here are some logs from what happened:

vm-charts │ Initial Build
grafana-char… │ 
grafana-char… │ Initial Build
grafana-char… │ Running cmd: helm repo add grafana-charts https://grafana.github.io/helm-charts --force-update
    vm-charts │ Running cmd: helm repo add vm-charts https://victoriametrics.github.io/helm-charts --force-update
ingress-char… │ 
ingress-char… │ Initial Build
ingress-char… │ Running cmd: helm repo add ingress-charts https://kubernetes.github.io/ingress-nginx --force-update
    vm-charts │ "vm-charts" has been added to your repositories
   vmoperator │ 
   vmoperator │ Initial Build
   vmoperator │ STEP 1/1 — Deploying
   vmoperator │      Running cmd: python3 /home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py
   vmoperator │      Running cmd: ['helm', 'upgrade', '--install', 'vmoperator', 'vm-charts/victoria-metrics-operator']
   vmoperator │      Release "vmoperator" does not exist. Installing it now.
ingress-char… │ "ingress-charts" has been added to your repositories
      grafana │ 
      grafana │ Initial Build
      grafana │ STEP 1/1 — Deploying
      grafana │      Running cmd: python3 /home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py --values grafana/values.yaml
      grafana │      Running cmd: ['helm', 'upgrade', '--install', '--values', 'grafana/values.yaml', 'grafana', 'grafana-charts/grafana']
      grafana │      Release "grafana" does not exist. Installing it now.
      grafana │      Error: repo grafana-charts not found
      grafana │      Traceback (most recent call last):
      grafana │        File "/home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py", line 66, in <module>
      grafana │          subprocess.check_call(install_cmd, stdout=sys.stderr)
      grafana │        File "/usr/lib/python3.10/subprocess.py", line [36](https://github.com/lacework/agent/actions/runs/5081096334/jobs/9137856036#step:7:37)9, in check_call
      grafana │          raise CalledProcessError(retcode, cmd)
      grafana │      subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--values', 'grafana/values.yaml', 'grafana', 'grafana-charts/grafana']' returned non-zero exit status 1.
      grafana │ 
      grafana │ ERROR: Build Failed: apply command exited with status 1

The Tiltfile section that caused that is

# Install Grafana Helm Charts
helm_repo(name='grafana-charts',
          url='https://grafana.github.io/helm-charts')
helm_resource(name='grafana',
              chart='grafana-charts/grafana',
              flags=['--values', './dev/grafana/values.yaml'],
              port_forwards=[3000])
watch_file('./dev/grafana/values.yaml')

Here, grafana has a dependency on grafana-charts, but in the logs above, the chart was not found!

I went ahead and did some triaging in the hopes that it would be useful. My suspicion for what happened:

  • helm_repo uses the Tiltfile local_resource API (https://github.com/tilt-dev/tilt-extensions/blob/13280048812f95896150b6fcf23462bb501d250d/helm_resource/Tiltfile#L158)
  • However, local_resource returns "ready" to Tilt only when the process is started (and not when it is finished!), ctx: https://docs.tilt.dev/local_resource.html#readiness_probe.
  • That means there is a potential race where:
    • helm_repo is called and starts the helm repo add
    • helm_resource, which is dutifully waiting on helm_repo to signal "ready", gets the all-clear from Tilt as the process was started
    • helm_resource attempts to install the chart
    • helm_resource crashes when it does not find the chart
    • helm_repo finishes installing the chart, alas too late

I propose that helm_repo's call to local_resource adds a new readiness_probe which uses an exec_action of helm repo list -o json | jq -e 'map(select(.name == <name-of-chart>)) != []' to determine success of chart installation.

Let me know if this all makes sense please!

nschmeller avatar May 25 '23 22:05 nschmeller

I think you're misreading the docs.

local_resource readiness checks apply to the serve_cmd, not the build cmd. The helm_repo function uses the build cmd. I didn't see anything in your logs to suggest grafana deployed before grafana-charts finished.

nicks avatar May 25 '23 22:05 nicks

Gotcha, thanks for the clarification -- could there be another reason why this would be flaky / racy?

nschmeller avatar May 25 '23 22:05 nschmeller

You say you expected it to wait on resource_deps, but I don't see any resource_deps in the code snippet you posted?

nicks avatar May 25 '23 22:05 nicks

My bad, I pasted the wrong snippet from the Tiltfile 😅 -- the failed CI run happened after I committed code with the resource_deps added:

diff --git a/monitoring/Tiltfile b/monitoring/Tiltfile
index 30cb8f76e4..e0a3696cd7 100644
--- a/monitoring/Tiltfile
+++ b/monitoring/Tiltfile
@@ -56,6 +56,7 @@ helm_repo(name='grafana-charts',
 helm_resource(name='grafana',
               chart='grafana-charts/grafana',
               flags=['--values', './dev/grafana/values.yaml'],
+              resource_deps=['grafana-charts'],
               port_forwards=[3000])
 watch_file('./dev/grafana/values.yaml')

So the state of the Tiltfile when this happened was actually

# Install Grafana Helm Charts
helm_repo(name='grafana-charts',
          url='https://grafana.github.io/helm-charts')
helm_resource(name='grafana',
              chart='grafana-charts/grafana',
              flags=['--values', './dev/grafana/values.yaml'],
              resource_deps=['grafana-charts'],
              port_forwards=[3000])
watch_file('./dev/grafana/values.yaml')

Sorry again for the confusion -- any ideas still?

nschmeller avatar May 25 '23 23:05 nschmeller

I'm not 100% confident 'helm repos add' is concurrency safe. From your logs, it looks like two repo-adds (repos-add?) ran at the same time, but only one printed, which is a bit suspicious. Maybe we could try removing the allow_parallel param from helm_repo and see if that helps?

nicks avatar May 25 '23 23:05 nicks

I'm not sure how to force the race, but I added the problematic code to the test Tiltfile in helm_resource in the hopes of reproducing. Let me know how that PR looks!

nschmeller avatar May 25 '23 23:05 nschmeller