kamal deploy hangs indefinitely after "Pull app image" finishes

Open jyc opened this issue 6 months ago • 3 comments

Thanks for making Kamal! I find it really useful. Unfortunately sometimes Kamal hangs after the "Pull app image" step and before the "Ensure kamal-proxy is running" step. I don't have a reliable repro.

2025-06-19T02:33:34.4507434Z ##[group]Run kamal deploy --skip-push --version=hash
2025-06-19T02:33:34.4508122Z kamal deploy --skip-push --version=hash
2025-06-19T02:33:34.4556646Z shell: /usr/bin/bash -e {0}
2025-06-19T02:33:34.4556943Z env:
2025-06-19T02:33:34.4557180Z   ...
2025-06-19T02:33:34.4562015Z ##[endgroup]
2025-06-19T02:33:34.8053377Z Pull app image...
2025-06-19T02:33:37.4534876Z   INFO [182f36de] Running docker login ghcr.io -u [REDACTED] -p [REDACTED] on IP.ADDRESS
2025-06-19T02:33:37.4536908Z   INFO [182f36de] Finished in 2.639 seconds with exit status 0 (successful).
2025-06-19T02:48:37.3961824Z ##[error]The operation was canceled.

Note the 15 minutes after "Finished in 2.639 seconds" and before I cancelled the job manually.

Looking at main.rb, I'm guessing with_lock must be hanging, because I have no pre-deploy hook and I never see the output from say "Ensure kamal-proxy is running...":

https://github.com/basecamp/kamal/blob/3cf510bc8f2b9ff48487ef1e9ee787e149bd7814/lib/kamal/cli/main.rb#L23-L47

Looking at with_lock, it seems like I should be seeing "Acquiring the deploy lock..." output unless it's stuck on ensure_run_directory?

https://github.com/basecamp/kamal/blob/3cf510bc8f2b9ff48487ef1e9ee787e149bd7814/lib/kamal/cli/base.rb#L104-L113

jyc avatar Jun 19 '25 02:06 jyc

Hm. So for a run that didn't hang I actually see more output before "Acquiring the deploy lock":

Run kamal deploy --skip-push --version=hash
Pull app image...
  INFO [e7a736ab] Running docker login ghcr.io -u [REDACTED] -p [REDACTED] on IP.ADDRESS
  INFO [e7a736ab] Finished in 2.643 seconds with exit status 0 (successful).
  INFO [304f9442] Running docker image rm --force ghcr.io/foo/bar:hash on IP.ADDRESS
  INFO [304f9442] Finished in 0.578 seconds with exit status 0 (successful).
  INFO [56180f7a] Running docker pull ghcr.io/foo/bar:hash on IP.ADDRESS
  INFO [56180f7a] Finished in 88.898 seconds with exit status 0 (successful).
  INFO [59ae3542] Running docker inspect -f '{{ .Config.Labels.service }}' ghcr.io/foo/bar:hash | grep -x bar || (echo "Image ghcr.io/foo/bar:hash is missing the 'service' label" && exit 1) on IP.ADDRESS
  INFO [59ae3542] Finished in 0.555 seconds with exit status 0 (successful).
  INFO [5e979212] Running /usr/bin/env mkdir -p .kamal on IP.ADDRESS
  INFO [5e979212] Finished in 0.554 seconds with exit status 0 (successful).
Acquiring the deploy lock...
Ensure kamal-proxy is running...

So I guess it's actually hanging before it outputs Running docker image rm?

jyc avatar Jun 19 '25 03:06 jyc

@jyc - you could try installing rbspy and extracting a stacktrace from the stuck process with sudo rbspy snapshot --pid $PID?

djmb avatar Jun 19 '25 07:06 djmb

Will do, thanks for the idea! I normally run Kamal as part of a GitHub Actions workflow so it'll be a little tricky. But I think what I can do is put kamal deploy and (sleep ... && rbspy) as concurrent jobs in a Bash script so that rbspy runs automatically if the deploy gets stuck. Might be a while until I get a repro but I'll report back ASAP!
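Roughly what I have in mind, as an untested-in-CI sketch. The function name run_with_watchdog, the timeout value, and the way the snapshot command is passed in are all just placeholders I made up; the only real commands are kamal deploy and the rbspy snapshot --pid invocation suggested above. It also assumes the background PID is the Ruby process itself, which may not hold if kamal is launched through a wrapper.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command in the background and, if it is still
# running after TIMEOUT seconds, run SNAPSHOT_CMD with the command's PID
# appended as the last argument.
# Usage: run_with_watchdog TIMEOUT SNAPSHOT_CMD COMMAND [ARGS...]
run_with_watchdog() {
  local timeout="$1" snapshot_cmd="$2"
  shift 2

  "$@" &                       # e.g. kamal deploy --skip-push --version=hash
  local pid=$!

  (
    sleep "$timeout"
    # kill -0 sends no signal; it only checks that the process still exists.
    if kill -0 "$pid" 2>/dev/null; then
      echo "watchdog: PID $pid still running after ${timeout}s" >&2
      # Intentionally unquoted so "sudo rbspy snapshot --pid" splits into words.
      $snapshot_cmd "$pid"
    fi
  ) &
  local watchdog=$!

  # "|| status=$?" keeps a failing command from aborting a "bash -e" script
  # before the watchdog is cleaned up.
  local status=0
  wait "$pid" || status=$?

  # The command finished (or was killed), so the watchdog is no longer needed.
  kill "$watchdog" 2>/dev/null || true
  wait "$watchdog" 2>/dev/null || true
  return "$status"
}
```

In the workflow step I'd then call something like run_with_watchdog 600 "sudo rbspy snapshot --pid" kamal deploy --skip-push --version=hash, where 600 is an arbitrary guess at "longer than a healthy deploy". Note that this only takes the snapshot; the stuck deploy keeps running until the job is cancelled or times out.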

jyc avatar Jun 19 '25 08:06 jyc