Rancher Desktop Intermittently Hangs on Ventura 13.1
Actual Behavior
When running a docker command, it hangs forever. Any subsequent docker commands in other shells hang as well. Rebooting the laptop is required, as Rancher Desktop becomes unusable.
Steps to Reproduce
One dev on an M1 Mac running Ventura 13.1 can reproduce this issue consistently by building a Dockerfile in docker. We, however, are unable to reproduce the same issue consistently on our laptops. One of the team members who can reproduce it is also using an M1 Mac.
Create a Dockerfile
echo -e 'FROM alpine:latest\nRUN echo "hey" > hey.txt' > Dockerfile
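For reference, the resulting Dockerfile should contain just these two lines:
FROM alpine:latest
RUN echo "hey" > hey.txt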
Build Dockerfile in docker
docker run --rm --interactive --pull="always" --user="root" --network="host" --name="repro-hanging-issue" --mount "type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock" -v "$(pwd):$(pwd)" -w "$(pwd)" docker:cli build .
Result
The terminal just hangs.
Expected Behavior
Docker commands should not hang.
Additional Information
Our developers started using Rancher Desktop in November 2022. It was working well, with no hanging issues reported. Once people started updating to Ventura at the beginning of the month (January), they started reporting these issues. We have one developer who is able to consistently reproduce the issue; some of us can only reproduce it intermittently. It seems to be most reproducible on M1 Macs, though. We were also able to reproduce it with our security tools disabled.
We enabled debug logging from the Rancher Desktop Troubleshooting page and looked at all the logs (lima and rancher) and did not see any glaring errors or warnings.
If there is anything else we can provide to help with this, let me know.
Rancher Desktop Version
1.7.0
Rancher Desktop K8s Version
Disabled
Which container engine are you using?
moby (docker cli)
What operating system are you using?
macOS
Operating System / Build Version
Ventura 13.1
What CPU architecture are you using?
arm64 (Apple Silicon)
Linux only: what package format did you use to install Rancher Desktop?
None
Windows User Only
No response
I can't reproduce this on macOS 13.1 on M1 either. I've done a factory reset, rebooted the host, done another factory reset, and the command always worked fine.
I've looked at the logs, and can't spot anything in there either.
On the "reproducible laptop" does this also happen after a factory reset? Or after rebooting the host?
Are there any errors in any of the networking logs at ~/Library/Application Support/rancher-desktop/lima/_networks?
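If it helps, here is a minimal way to skim those logs (the exact file names in that directory vary between setups, so the *.log glob is an assumption):
# list the networking log files, then show the last lines of each
ls ~/Library/Application\ Support/rancher-desktop/lima/_networks/
tail -n 100 ~/Library/Application\ Support/rancher-desktop/lima/_networks/*.log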
I am getting our IT team to send me an M1 Macbook so I can try to reproduce this issue. Another dev reported the same issue this morning. Not sure what they were doing to cause it though.
On the "reproducible laptop" it happens even after a factory reset, reboot, and fresh re-install.
The dev with the reproducible laptop needs to get some work done, so they have uninstalled it for now. ~I am going to get our devs to post here when they get a freezing issue.~ Meanwhile, I will try to get that laptop and reproduce it.
> I am getting our IT team to send me an M1 Macbook so I can try to reproduce this issue. Another dev reported the same issue this morning. Not sure what they were doing to cause it though.
Thank you so much; this will be really helpful, as I've been unable to repro this myself.
Maybe also take a look at any anti-malware technology installed on your machines; maybe that is interfering with the virtualization code?
I have the same problem. I have tried a factory reset, a reinstall, and rebooting everything, but Rancher still hangs.
My colleagues who have the same anti-virus software installed did not have the problem.
Hi, I'm able to reproduce this frequently on my M1 running Monterey 12.6.1 / RD 1.7.0 / k8s 1.25.4 / Traefik disabled. What logs can I provide from ~/Library/Logs/rancher-desktop to help debug this? Currently the RD UI shows Kubernetes is running, but kubectl commands time out with "Unable to connect to the server: net/http: TLS handshake timeout".
Tried quitting Rancher Desktop and restarting a couple of times, but same problem. I could restart the laptop and the problem might go away. I may need to do that to not be blocked with my work, and/or look at minikube (which doesn't have a nice UI). But I'm happy to provide logs and keep the laptop in this reproducible state for the next 24 hours or so if it helps.

Tailed logs from the time it started to the time it stopped working:
1. steve.log
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=RoleBinding"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for apiregistration.k8s.io/v1, Kind=APIService"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for /v1, Kind=Pod"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for apps/v1, Kind=Deployment"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for events.k8s.io/v1, Kind=Event"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for /v1, Kind=PodTemplate"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for apps/v1, Kind=StatefulSet"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for batch/v1, Kind=CronJob"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for acme.cert-manager.io/v1, Kind=Order"
…
….. first sign of trouble ….
….
2023-01-16T19:10:04.881Z: stderr: time="2023-01-16T11:10:04-08:00" level=error msg="Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]"
2023-01-16T19:13:01.329Z: stderr: W0116 11:13:01.327098 46860 reflector.go:443] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0116 11:13:01.327114 46860 reflector.go:443] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
….
…. many of these …..
….
W0116 11:13:01.328829 46860 reflector.go:443] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0116 11:13:01.328880 46860 reflector.go:443] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
….
…. TLS handshake timeouts; roughly after this point kubectl stops working …..
….
2023-01-16T19:13:12.133Z: stderr: W0116 11:13:12.132748 46860 reflector.go:325] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:168: failed to list *summary.SummarizedObject: Get "https://127.0.0.1:6443/apis/cert-manager.io/v1/certificates?resourceVersion=160294": net/http: TLS handshake timeout
W0116 11:13:12.132851 46860 reflector.go:325] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:168: failed to list *summary.SummarizedObject: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?resourceVersion=160231": net/http: TLS handshake timeout
I0116 11:13:12.132905 46860 trace.go:205] Trace[631373749]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:168 (16-Jan-2023 11:13:02.130) (total time: 10002ms):
Trace[631373749]: ---"Objects listed" error:Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?resourceVersion=160231": net/http: TLS handshake timeout 10002ms (11:13:12.132)
Trace[631373749]: [10.002143209s] [10.002143209s] END
2. k3s.log
E0117 04:26:35.226050 4290 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
W0117 04:26:36.046392 4290 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
E0117 04:26:36.046516 4290 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
{"level":"warn","ts":"2023-01-17T04:26:36.183Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x400167d880/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
E0117 04:26:36.183408 4290 controller.go:187] failed to update lease, error: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/lima-rancher-desktop?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0117 04:26:36.183651 4290 writers.go:118] apiserver was unable to write a JSON response: http: Handler timeout
E0117 04:26:36.185775 4290 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
I0117 04:26:36.185091 4290 trace.go:205] Trace[333656479]: "GuaranteedUpdate etcd3" audit-id:0a94d052-49c1-40c2-a1f3-8bdacccbd6e9,key:/leases/kube-node-lease/lima-rancher-desktop,type:*coordination.Lease (17-Jan-2023 04:26:26.184) (total time: 10000ms):
Trace[333656479]: ---"Txn call finished" err:context deadline exceeded 9999ms (04:26:36.185)
Trace[333656479]: [10.000193713s] [10.000193713s] END
E0117 04:26:36.197602 4290 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 13.941958ms, panicked: false, err: context deadline exceeded, panic-reason: <nil>
E0117 04:26:36.196928 4290 writers.go:131] apiserver was unable to write a fallback JSON response: http: Handler timeout
I0117 04:26:36.199085 4290 trace.go:205] Trace[1183966381]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/lima-rancher-desktop,user-agent:k3s/v1.25.4+k3s1 (linux/arm64) kubernetes/0dc6333,audit-id:0a94d052-49c1-40c2-a1f3-8bdacccbd6e9,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf,application/json,protocol:HTTP/2.0 (17-Jan-2023 04:26:26.183) (total time: 10015ms):
Trace[1183966381]: ---"Write to database call finished" len:509,err:Timeout: request did not complete within requested timeout - context deadline exceeded 9998ms (04:26:36.183)
Trace[1183966381]: [10.015928213s] [10.015928213s] END
E0117 04:26:36.199699 4290 timeout.go:141] post-timeout activity - time-elapsed: 16.136125ms, PUT "/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/lima-rancher-desktop" result: <nil>
Note: we have been able to avoid this hanging issue by switching to the 9p mount type in Lima. I'm not sure if it completely fixes it or just makes it occur less often; time will tell from our users. But my suggestion to others affected by this is to try the 9p mount. One caveat: the 9p mount does not support symlinks in volumes.
@ryancurrah how do you enable 9p? I read about it here i.e.
On macOS an alternative file sharing mechanism using 9p instead of reverse-sshfs has been implemented. It is disabled by default. Talk to us on Slack if you want to help us testing it.
But I wasn't able to find the specifics on how to enable it.
I have the same problem.
In detail, a co-worker and I upgraded macOS to 13.0 and started hitting it. We upgraded to 13.1; his machine recovered, but mine did not.
Finally, I recovered by switching mountType to 9p.
Docker containers ran normally under plain Lima installed via Homebrew, even though its mountType is also null.
@lakamsani edit this file and add a top-level mountType entry:
~/Library/Application Support/rancher-desktop/lima/_config/override.yaml
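It should contain just the top-level key (the same two lines others post later in this thread):
---
mountType: 9p
You'll likely need to quit and restart Rancher Desktop for the override to take effect.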
I ran into the same issue too, when doing a "pnpm install" in a docker container after mounting a custom workdir into Lima, on my macOS 13.1 (Intel). So I think this is not related to Intel or M1. I can reproduce this issue exactly, every time, using the same steps. I also checked the logs under Rancher Desktop and it seems no errors were logged.
For me, the hang only occurs when using the default mountType (which should be null, per ~/Library/Application Support/rancher-desktop/lima/0/lima.yaml) and running an npm install command inside a docker container with a -v custom volume mount. I also wrote a Dockerfile to do almost the same thing to test, but the problem disappeared. Finally I changed the Lima mountType to 9p and everything seems to be OK now.
This was after upgrading to Ventura 13.2, coming from 12.x; I never ran into this problem on 12.x.
I'm running into the same issue. I'm doing a massive amount of file activity along with network activity inside a container. The I/O gets hung, and then docker ps becomes unresponsive. I try to quit the desktop, which also hangs; to get it to quit properly:
# kill the leftover rancher-desktop ssh sessions (e.g. reverse-sshfs mounts) so the app can quit
ps auxww | grep rancher | grep ssh | awk '{print $2}' | xargs kill
On restart, qemu looks like it comes up properly, but the docker socket is still unresponsive. A second quit and restart works fine. I guess I'll try the 9p thing. I don't have an override.yaml, so I'm assuming it should look like:
---
mountType: 9p
Answered my own question:
cat ~/"Library/Application Support/rancher-desktop/lima/_config/override.yaml"
---
mountType: 9p
ps auxww | grep rancher | grep ssh now shows nothing while doing disk I/O.
Hello, experiencing the same issue, but on an Intel CPU and macOS Ventura... FYI
> Hello, experiencing the same issue, but on an Intel CPU and macOS Ventura... FYI
I should have clarified that I'm on Intel as well. The 9p change made a huge difference.
Unfortunately, 9p caused other issues, so it's unusable for me.
Update: I upgraded to Ventura 13.2 and don't have the "freezing" problem anymore, without any override...
Hitting the same hang problem on 13.2 on an Intel Mac: Docker freezes and I can't quit rancher-desktop.
> Hitting the same hang problem on 13.2 on an Intel Mac: Docker freezes and I can't quit rancher-desktop.
In a terminal, do a ps and grep for rancher. You will see a bunch of ssh sessions; kill them off and Rancher will become responsive. Once I made the change to 9p, all these hang issues went away.
> In a terminal, do a ps and grep for rancher. You will see a bunch of ssh sessions; kill them off and Rancher will become responsive. Once I made the change to 9p, all these hang issues went away.
Thanks, after adding a new override.yaml it works for me!
cat ~/Library/Application\ Support/rancher-desktop/lima/_config/override.yaml
---
mountType: 9p
I have been experiencing a similar problem on and off for the past month or two. I was originally discussing it in the rancher-desktop Slack channel, but after finding this issue I believe it's the same as what I'm experiencing.
I find the bug to be easily reproducible in my case:
- Rancher Desktop: 1.8.1
- macOS: Ventura 13.1
- Container runtime: dockerd (moby) [I have not tested recently with containerd/nerdctl; will try this]
- Rancher Kubernetes: disabled (doesn't matter; I've seen this issue with k8s enabled as well)
I get the same behavior as described above: existing containers freeze and virtually all commands hang (docker ps, docker image ls, rdctl shell; nothing works except simple stuff like docker version).
Here is what I can note about reproducing the problem (at least in my case):
- Only happens when running multiple containers simultaneously
- Containers are running terraform provisioning via ansible (I/O and network usage) in interactive mode (docker run -it) with a few env vars passed in (probably not relevant); see the sketch after this list
- Each container has multiple volumes mounted, but I am careful to never mount the same host volume read/write to two different containers (sometimes I mount the same volume to multiple containers read-only)
- I increased the RAM allowance for the rancher VM all the way up to 16GB, but this did not help (I have verified that my machine's RAM is not being used up either; plenty of capacity left)
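To illustrate the kind of invocation described above, a hypothetical sketch (the image name, env var, volume paths, and playbook name are all illustrative, not the actual setup; each command runs in its own terminal):
# container A: shared modules mounted read-only, its own workspace read/write
docker run -it --rm -e ANSIBLE_FORCE_COLOR=1 \
  -v "$HOME/shared-modules:/modules:ro" \
  -v "$HOME/envs/env-a:/workspace" \
  -w /workspace my-terraform-ansible-image ansible-playbook provision.yml
# container B: same read-only mount, different read/write workspace
docker run -it --rm -e ANSIBLE_FORCE_COLOR=1 \
  -v "$HOME/shared-modules:/modules:ro" \
  -v "$HOME/envs/env-b:/workspace" \
  -w /workspace my-terraform-ansible-image ansible-playbook provision.yml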
About the suggested workaround:
- I did attempt the mountType: 9p workaround; it did successfully prevent the container runtime from hanging. However, it caused my terraform provider to fatally crash (every time), so this method is unusable for me.
Same here: Rancher Desktop 1.9.1, Ventura 13.4.1 (c).
Likewise, Rancher Desktop randomly freezes for me, more often than not after I leave it running unused for a while, and neither nerdctl nor rdctl commands will respond until I restart the application (tearing down the VM, etc.).
I'm currently on Rancher Desktop 1.9.1 & on macOS Ventura 13.5.1, running on Apple silicon (M2 Pro). I don't have Kubernetes enabled, and I'm using the containerd runtime, with VZ emulation (Rosetta support enabled) & virtiofs mounting (I did have other types of problems before when using 9p, mostly related to user mappings & permissions, so I'd like to avoid going back to that, and reverse-sshfs was unbearably slow!).
Let me know if you'd like me to gather any information when RD hangs, for debugging purposes. Thanks!
Same issue here. Exactly same environment as @juanpalaciosascend (but M1 pro)
Same for me; a factory reset did fix it for me, though.
A factory reset fixes it because it probably reverts to QEMU, reverse-sshfs, etc., but if you apply the settings mentioned (VZ, virtiofs, ...) again, the problem will probably come back.
I've seen most of the problems I've been experiencing go away (I want to say entirely, but it might still be a little too early for that) after switching back to the dockerd (moby) runtime, away from containerd.
All other settings (e.g. VZ framework, Rosetta support enabled, virtiofs volumes, Kubernetes disabled, etc.) remain the same, so that leads me to believe the problem that's causing Rancher Desktop to freeze revolves around the use of containerd.