[BUG] After stopping CRC the Kube context is left in inconsistent state causing timeouts
General information
- OS: macOS
- Hypervisor: hyperkit
- Did you run `crc setup` before starting it (Yes/No)? Yes
- Running CRC on: Laptop
CRC version
CodeReady Containers version: 1.15.0+e317bed
OpenShift version: 4.5.7 (embedded in binary)
CRC status
DEBU CodeReady Containers version: 1.15.0+e317bed
DEBU OpenShift version: 4.5.7 (embedded in binary)
CRC VM: Stopped
OpenShift: Stopped
Disk Usage: 0B of 0B (Inside the CRC VM)
Cache Usage: 12.8GB
Cache Directory: /Users/deboer/.crc/cache
CRC config
no output
Host Operating System
ProductName: Mac OS X
ProductVersion: 10.15.6
BuildVersion: 19G2021
Steps to reproduce
- crc start
- crc stop
- kubectl get pods, odo push, or basically anything that uses the kube context
Expected
If I connect to a remote OpenShift cluster or use other local Kube tools and then disconnect/stop, the Kube context is left pointing to a cluster that I can't connect to anymore, but it 'fails fast': tools that try to connect fail immediately.
e.g. after stopping minikube and running 'kubectl get pods' it immediately responds with:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
I expect CRC to have the same behaviour.
Actual
After stopping CRC, the Kube context is left pointing to a cluster (api-crc-testing or api.crc.testing) on a bridge network (192.168.*). For some reason clients can't tell this host no longer exists, and connections to it don't fail fast, which eventually causes timeouts on the client side. This is bad enough with kubectl (20s timeout?), but odo has an even longer timeout (4min?), which makes it unusable and makes it appear to hang.
When stopping CRC please remove the kube context, remove the bridge network, remove the host resolution, or do something similar so that clients can tell it doesn't exist or will fail immediately trying to connect.
@praveenkumar any idea what causes the response not to reply 'Host unreachable' or 'Connection refused'? Also, would removing the context be possible?
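For what it's worth, the fail-fast vs. hang distinction comes down to what the network sends back: a TCP RST ("connection refused") or an ICMP "no route to host" lets the client fail immediately, while silently dropped SYN packets force the client to wait out its own timeout. A minimal Python sketch of the difference (the probed host/port here are illustrative, not CRC's actual endpoints):

```python
import socket
import time

def probe(host, port, timeout=3.0):
    """Attempt a TCP connection; report the failure mode and elapsed time."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"        # peer sent RST: the client fails fast
    except socket.timeout:
        outcome = "timeout"        # SYNs silently dropped: client waits out its timeout
    except OSError as exc:
        outcome = f"error: {exc}"  # e.g. ICMP "no route to host": also fails fast
    return outcome, time.monotonic() - start

# Find a local port that is currently closed, then probe it; the kernel
# answers with an RST, so the failure is immediate ("refused").
s = socket.socket()
s.bind(("127.0.0.1", 0))
closed_port = s.getsockname()[1]
s.close()
print(probe("127.0.0.1", closed_port))
```

If the VM and its bridge network disappear without anything left behind to answer (or reject) packets for 192.168.*, clients land in the "timeout" case above, which would explain the long hangs.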
I tested this on Linux and will also check on the Mac, but I didn't see wait times as long as described in the issue.
$ oc whoami
kube:admin
$ crc stop
INFO Stopping the OpenShift cluster, this may take a few minutes...
Stopped the OpenShift cluster
$ time oc whoami -v=10
I1007 14:03:42.797261 693344 loader.go:375] Config loaded from file: /home/prkumar/.kube/config
I1007 14:03:42.798023 693344 round_trippers.go:423] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: oc/openshift (linux/amd64) kubernetes/d7f3ccf" -H "Authorization: Bearer oUurQFo7e5xjPoz1h3QPFUGVBLL8tEaXBquoz9oaans" 'https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~'
I1007 14:03:45.905233 693344 round_trippers.go:443] GET https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~ in 3107 milliseconds
I1007 14:03:45.905329 693344 round_trippers.go:449] Response Headers:
I1007 14:03:45.905665 693344 helpers.go:234] Connection error: Get https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~: dial tcp 192.168.130.11:6443: connect: no route to host
F1007 14:03:45.905769 693344 helpers.go:115] Unable to connect to the server: dial tcp 192.168.130.11:6443: connect: no route to host
real 0m3.233s
user 0m0.152s
sys 0m0.038s
$ time odo version -v=9
I1007 14:05:06.924601 693547 preference.go:165] The path for preference file is /home/prkumar/.odo/preference.yaml
I1007 14:05:06.924638 693547 occlient.go:448] Trying to connect to server api.crc.testing:6443
I1007 14:05:07.925073 693547 occlient.go:451] unable to connect to server: dial tcp 192.168.130.11:6443: i/o timeout
odo v1.1.3 (44440eeac)
real 0m1.106s
user 0m0.138s
sys 0m0.038s
What I see is below: when the context points to a stopped docker-desktop (or any other context) it fails fast. CRC contexts are fine while in use, but time out after I stop CRC. Interestingly enough, if I switch context to Minikube immediately after running CRC I see the same problem, but if I start Minikube and then stop it, the problem goes away. This leads me to think there is some hyperkit/network cleanup that Minikube is doing but CRC is not.
deboer-mac:crc-macos-1.15.0-amd64 deboer$ kubectl config use-context docker-desktop
Switched to context "docker-desktop".
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
The connection to the server kubernetes.docker.internal:6443 was refused - did you specify the right host or port?
real 0m0.062s
user 0m0.057s
sys 0m0.017s
deboer-mac:crc-macos-1.15.0-amd64 deboer$ ./crc start
...
Started the OpenShift cluster
WARN The cluster might report a degraded or error state. This is expected since several operators have been disabled to lower the resource usage. For more information, please consult the documentation
deboer-mac:crc-macos-1.15.0-amd64 deboer$ kubectl config use-context crc-admin
Switched to context "crc-admin".
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
No resources found in default namespace.
real 0m2.165s
user 0m0.145s
sys 0m0.175s
deboer-mac:crc-macos-1.15.0-amd64 deboer$ ./crc stop
Stopping the OpenShift cluster, this may take a few minutes...
Stopped the OpenShift cluster
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
Unable to connect to the server: dial tcp 192.168.64.2:6443: i/o timeout
real 0m30.209s
user 0m0.101s
sys 0m0.063s
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Found this bug entry after running into the same issue on my Mac with CRC 1.20.0. Running "kubectl get pods" failed with "Unable to connect to the server: dial tcp 192.168.64.2:6443: i/o timeout" after stopping CRC and logging into another k8s cluster. Thanks to @deboer-tim's comment above, I found I could fix the issue as follows:
1. Determine the current context: `kubectl config current-context`. This was "sample-app/api-crc-testing:6443/kube:admin" for me.
2. Get the list of current contexts and take note of the one you want to use: `kubectl config get-contexts`
3. Switch to that context: `kubectl config use-context context-name`. Yup, use-context, not set-context, which does something different.
After this kubectl get pods again worked as expected.
I would like to look into this issue. Could someone please assign it to me?
Done
I can reproduce this issue. When I do crc stop and then try to access pods using kubectl get pods, I get these errors after some waiting:
E1010 21:50:00.401494 159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": net/http: TLS handshake timeout
E1010 21:50:32.402863 159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35508->127.0.0.1:6443: read: connection reset by peer
E1010 21:51:04.403878 159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:54090->127.0.0.1:6443: read: connection reset by peer
E1010 21:51:36.405070 159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:34104->127.0.0.1:6443: read: connection reset by peer
E1010 21:52:08.406982 159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:58892->127.0.0.1:6443: read: connection reset by peer
error: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:58892->127.0.0.1:6443: read: connection reset by peer
I think this issue is happening because crc is not cleaning up the current-context field in ~/.kube/config. Here are my observations of how the crc and minikube start/stop commands treat the kubeconfig:
CRC
- current context in kubeconfig after `crc start`: `current-context: default/api-crc-testing:6443/kubeadmin`
- current context in kubeconfig after `crc stop`: `current-context: default/api-crc-testing:6443/kubeadmin`

Minikube
- current context in kubeconfig after `minikube start`: `current-context: minikube`
- current context in kubeconfig after `minikube stop`: `current-context: ""`
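For reference, the current-context values above can be inspected programmatically. A small Python sketch using a naive line-level parse (not a real kubeconfig parser; real tooling should use a kubeconfig-aware library):

```python
import re

def current_context(kubeconfig_text):
    """Return the top-level current-context value from kubeconfig text,
    or None if the key is absent. Naive line-level parse for illustration."""
    m = re.search(r'^current-context:\s*"?([^"\r\n]*)"?\s*$',
                  kubeconfig_text, flags=re.MULTILINE)
    return m.group(1) if m else None

# After minikube stop, the key is present but blank:
print(current_context('apiVersion: v1\ncurrent-context: ""\nkind: Config\n'))
# After crc stop, it still names the stopped cluster:
print(current_context("current-context: default/api-crc-testing:6443/kubeadmin\n"))
```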
It seems crc does not clean up the kubeconfig during the crc stop command. I do see code for cleaning up the kubeconfig:
https://github.com/crc-org/crc/blob/5611baa4fc9614f838da088fe72f80a369a4fe9d/pkg/crc/machine/kubeconfig.go#L230
It gets invoked by the crc delete command here:
https://github.com/crc-org/crc/blob/5611baa4fc9614f838da088fe72f80a369a4fe9d/pkg/crc/machine/delete.go#L38
When I compare it with minikube, minikube seems to clean up the kubeconfig for both the stop and delete commands.
I see these two ways to solve this issue:
- Make the behavior of `crc` consistent with `minikube`: also invoke the `cleanKubeconfig` method while stopping the cluster.
- While stopping the cluster, only set the `current-context` field in the kubeconfig to `""`. Keep `Clusters`, `AuthInfos` and `Contexts` inside the kubeconfig.
If it's easy to regenerate Clusters, AuthInfos and Contexts on cluster start, we can go with the first option and remove everything, especially if the code for that already exists.
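The second option can be sketched as a minimal in-place edit of the kubeconfig file. This is a line-level rewrite for illustration only (the file name and contents below are hypothetical); crc's actual Go implementation would go through its own kubeconfig helpers:

```python
import re
from pathlib import Path

def clear_current_context(path):
    """Blank the top-level current-context entry of a kubeconfig file while
    leaving clusters, contexts, and users untouched. Line-level sketch;
    real tooling should use a kubeconfig-aware library."""
    p = Path(path)
    text = p.read_text()
    p.write_text(re.sub(r'^current-context:.*$', 'current-context: ""',
                        text, count=1, flags=re.MULTILINE))

# Demo against a throwaway file whose contents mirror the issue:
demo = Path("demo-kubeconfig")
demo.write_text("apiVersion: v1\n"
                "current-context: default/api-crc-testing:6443/kubeadmin\n"
                "kind: Config\n")
clear_current_context(demo)
print(demo.read_text())
```

This mirrors what minikube leaves behind after `minikube stop`: the context entries survive, but nothing is selected, so clients fail immediately instead of dialing a dead address.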
I have made the changes (https://github.com/rohankanojia-forks/crc/commit/473485b47f94262cac9e1004e65a54ec163a0633) but I'm seeing strange behavior (not sure if it's due to my code changes or whether I'm testing it incorrectly).
When I do crc start after crc stop (which has cleaned up the kubeconfig), I get this error:
# Stop CRC cluster
/home/rokumar/go/src/github.com/crc-org/crc/out/linux-amd64/crc stop
INFO Stopping the instance, this may take a few minutes...
Stopped the instance
# Check kube config
~ : $ cat .kube/config
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
# Start cluster again
~ : $ /home/rokumar/go/src/github.com/crc-org/crc/out/linux-amd64/crc start
WARN A new version (2.42.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.42.0/crc-linux-amd64.tar.xz
INFO Using bundle path /home/rokumar/.crc/cache/crc_okd_libvirt_4.15.0-0.okd-2024-02-23-163410_amd64.crcbundle
INFO Checking if running as non-root
INFO Checking if running inside WSL2
INFO Checking if crc-admin-helper executable is cached
INFO Checking if running on a supported CPU architecture
INFO Checking if crc executable symlink exists
INFO Checking minimum RAM requirements
INFO Check if Podman binary exists in: /home/rokumar/.crc/bin/oc
INFO Checking if Virtualization is enabled
INFO Checking if KVM is enabled
INFO Checking if libvirt is installed
INFO Checking if user is part of libvirt group
INFO Checking if active user/process is currently part of the libvirt group
INFO Checking if libvirt daemon is running
INFO Checking if a supported libvirt version is installed
INFO Checking if crc-driver-libvirt is installed
INFO Checking crc daemon systemd socket units
INFO Checking if vsock is correctly configured
WARN Preflight checks failed during `crc start`, please try to run `crc setup` first in case you haven't done so yet
capabilities are not correct for /home/rokumar/go/src/github.com/crc-org/crc/out/linux-amd64/crc
When using the regular crc binary I'm able to start the cluster successfully.
@rohanKanojia you need to perform crc setup every time you build a new binary.