[BUG] After stopping CRC the Kube context is left in an inconsistent state, causing timeouts

Open deboer-tim opened this issue 5 years ago • 11 comments

General information

  • OS: macOS
  • Hypervisor: hyperkit
  • Did you run crc setup before starting it (Yes/No)? Yes
  • Running CRC on: Laptop

CRC version

CodeReady Containers version: 1.15.0+e317bed
OpenShift version: 4.5.7 (embedded in binary)

CRC status

DEBU CodeReady Containers version: 1.15.0+e317bed
DEBU OpenShift version: 4.5.7 (embedded in binary)
CRC VM:          Stopped
OpenShift:       Stopped
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     12.8GB
Cache Directory: /Users/deboer/.crc/cache

CRC config

no output

Host Operating System

ProductName:    Mac OS X
ProductVersion: 10.15.6
BuildVersion:   19G2021

Steps to reproduce

  1. crc start
  2. crc stop
  3. kubectl get pods, odo push, or basically anything that uses the kube context

Expected

If I connect to a remote OpenShift cluster or use other local Kube tools and then disconnect/stop, the Kube context is left pointing to a cluster that I can't connect to anymore, but it 'fails fast': tools that try to connect fail immediately.

e.g. after stopping minikube, running 'kubectl get pods' immediately responds with: "The connection to the server localhost:8080 was refused - did you specify the right host or port?" I expect CRC to have the same behaviour.

Actual

After stopping CRC the Kube context is left pointing to a cluster (api-crc-testing or api.crc.testing) on a bridge network (192.168.*). For some reason clients can't tell that this host no longer exists, so connections to it don't fail fast, which eventually causes timeouts on the client side. This is bad enough with kubectl (20s timeout?), but odo has an even longer timeout (4min?), which makes it appear to hang and renders it unusable.
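To illustrate the difference (a minimal sketch, not CRC code; the addresses and port below are just examples): a dial to a port nobody listens on gets an immediate TCP reset and fails fast, while a dial to an on-link address that no host answers for can block until the client's own timeout fires.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Example addresses only (assumptions, not CRC specifics):
	// 127.0.0.1:6443    - nothing listening: kernel replies with RST, fails fast
	// 192.168.64.2:6443 - stale on-link address: packets may go unanswered, so
	//                     the dial blocks until the timeout (or fails with
	//                     "no route to host", depending on the OS)
	for _, addr := range []string{"127.0.0.1:6443", "192.168.64.2:6443"} {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err == nil {
			conn.Close()
		}
		fmt.Printf("%s: err=%v after %s\n", addr, err, time.Since(start).Round(time.Millisecond))
	}
}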

When stopping CRC please remove the kube context, remove the bridge network, remove the host resolution, or do something similar so that clients can tell it doesn't exist or will fail immediately trying to connect.

deboer-tim avatar Oct 02 '20 15:10 deboer-tim

When stopping CRC please remove the kube context, remove the bridge network, remove the host resolution, or do something similar so that clients can tell it doesn't exist or will fail immediately trying to connect.

@praveenkumar any idea why clients don't get a 'Host unreachable' or 'Connection refused' response? Also, would removing the context be possible?

gbraad avatar Oct 06 '20 08:10 gbraad

I tested this on Linux (I will check on the Mac as well), but I didn't see the long wait described in the issue.

$ oc whoami
kube:admin

$ crc stop
INFO Stopping the OpenShift cluster, this may take a few minutes... 
Stopped the OpenShift cluster

$ time oc whoami -v=10
I1007 14:03:42.797261  693344 loader.go:375] Config loaded from file:  /home/prkumar/.kube/config
I1007 14:03:42.798023  693344 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: oc/openshift (linux/amd64) kubernetes/d7f3ccf" -H "Authorization: Bearer oUurQFo7e5xjPoz1h3QPFUGVBLL8tEaXBquoz9oaans" 'https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~'
I1007 14:03:45.905233  693344 round_trippers.go:443] GET https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~  in 3107 milliseconds
I1007 14:03:45.905329  693344 round_trippers.go:449] Response Headers:
I1007 14:03:45.905665  693344 helpers.go:234] Connection error: Get https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~: dial tcp 192.168.130.11:6443: connect: no route to host
F1007 14:03:45.905769  693344 helpers.go:115] Unable to connect to the server: dial tcp 192.168.130.11:6443: connect: no route to host

real	0m3.233s
user	0m0.152s
sys	0m0.038s

$ time odo version -v=9
I1007 14:05:06.924601  693547 preference.go:165] The path for preference file is /home/prkumar/.odo/preference.yaml
I1007 14:05:06.924638  693547 occlient.go:448] Trying to connect to server api.crc.testing:6443
I1007 14:05:07.925073  693547 occlient.go:451] unable to connect to server: dial tcp 192.168.130.11:6443: i/o timeout
odo v1.1.3 (44440eeac)

real	0m1.106s
user	0m0.138s
sys	0m0.038s

praveenkumar avatar Oct 07 '20 08:10 praveenkumar

What I see is below - when the context points to a stopped docker-desktop (or any other context) it fails fast. The CRC contexts are fine while I'm using CRC, but time out after I stop it. Interestingly enough, if I switch the context to Minikube immediately after running CRC I see the same problem - but if I start and then stop Minikube the problem goes away. This leads me to think there is some hyperkit/network cleanup that Minikube is doing but CRC is not.

deboer-mac:crc-macos-1.15.0-amd64 deboer$ kubectl config use-context docker-desktop
Switched to context "docker-desktop".
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
The connection to the server kubernetes.docker.internal:6443 was refused - did you specify the right host or port?

real	0m0.062s
user	0m0.057s
sys	0m0.017s
deboer-mac:crc-macos-1.15.0-amd64 deboer$ ./crc start
...
Started the OpenShift cluster
WARN The cluster might report a degraded or error state. This is expected since several operators have been disabled to lower the resource usage. For more information, please consult the documentation
deboer-mac:crc-macos-1.15.0-amd64 deboer$ kubectl config use-context crc-admin
Switched to context "crc-admin".
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
No resources found in default namespace.

real	0m2.165s
user	0m0.145s
sys	0m0.175s
deboer-mac:crc-macos-1.15.0-amd64 deboer$ ./crc stop
Stopping the OpenShift cluster, this may take a few minutes...
Stopped the OpenShift cluster
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
Unable to connect to the server: dial tcp 192.168.64.2:6443: i/o timeout

real	0m30.209s
user	0m0.101s
sys	0m0.063s

deboer-tim avatar Oct 07 '20 15:10 deboer-tim

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 06 '20 17:12 stale[bot]

Found this bug entry after running into the same issue on my Mac with CRC 1.20.0. Running "kubectl get pods" failed with "Unable to connect to the server: dial tcp 192.168.64.2:6443: i/o timeout" after stopping CRC and logging into another k8s cluster. Thanks to @deboer-tim's comment above I found I could fix the issue as follows:

  1. Determine the current context: `kubectl config current-context`. This was "sample-app/api-crc-testing:6443/kube:admin" for me.

  2. Get the list of current contexts and take note of the one you want to use: `kubectl config get-contexts`

  3. Switch to that context: `kubectl config use-context context-name`. Yup, use-context, not set-context, which does something different.

After this, `kubectl get pods` worked as expected again.
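For anyone who wants to script this recovery, the same switch can be done with client-go's clientcmd helpers (a sketch under assumptions: the kubeconfig path and target context name are placeholders, not anything CRC-specific):

package main

import (
	"log"
	"path/filepath"

	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Placeholder values; adjust for your environment.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	target := "docker-desktop"

	cfg, err := clientcmd.LoadFromFile(kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	if _, ok := cfg.Contexts[target]; !ok {
		log.Fatalf("context %q not found in %s", target, kubeconfig)
	}
	// Equivalent to `kubectl config use-context docker-desktop`.
	cfg.CurrentContext = target
	if err := clientcmd.WriteToFile(*cfg, kubeconfig); err != nil {
		log.Fatal(err)
	}
}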

cbolik avatar Jan 14 '21 09:01 cbolik

I would like to look into this issue. Could someone please assign it to me?

rohanKanojia avatar Oct 10 '24 15:10 rohanKanojia

I would like to look into this issue. Could someone please assign it to me?

Done

praveenkumar avatar Oct 10 '24 15:10 praveenkumar

I can reproduce this issue. When I do crc stop and then try to access pods using kubectl get pods, I get these errors after some waiting:

E1010 21:50:00.401494  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": net/http: TLS handshake timeout
E1010 21:50:32.402863  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35508->127.0.0.1:6443: read: connection reset by peer
E1010 21:51:04.403878  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:54090->127.0.0.1:6443: read: connection reset by peer
E1010 21:51:36.405070  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:34104->127.0.0.1:6443: read: connection reset by peer
E1010 21:52:08.406982  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:58892->127.0.0.1:6443: read: connection reset by peer
error: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:58892->127.0.0.1:6443: read: connection reset by peer

I think this issue is happening because crc is not cleaning up the current-context field in ~/.kube/config. Here are my observations of how the crc and minikube start/stop commands behave with respect to the kubeconfig:

CRC

  • current context in kubeconfig after crc start
    current-context: default/api-crc-testing:6443/kubeadmin
    
  • current context in kubeconfig after crc stop
    current-context: default/api-crc-testing:6443/kubeadmin
    

Minikube

  • current context in kubeconfig after minikube start
    current-context: minikube
    
  • current context in kubeconfig after minikube stop
    current-context: ""
    

It seems crc does not perform any kubeconfig cleanup during the crc stop command. I do see code for cleaning up the kubeconfig: https://github.com/crc-org/crc/blob/5611baa4fc9614f838da088fe72f80a369a4fe9d/pkg/crc/machine/kubeconfig.go#L230

It gets invoked by the crc delete command here: https://github.com/crc-org/crc/blob/5611baa4fc9614f838da088fe72f80a369a4fe9d/pkg/crc/machine/delete.go#L38

When I compare this with minikube, minikube seems to clean up the kubeconfig for both the stop and delete commands.

I see these two ways to solve this issue:

  • Make the behavior of `crc` consistent with `minikube`: also invoke the `cleanKubeconfig` method while stopping the cluster.
  • While stopping the cluster, only set the `current-context` field in the kubeconfig to `""`. Keep `Clusters`, `AuthInfos` and `Contexts` inside the kubeconfig.
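A minimal sketch of the second option, assuming the cleanup would go through client-go's clientcmd package like the existing cleanup code does (the function name and its placement are assumptions, not existing CRC code):

package machine // assumed placement; CRC's real package layout may differ

import "k8s.io/client-go/tools/clientcmd"

// unsetCurrentContext is a hypothetical helper for `crc stop`: it blanks
// out current-context but keeps Clusters, AuthInfos and Contexts so that
// a later `crc start` can reuse them.
func unsetCurrentContext(kubeconfigPath string) error {
	cfg, err := clientcmd.LoadFromFile(kubeconfigPath)
	if err != nil {
		return err
	}
	cfg.CurrentContext = "" // matches minikube's post-stop state shown above
	return clientcmd.WriteToFile(*cfg, kubeconfigPath)
}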

rohanKanojia avatar Oct 10 '24 17:10 rohanKanojia

I see these two ways to solve this issue:

* Make the behavior of `crc` consistent with `minikube`: also invoke the `cleanKubeconfig` method while stopping the cluster.

* While stopping the cluster, only set the `current-context` field in the kubeconfig to `""`. Keep `Clusters`, `AuthInfos` and `Contexts` inside the kubeconfig.

If it's easy to regenerate Clusters, AuthInfos and Contexts on cluster start, we can go with the first option and remove everything, especially if the code for that already exists.

cfergeau avatar Oct 11 '24 09:10 cfergeau

I have made the changes (https://github.com/rohankanojia-forks/crc/commit/473485b47f94262cac9e1004e65a54ec163a0633), but I'm seeing some strange behavior (not sure if it's due to my code changes or whether I'm testing it incorrectly).

When I do crc start after a crc stop (which has cleaned up the kubeconfig), I get this error:

# Stop CRC cluster
/home/rokumar/go/src/github.com/crc-org/crc/out/linux-amd64/crc stop
INFO Stopping the instance, this may take a few minutes... 
Stopped the instance

# Check kube config
~ : $ cat .kube/config 
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
# Start cluster again
~ : $ /home/rokumar/go/src/github.com/crc-org/crc/out/linux-amd64/crc start
WARN A new version (2.42.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.42.0/crc-linux-amd64.tar.xz 
INFO Using bundle path /home/rokumar/.crc/cache/crc_okd_libvirt_4.15.0-0.okd-2024-02-23-163410_amd64.crcbundle 
INFO Checking if running as non-root              
INFO Checking if running inside WSL2              
INFO Checking if crc-admin-helper executable is cached 
INFO Checking if running on a supported CPU architecture 
INFO Checking if crc executable symlink exists    
INFO Checking minimum RAM requirements            
INFO Check if Podman binary exists in: /home/rokumar/.crc/bin/oc 
INFO Checking if Virtualization is enabled        
INFO Checking if KVM is enabled                   
INFO Checking if libvirt is installed             
INFO Checking if user is part of libvirt group    
INFO Checking if active user/process is currently part of the libvirt group 
INFO Checking if libvirt daemon is running        
INFO Checking if a supported libvirt version is installed 
INFO Checking if crc-driver-libvirt is installed  
INFO Checking crc daemon systemd socket units     
INFO Checking if vsock is correctly configured    
WARN Preflight checks failed during `crc start`, please try to run `crc setup` first in case you haven't done so yet 
capabilities are not correct for /home/rokumar/go/src/github.com/crc-org/crc/out/linux-amd64/crc

When using the regular crc binary I'm able to start the cluster successfully.

rohanKanojia avatar Oct 11 '24 17:10 rohanKanojia

@rohanKanojia you need to run crc setup every time you build a new binary.

praveenkumar avatar Oct 16 '24 08:10 praveenkumar