
OpenShift Local version 4.18.2 goes to Unreachable status if the environment is idle for some time.

Open codersyacht opened this issue 9 months ago • 11 comments

General information

The issue only occurs after some time of idleness, roughly 1 hour.

[admin@ocp1 ~]$ crc status
CRC VM:          Running
OpenShift:       Unreachable (v4.18.2)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     28.13GB
Cache Directory: /home/admin/.crc/cache
[admin@ocp1 ~]$ 
[admin@ocp1 ~]$ sudo cat /var/log/libvirt/qemu/crc.log
2025-04-15 16:14:19.999+0000: Starting external device: virtiofsd
/usr/libexec/virtiofsd --fd=34 --shared-dir /home/admin
2025-04-15 16:14:20.011+0000: starting up libvirt version: 10.5.0, package: 7.5.el9_5 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2025-01-21-08:08:49, ), qemu version: 9.0.0qemu-kvm-9.0.0-10.el9_5.2, kernel: 5.14.0-427.42.1.el9_4.x86_64, hostname: ocp1.fyre.ibm.com
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-1-crc \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-crc/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-crc/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-crc/.config \
/usr/libexec/qemu-kvm \
-name guest=crc,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-crc/master-key.aes"}' \
-blockdev '{"driver":"file","filename":"/usr/share/edk2/ovmf/OVMF_CODE.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/crc_VARS.fd","node-name":"libvirt-pflash1-storage","read-only":false}' \
-machine pc-q35-rhel9.4.0,usb=off,dump-guest-core=off,memory-backend=pc.ram,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-storage,acpi=on \
-accel kvm \
  Booting `Red Hat Enterprise Linux CoreOS 418.94.202502250906-0 (ostree:0)'
[admin@ocp1 ~]$ oc login -u developer https://api.crc.testing:6443
error: dial tcp 127.0.0.1:6443: connect: connection refused - verify you have provided the correct host and port and that the server is currently running.

Operating System

Linux

Hypervisor

KVM

Did you run crc setup before crc start?

yes

Running on

Laptop

Steps to reproduce

A running crc instance abruptly stops responding after some time of idleness.

CRC version

2.49.0 (with OpenShift bundle 4.18.2)

CRC status

[admin@ocp1 ~]$ crc status --log-level debug
DEBU CRC version: 2.49.0+e843be                   
DEBU OpenShift version: 4.18.2                    
DEBU MicroShift version: 4.18.2                   
DEBU Running 'crc status'                         
CRC VM:          Running
OpenShift:       Unreachable (v4.18.2)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     28.13GB
Cache Directory: /home/admin/.crc/cache

CRC config

[admin@ocp1 ~]$ crc config view
- consent-telemetry                     : no
- cpus                                  : 8
- disk-size                             : 100
- enable-cluster-monitoring             : true
- memory                                : 32768
- pull-secret-file                      : /home/admin/apps/ocp/pull-secret.txt

Host Operating System

[admin@ocp1 ~]$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="9.5 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.5"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.5 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.5
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.5"

Expected behavior

crc should not stop.

Actual behavior

crc abruptly stops

CRC Logs

A restart works, but one is required roughly every hour.

Additional context

No response

codersyacht avatar Apr 15 '25 19:04 codersyacht

The issue is only seen when running with network-mode set to user. When it is set to system, with an HAProxy redirecting requests to the OpenShift IP and port, the issue does not occur.
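For reference, a minimal sketch of switching between the two modes, assuming the network-mode config property and that a cleanup/setup cycle is needed for the change to take effect (exact steps may vary by crc version):

# Sketch only: switch between user and system networking modes.
# Assumption: a cleanup/setup cycle is required to apply the change.
crc config set network-mode system   # or: crc config set network-mode user
crc cleanup
crc setup
crc start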

codersyacht avatar Apr 16 '25 02:04 codersyacht

When this happens, can you still ssh into the cluster? https://github.com/crc-org/engineering-docs/blob/main/content/Debugging.md#access-the-vm (the linux instructions are outdated…)

$ ssh -i ~/.crc/machines/crc/id_ed25519 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -p 2222 core@127.0.0.1

cfergeau avatar Apr 17 '25 07:04 cfergeau

@cfergeau

Thanks for the quick response. It does not work.

[admin@system1 crc]$ ls -lrt
total 21258324
-rw------- 1 admin admin          81 Apr 22 08:44 id_ed25519.pub
-rw------- 1 admin admin         387 Apr 22 08:44 id_ed25519
-rw------- 1 admin admin          23 Apr 22 08:44 kubeadmin-password
srwxr-xr-x 1 admin admin           0 Apr 22 08:44 docker.sock
-rw------- 1 admin admin         901 Apr 22 08:44 config.json
-rw------- 1 admin admin       15275 Apr 22 08:49 kubeconfig
-rw-r--r-- 1 qemu  qemu  21767782400 Apr 22 21:04 crc.qcow2
[admin@system1 crc]$ 
[admin@system1 crc]$ ssh -i ~/.crc/machines/crc/id_ed25519 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -p 2222 core@127.0.0.1
ssh: connect to host 127.0.0.1 port 2222: Connection refused

You can find some explanations for typical errors at this link:
            https://red.ht/support_rhel_ssh
[admin@system1 crc]$ 
[admin@system1 crc]$ crc status
CRC VM:          Running
OpenShift:       Unreachable (v4.18.2)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     28.13GB
Cache Directory: /home/admin/.crc/cache
[admin@system1 crc]$ 

codersyacht avatar Apr 23 '25 04:04 codersyacht

Can you check systemctl --user status crc-daemon.service, and also check the logs for that service with journalctl --user-unit crc-daemon.service?
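For convenience, the two checks in one place (the --no-pager and -b flags are just my additions, to limit the journal output to the current boot and make it easier to paste):

systemctl --user status crc-daemon.service
journalctl --user-unit crc-daemon.service --no-pager -b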

praveenkumar avatar Apr 25 '25 07:04 praveenkumar

Hi @praveenkumar @cfergeau, I've noticed the exact same thing on my setup:

[user@crc-host ~]$ crc status
CRC VM:          Running
OpenShift:       Unreachable (v4.18.2)
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     28.13GB
Cache Directory: /home/user/.crc/cache
[user@crc-host ~]$ oc get pod
The connection to the server api.crc.testing:6443 was refused - did you specify the right host or port?
[user@crc-host ~]$ ssh crc
ssh: connect to host 127.0.0.1 port 2222: Connection refused
[user@crc-host ~]$ cat ~/.ssh/config
Host crc
    Hostname 127.0.0.1
    Port 2222
    User core
    IdentityFile ~/.crc/machines/crc/id_ecdsa
    IdentityFile ~/.crc/machines/crc/id_ed25519
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

[user@crc-host ~]$

This is the version I'm using:

[user@crc-host ~]$ crc version
CRC version: 2.51.0+80aa80
OpenShift version: 4.18.2
MicroShift version: 4.18.2
[user@crc-host ~]$

The latest version points to CRC version 2.51.0 (above) with OpenShift bundle 4.18.2.

Any ideas why it is stuck / unreachable? It's also impossible to see the console of the VM:

[user@crc-host ~]$ sudo virsh console crc
error: internal error: character device serial0 is not using a PTY

[user@crc-host ~]$

So there are no logs whatsoever once the RHCOS guest VM enters this state, which makes it very hard to debug and find out what happened.
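As a starting point (my own suggestion, not something CRC documents), dumping the domain XML at least shows how serial0 is wired, which explains why virsh console cannot attach:

# Inspect the serial/console devices in the libvirt domain generated by crc
sudo virsh dumpxml crc | grep -A 4 '<serial'
sudo virsh dumpxml crc | grep -A 4 '<console'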

A workaround is to restart it every hour with a cronjob:

crc stop
crc start

But this is obviously far from being ideal.
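For reference, the hourly-restart workaround expressed as a crontab entry (crontab -e); the /usr/local/bin path is an assumption about where the crc binary is installed:

# Restart CRC at the top of every hour (sketch of the workaround above)
0 * * * * /usr/local/bin/crc stop && /usr/local/bin/crc start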

I'll give an older version, 2.48.0 (with OpenShift 4.18.1), a try and see if it solves it.

I'd be interested to hear your thoughts on how to debug it!

ofirc avatar Jun 17 '25 20:06 ofirc

When I use an older version, 2.48.0 (with OpenShift 4.18.1), everything seems to be healthy and it works:

[ofircohen@crc-host ~]$ crc status
CRC VM:          Running
OpenShift:       Running (v4.18.1)
Disk Usage:      27.87GB of 128.2GB (Inside the CRC VM)
Cache Usage:     27.93GB
Cache Directory: /home/ofircohen/.crc/cache
[ofircohen@crc-host ~]$ oc get pod
NAME                                     READY   STATUS    RESTARTS   AGE
wiz-integration-agent-6c4477f6d5-dbpdz   1/1     Running   0          4h15m
[ofircohen@crc-host ~]$ ssh crc uptime
Warning: Permanently added '[127.0.0.1]:2222' (ED25519) to the list of known hosts.
no such identity: /home/ofircohen/.crc/machines/crc/id_ecdsa: No such file or directory
 01:23:21 up  4:27,  0 users,  load average: 1.41, 1.09, 1.21
[ofircohen@crc-host ~]$

What could be the issue? Why does 4.18.2 freeze / get stuck? https://developers.redhat.com/content-gateway/rest/mirror/pub/openshift-v4/clients/crc/latest/crc-linux-amd64.tar.xz

Is there a way to debug / troubleshoot this?

Thanks!

ofirc avatar Jun 18 '25 01:06 ofirc

I can confirm that it works with crc 2.48.0 and OpenShift 4.18.1:

[user@crc-host ~]$ ssh crc uptime
Warning: Permanently added '[127.0.0.1]:2222' (ED25519) to the list of known hosts.
no such identity: /home/user/.crc/machines/crc/id_ecdsa: No such file or directory
 11:20:41 up 14:24,  0 users,  load average: 0.45, 0.73, 0.92
[user@crc-host ~]$ crc status
CRC VM:          Running
OpenShift:       Running (v4.18.1)
Disk Usage:      31.65GB of 128.2GB (Inside the CRC VM)
Cache Usage:     27.93GB
Cache Directory: /home/user/.crc/cache
[user@crc-host ~]$ crc version
WARN A new version (2.51.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.51.0/crc-linux-amd64.tar.xz
CRC version: 2.48.0+1aa46c
OpenShift version: 4.18.1
MicroShift version: 4.18.1
[user@crc-host ~]$

So there seems to be a regression with crc 2.51.0 with OpenShift 4.18.2.

This one works: https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.48.0/crc-linux-amd64.tar.xz

This one doesn't: https://developers.redhat.com/content-gateway/rest/mirror/pub/openshift-v4/clients/crc/latest/crc-linux-amd64.tar.xz

It would be nice to have more debugging/troubleshooting tooling around the libvirt/qemu VM, to be able to fetch more useful diagnostics.

ofirc avatar Jun 18 '25 11:06 ofirc

As a data point, can you check this https://github.com/crc-org/crc/issues/4730#issuecomment-2829644362 ?

cfergeau avatar Jun 18 '25 12:06 cfergeau

The systemctl --user status crc-daemon.service and journalctl --user-unit crc-daemon.service outputs were fine; there were Input/Output errors on the vsock because things got stuck on the guest VM side. There are no useful diagnostics from the logs or from the daemons, and since we don't forward systemd-journald or the kernel ring buffer (dmesg) back to the host, I'm afraid it's still a black box.

ofirc avatar Jun 18 '25 13:06 ofirc

I have set up an end-to-end guide on how to bring up this cluster on GCE: https://www.linkedin.com/posts/cohen-ofir_kubernetes-openshift-rhel-activity-7341589911390539776-cz6G?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAVWybwBi9G-qEFOErHtBmK_-tkEhlPZ_cc

Note that I had to use an extremely generous VM to accommodate the crc guest VM.

⚙️ My Setup (recommended for a smoother experience)

Host: GCE n2-standard-8 VM (32GiB RAM, 200GiB disk)
  ↳ Enable nested virtualization
  ↳ RHEL 9 image (rhel-cloud/rhel-9)

Guest CRC config: crc 2.48.0

curl -LO https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.48.0/crc-linux-amd64.tar.xz
crc config set memory 20000
crc config set cpus 8
crc config set disk-size 120

crc setup
crc start -p pull-secret.txt
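(For completeness, the extract/install steps between the download and the crc config commands above would look roughly like this; the /usr/local/bin destination is an assumption:)

# Unpack the downloaded bundle and put the crc binary on the PATH
tar -xJf crc-linux-amd64.tar.xz
sudo cp crc-linux-*-amd64/crc /usr/local/bin/
crc version   # should report 2.48.0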

As CRC is very resource-consuming:

[ofircohen@crc-host ~]$ crc status
CRC VM:          Running
OpenShift:       Running (v4.18.1)
Disk Usage:      48.79GB of 128.2GB (Inside the CRC VM)
Cache Usage:     27.93GB
Cache Directory: /home/ofircohen/.crc/cache
[ofircohen@crc-host ~]$

Notice how it expanded from 25GiB disk space to ~50GiB after just 2 days of running straight:

[user@crc-host ~]$ ssh crc uptime
Warning: Permanently added '[127.0.0.1]:2222' (ED25519) to the list of known hosts.
 22:49:23 up 2 days,  1:53,  0 users,  load average: 1.01, 0.83, 0.77
[user@crc-host ~]$

ofirc avatar Jun 19 '25 22:06 ofirc

@codersyacht

I was also facing the same issue. The problem was with crc-daemon.service: it runs in user scope and is stopped when the user session ends (logout).

This can be fixed by enabling "Linger" for the user

sudo loginctl enable-linger $USER

For more, read: enable-linger

Once enabled, you can restart crc to verify.
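A quick way to apply and verify this (the show-user check is my addition, not part of the original suggestion):

# Enable lingering for the current user and confirm it took effect
sudo loginctl enable-linger $USER
loginctl show-user $USER --property=Linger   # should print Linger=yes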

navaneethov avatar Jun 20 '25 12:06 navaneethov