[BUG][Ubuntu][Bare metal] OpenShift cluster becomes Unreachable
General information
- OS: Linux
- Hypervisor: QEMU / libvirt
- Did you run crc setup before starting it: Yes
- Running CRC on: Baremetal-Server
CRC version
CRC version: 2.20.0+f3a947
CRC status
DEBU CRC version: 2.20.0+f3a947
DEBU OpenShift version: 4.13.0
DEBU Podman version: 4.4.4
DEBU Running 'crc status'
CRC VM: Running
OpenShift: Unreachable (v4.13.0)
Disk Usage: 0B of 0B (Inside the CRC VM)
Cache Usage: 26.29GB
Cache Directory: /home/ubuntu/.crc/cache
CRC config
- consent-telemetry : no
- cpus : 12
- memory : 27648
- network-mode : user
Host Operating System
PRETTY_NAME="Ubuntu 22.04.2 LTS"
Steps to reproduce
- Follow the instructions listed in https://github.com/crc-org/crc/wiki/Debugging-guide
- crc ip returns 127.0.0.1
- ssh -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@127.0.0.1 returns core@127.0.0.1: Permission denied (publickey).
Expected
SSH into the VM should be possible. However, could the failure be due to the fact that crc ip returns the IP address of the loopback interface, when it should instead return a 192.168.x.x address?
ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 74:56:3c:56:f1:1e brd ff:ff:ff:ff:ff:ff
inet 162.19.88.183/24 metric 100 brd 162.19.88.255 scope global dynamic enp3s0f0
valid_lft 62233sec preferred_lft 62233sec
inet6 2001:41d0:306:2bb7::/56 scope global
valid_lft forever preferred_lft forever
inet6 fe80::7656:3cff:fe56:f11e/64 scope link
valid_lft forever preferred_lft forever
4: enp3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 74:56:3c:56:f1:1f brd ff:ff:ff:ff:ff:ff
5: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 52:54:00:d5:4d:5a brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
Worth noting that what prompted me to try to SSH into the VM is the fact that the OpenShift cluster is appearing as Unreachable
(I've read through this previous issue by the way: https://github.com/crc-org/crc/issues/1888).
Logs
Before gathering the logs, try the following to see if it fixes your issue:
$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug
The cluster starts without visible issues, without needing to delete the existing install.
With user-mode networking (crc config get network-mode), 127.0.0.1 is expected. In this configuration, the SSH port of the VM is remapped to 2222 on the host. Port 22 will connect to the host's SSH daemon, not to the VM.
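For example, the ssh command from the report should work once it is pointed at that port (same key and options, only the port changes):
ssh -i ~/.crc/machines/crc/id_ecdsa -p 2222 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@127.0.0.1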
Thanks, added port 2222 in the ssh config (which I was convinced I had already done, but apparently not). I am now able to SSH into the VM. For the record:
crc config get network-mode
network-mode : user
@cfergeau on a side note, does it make sense to add documentation describing how to use this with an ssh config entry, and perhaps even set the correct key (or add a configuration option to use the user's ssh key)? WDYT?
This is already documented in https://github.com/crc-org/crc/wiki/Debugging-guide
Running into another issue now:
crc status
CRC VM: Running
OpenShift: Unreachable (v4.13.0)
Disk Usage: 0B of 0B (Inside the CRC VM)
Cache Usage: 26.29GB
Cache Directory: /home/ubuntu/.crc/cache
So the VM is Running, but I am getting an ssh: connect to host 127.0.0.1 port 2222: Connection refused error.
Verbose output:
OpenSSH_8.9p1 Ubuntu-3ubuntu0.3, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /home/ubuntu/.ssh/config
debug1: /home/ubuntu/.ssh/config line 1: Applying options for crc
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 127.0.0.1 is address
debug3: ssh_connect_direct: entering
debug1: Connecting to 127.0.0.1 [127.0.0.1] port 2222.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: connect to address 127.0.0.1 port 2222: Connection refused
ssh: connect to host 127.0.0.1 port 2222: Connection refused
Here's the ssh config:
Host crc
Hostname 127.0.0.1
User core
Port 2222
IdentityFile ~/.crc/machines/crc/id_ecdsa
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
sudo lsof -i:2222
returns no output. Have I missed anything here?
It shows "unreachable" so something is afoot/weird; the status command itself uses SSH to check the disk usage. This returns 0
, so it feels like something crashed inside the VM (or the daemon process). Can you check what the console says from the hypervisor management tools?
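For example, something like sudo virsh console crc (assuming the libvirt domain is named crc, which is the default) or the graphical console in virt-manager should show whether anything crashed inside the VM.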
https://github.com/crc-org/crc/wiki/Debugging-guide
This is a wiki, not documentation.
It shows "unreachable" so something is afoot/weird; the status command itself uses SSH to check the disk usage. This returns
0
, so it feels like something crashed inside the VM (or the daemon process). Can you check what the console says from the hypervisor management tools?
I am seeing frequent occurrences of these log lines produced by the libvirtd.service systemd unit:
Aug 22 13:01:18 ns3220936 libvirtd[226934]: 226934: error : virNetSocketReadWire:1793 : End of file while reading data: Input/output error
However, they also occur when the OpenShift cluster is in a Running state, so I don't think this is the root cause of this issue. Since we're using QEMU / libvirt on this host, is there anything else I could check?
sudo virsh list --all
will show the state of the crc
VM.
At this point, I'd try crc stop; crc start
to see if it helps, or crc delete; crc start
if you don't mind the cluster data loss.
Another thing to try is to kill the crc daemon, and stop/start crc.
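On Linux, a rough sketch of that (assuming the crc-daemon.service user unit is what runs the daemon) would be:
systemctl --user stop crc-daemon.service
crc stop
crc start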
Is the crc daemon process still active? If it disappeared, the VM becomes unreachable.
When the daemon process is 'stopped' on Windows, it responds back with:
PS> crc status
crc does not seem to be setup correctly, have you run 'crc setup'?
but when the process is restarted it shows what you show us:
PS> crc status
CRC VM: Running
MicroShift: Running (v4.13.0)
RAM Usage: 1.139GB of 3.902GB
Disk Usage: 8.216GB of 16.1GB (Inside the CRC VM)
Cache Usage: 20.72GB
Cache Directory: C:\Users\gbraad\.crc\cache
PS> crc status
crc does not seem to be setup correctly, have you run 'crc setup'?
PS> crc setup
INFO Using bundle path C:\Users\gbraad\.crc\cache\crc_microshift_hyperv_4.13.0_amd64.crcbundle
INFO Checking minimum RAM requirements
INFO Checking if current user is in crc-users and Hyper-V admins group
INFO Checking if CRC bundle is extracted in '$HOME/.crc'
INFO Checking if C:\Users\gbraad\.crc\cache\crc_microshift_hyperv_4.13.0_amd64.crcbundle exists
INFO Checking if the daemon task is installed
INFO Checking if the daemon task is running
INFO Running the daemon task
INFO Checking admin helper service is running
Your system is correctly setup for using CRC. Use 'crc start' to start the instance
PS> ssh core@localhost -p 2222 -I C:\Users\gbraad\.crc\machines\crc\id_ecdsa.pub
ssh: connect to host localhost port 2222: Connection refused
PS> crc status
CRC VM: Running
MicroShift: Unreachable (v4.13.0)
Disk Usage: 0B of 0B (Inside the CRC VM)
Cache Usage: 20.72GB
Cache Directory: C:\Users\gbraad\.crc\cache
Note: while tested on Windows, this should have similar behaviour on Linux
On Linux the daemon is automagically restarted with socket activation. This means that when the process is gone (crashed), we just restart and provide a new instance. However, in that case no forwards have been set up; that state has been lost, as we do not record it to be re-established.
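For reference, the socket-activated units on the Linux host can be inspected with something like:
systemctl --user list-sockets
systemctl --user status crc-http.socket crc-vsock.socket crc-daemon.service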
Functionality can be restored by:
PS> crc stop
PS> crc start
PS> crc stop
INFO Stopping the instance, this may take a few minutes...
Stopped the instance
PS> crc start
WARN A new version (2.25.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.25.0/crc-windows-installer.zip
INFO Checking minimum RAM requirements
INFO Checking if running in a shell with administrator rights
INFO Checking Windows release
INFO Checking Windows edition
INFO Checking if Hyper-V is installed and operational
INFO Checking if Hyper-V service is enabled
INFO Checking if crc-users group exists
INFO Checking if current user is in crc-users and Hyper-V admins group
INFO Checking if vsock is correctly configured
INFO Checking if the daemon task is installed
INFO Checking if the daemon task is running
INFO Checking admin helper service is running
INFO Loading bundle: crc_microshift_hyperv_4.13.0_amd64...
INFO Starting CRC VM for microshift 4.13.0...
INFO CRC instance is running with IP 127.0.0.1
INFO CRC VM is running
INFO Check internal and public DNS query...
INFO Check DNS query from host...
INFO Starting Microshift service... [takes around 1min]
INFO Waiting for kube-apiserver availability... [takes around 2min]
INFO Adding microshift context to kubeconfig...
Started the MicroShift cluster.
Use the 'oc' command line interface:
PS> & crc oc-env | Invoke-Expression
PS> oc COMMAND
PS> ssh core@localhost -p 2222 -I C:\Users\gbraad\.crc\machines\crc\id_ecdsa.pub
dlopen C:\\Users\\gbraad\\.crc\\machines\\crc\\id_ecdsa.pub failed: Access is denied.
core@localhost's password: <!---- responds
Would like to understand why the daemon stopped working/restarted before we address the 'self-healing' (as that would hide the issue instead of addressing it directly).
Thanks for following up on that one. So the cluster has become Unreachable
again, not sure exactly when that happened:
crc status --log-level debug
DEBU CRC version: 2.20.0+f3a947
DEBU OpenShift version: 4.13.0
DEBU Podman version: 4.4.4
DEBU Running 'crc status'
CRC VM: Running
OpenShift: Unreachable (v4.13.0)
Disk Usage: 0B of 0B (Inside the CRC VM)
Cache Usage: 26.29GB
Cache Directory: /home/ubuntu/.crc/cache
We can see that the virsh
command output shows that crc
is running:
sudo virsh list --all
Id Name State
----------------------
1 crc running
I can also confirm that the cluster is back to a Running state if I run crc stop && crc start. I am also still trying to figure out why the disk usage shows as 0B of 0B, but I haven't found any additional hints at this point.
why the disk usage shows as 0B of 0B but I haven't found any additional hints at this point
@dabcoder Disk usage is shown from the VM's perspective, and that requires the ssh connection to be working; when it is in a Running state it will also show the correct disk usage.
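For example, once the ssh connection is back, the same figure can be checked by hand with something like (reusing the Host crc entry from the ssh config above):
ssh crc "df -B1 --output=size,used,target /sysroot"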
... when it is in a Running state it will also show the correct disk usage.
EDIT: didn't mean to close this issue. And yes indeed, it does show the current disk usage when the cluster is back in a Running state:
CRC VM: Running
OpenShift: Running (v4.13.0)
RAM Usage: 11.03GB of 30.79GB
Disk Usage: 27.53GB of 32.68GB (Inside the CRC VM)
Cache Usage: 26.29GB
Cache Directory: /home/ubuntu/.crc/cache
Could you check the daemon state when this happens? systemctl status --user crc-daemon.service
This confirms it is related to the daemon process crashing or stopping.
Is there anything in the journalctl we can check, or the crcd.log?
Or systemctl --user show crc-daemon.service -p NRestarts
to see the restart count.
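For example, the most recent daemon logs can be dumped with something like:
journalctl --user -u crc-daemon.service -n 500 --no-pager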
OK, so the daemon seems to have crashed:
crc status
CRC VM: Running
OpenShift: Unreachable (v4.13.0)
Disk Usage: 0B of 0B (Inside the CRC VM)
Cache Usage: 26.29GB
Cache Directory: /home/ubuntu/.crc/cache
VM still marked as running:
sudo virsh list --all
Id Name State
----------------------
2 crc running
The daemon isn't restarting apparently:
systemctl --user show crc-daemon.service -p NRestarts
NRestarts=0
The daemon service is still running; we can notice some errors, though:
systemctl status --user crc-daemon.service
● crc-daemon.service - CRC daemon
Loaded: loaded (/home/ubuntu/.config/systemd/user/crc-daemon.service; static)
Active: active (running) since Mon 2023-08-28 08:15:17 UTC; 1min 57s ago
...
Aug 28 08:15:24 ns3220936 crc[294020]: W0828 08:15:24.584550 294020 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Aug 28 08:15:24 ns3220936 crc[294020]: W0828 08:15:24.761787 294020 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Aug 28 08:15:30 ns3220936 crc[294020]: @ - - [28/Aug/2023:08:15:30 +0000] "GET /api/status HTTP/1.1" 200 154
Perhaps some interesting hints in the crcd.log
file?
time="2023-08-28T08:15:30Z" level=debug msg="Cannot get root partition usage: ssh command error:\ncommand : df -B1 --output=size,used,target /sysroot | tail -1\nerr : dial tcp 127.0.0.1:2222: connect: connection refused\n"
time="2023-08-28T08:15:30Z" level=debug msg="Cannot get RAM usage: ssh command error:\ncommand : awk '/^Mem/ {print $2,$3}' <(free -b)\nerr : dial tcp 127.0.0.1:2222: connect: connection refused\n"
Hope that it's useful. Thanks.
NRestarts=0
Either the service was 'stopped' (e.g. socket activation has a timeout), or a crash does not cause socket activation to record this as a restart:
As seems clear from
Active: active (running) since Mon 2023-08-28 08:15:17 UTC; 1min 57s ago
... an uptime of mere minute(s) would not have allowed the cluster to fully start otherwise; this was restarted.
The error messages describe the effects, as the 'connection refused' makes sense when there is no forwarding anymore.
Either the service was 'stopped' (e.g. socket activation has a timeout), or a crash does not cause socket activation to record this as a restart
So this is what's set in the daemon service systemd unit file:
cat /home/ubuntu/.config/systemd/user/crc-daemon.service
[Unit]
Description=CRC daemon
Requires=crc-http.socket
Requires=crc-vsock.socket
[Service]
# This allows systemd to know when startup is not complete (for example, because of a preflight failure)
# daemon.SdNotify(false, daemon.SdNotifyReady) must be called before the startup is successful
Type=notify
ExecStart=/home/ubuntu/.crc/bin/crc daemon
And this is what we have for each unit: crc-http.socket
and crc-vsock.socket
:
cat /home/ubuntu/.config/systemd/user/crc-http.socket
[Unit]
Description=CRC HTTP socket
[Socket]
ListenStream=%h/.crc/crc-http.sock
Service=crc-daemon.service
[Install]
WantedBy=default.target
cat /home/ubuntu/.config/systemd/user/crc-vsock.socket
[Unit]
Description=CRC vsock socket
[Socket]
ListenStream=vsock::1024
Service=crc-daemon.service
[Install]
WantedBy=default.target
Is there anything specific we could do there, in order to avoid the daemon being stopped?
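For instance (untested on my side), would a user-level drop-in with a restart policy make sense, even though a restarted daemon would not re-establish the port forwards, as noted above?
mkdir -p ~/.config/systemd/user/crc-daemon.service.d
cat > ~/.config/systemd/user/crc-daemon.service.d/override.conf <<'EOF'
[Service]
# untested idea: restart the daemon if it exits unexpectedly
Restart=on-failure
RestartSec=5
EOF
systemctl --user daemon-reload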
@dabcoder what version of systemd does your Ubuntu have? Also, can you provide the journalctl logs for the service: journalctl --user -u crc-daemon?
@dabcoder what version of systemd does your Ubuntu have? Also, can you provide the journalctl logs for the service: journalctl --user -u crc-daemon?
systemd --version
systemd 249 (249.11-0ubuntu3.7)
journalctl logs attached (last 500 lines).
crc-daemon.log
@dabcoder As per the daemon logs there is a continuous warning around endpoint not connected, which is coming from https://github.com/google/gvisor/blob/master/pkg/tcpip/errors.go#L448; will inspect what is happening and ask for more info.
Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="new connection from vm(3):2383898004 to vm(4294967295):1024"
Sep 04 13:21:56 ns3220936 crc[356229]: W0904 13:21:56.762835 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:56 ns3220936 crc[356229]: W0904 13:21:56.853472 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.309752 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.346149 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.804982 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.831189 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.850099 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:57 ns3220936 crc[356229]: W0904 13:21:57.921616 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 13:21:58 ns3220936 crc[356229]: W0904 13:21:58.009339 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
The error messages describe the effects, as the 'connection refused' makes sense when there is no forwarding anymore. https://github.com/crc-org/crc/issues/3794#issuecomment-1698522621
We do not have an understanding yet of why the daemon restarts/crashes. This is what we need to figure out. It is not systemd-version related, as the behaviour of the service is as expected; it (re)starts on a socket request.
endpoint not connected
This is understandable as the forward is gone... and every status call will do this (as you can see, every 2 secs).
You also run podman desktop with the extension, right?
Would be interesting to see if there are logs from around the time it crashed, though there also seem to be 'legitimate' restarts of the daemon.
Can you try something like journalctl --user -u crc-daemon.service |grep -B 50 "Starting CRC daemon"?
Reading part of the log does raise eyebrows:
Aug 30 16:04:38 ns3220936 crc[327745]: W0830 16:04:38.708015 327745 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Aug 30 16:12:56 ns3220936 systemd[327429]: Stopping CRC daemon...
Aug 30 16:12:56 ns3220936 systemd[327429]: Stopped CRC daemon.
Aug 30 16:12:56 ns3220936 systemd[327429]: crc-daemon.service: Consumed 2min 6.169s CPU time.
Sep 04 13:20:53 ns3220936 systemd[356017]: Starting CRC daemon...
Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="using socket provided by crc-vsock.socket"
Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="using socket provided by crc-http.socket"
Sep 04 13:20:53 ns3220936 systemd[356017]: Started CRC daemon.
Sep 04 13:20:53 ns3220936 crc[356229]: level=info msg="new connection from vm(3):2383898004 to vm(4294967295):1024"
Sep 04 13:21:56 ns3220936 crc[356229]: W0904 13:21:56.762835 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
it shows the errors and a clean 'stopped' about 8 mins later. This might be user interaction?
But the one started after this fails very quickly, like within a minute?
You also run podman desktop with the extension, right?
We're using a recent version of podman for Ubuntu which we installed following these instructions: https://podman.io/docs/installation#ubuntu (so not podman desktop).
Version: 3.4.4
API Version: 3.4.4
Go Version: go1.18.1
Built: Thu Jan 1 00:00:00 1970
OS/Arch: linux/amd64
Can you try something like journalctl --user -u crc-daemon.service |grep -B 50 "Starting CRC daemon"?
Just produced the output and we can indeed see:
Sep 04 17:24:51 ns3220936 crc[356229]: W0904 17:24:51.075500 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:24:57 ns3220936 crc[356229]: W0904 17:24:57.018924 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:24:57 ns3220936 crc[356229]: W0904 17:24:57.449783 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:41:06 ns3220936 crc[356229]: W0904 17:41:06.743005 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:41:07 ns3220936 crc[356229]: W0904 17:41:07.262890 356229 gonet.go:457] ep.GetRemoteAddress() failed: endpoint not connected
Sep 04 17:44:55 ns3220936 systemd[356017]: Stopping CRC daemon...
Sep 04 17:44:55 ns3220936 systemd[356017]: Stopped CRC daemon.
Sep 04 17:44:55 ns3220936 systemd[356017]: crc-daemon.service: Consumed 1min 31.444s CPU time.
Sep 06 08:17:01 ns3220936 systemd[376139]: Starting CRC daemon...
This 'Stopping CRC daemon' action on the 4th of September at 5.44pm UTC wasn't manual or user-initiated, so to speak. It seems to me that the daemon stopped and only restarted on the 6th of September (without user interaction, since I didn't start it earlier today).
it shows the errors and a clean 'stopped' about 8 mins later. This might be user interaction?
Aug 30 16:12:56 ns3220936 systemd[327429]: Stopping CRC daemon...
Aug 30 16:12:56 ns3220936 systemd[327429]: Stopped CRC daemon.
Aug 30 16:12:56 ns3220936 systemd[327429]: crc-daemon.service: Consumed 2min 6.169s CPU time.
Sep 04 13:20:53 ns3220936 systemd[356017]: Starting CRC daemon...
Those ones were user related, yes, manual stop and start.
But the one started after this fails very quickly, like within a minute?
I didn't actually check the OpenShift state at that time, whether it was Running or Unreachable, unfortunately.
Is there any Ubuntu security mechanism which could be causing the daemon failure or the endpoint connection issue? Tomorrow I will try to set up an Ubuntu system and see if I am able to reproduce.
Is there any Ubuntu security mechanism which could be causing the daemon failure or the endpoint connection issue? Tomorrow I will try to set up an Ubuntu system and see if I am able to reproduce.
No I don't think so. I couldn't find anything that would cause this or interfere with any (local) connection.