
k3s-agent fails to start with embedded registry and kills the entire OS

Open ElectroshockGuy opened this issue 9 months ago • 3 comments

Describe the bug:

I am experiencing an issue where the k3s-agent fails to start properly. During the startup process, the file /var/lib/rancher/k3s/agent/containerd/peer.key is generated but its content is empty, which is quite unusual. When I attempt to delete the /var/lib/rancher/k3s/agent/containerd/peer.key file and then restart the k3s-agent, the system immediately freezes and then reboots.

Environmental Info:

K3s Version: v1.28.9+k3s1

Node(s) CPU architecture, OS, and Version:

cpu: 16 os: ubuntu 24.04 (kairos)

Cluster Configuration: 2 servers, 1 agent

Steps To Reproduce:

  • Enable --embedded-registry on the server
  • Add a worker node
  • On the worker's OS, start k3s-agent.service

Additional context / logs:

May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Using private registry config file at /etc/rancher/k3s/registries.yaml"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Module overlay was already loaded"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Module nf_conntrack was already loaded"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Module br_netfilter was already loaded"
May 15 06:36:21 node7 k3s[1819]: E0515 06:36:21.555610    1819 remote_runtime.go:294] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\"" filter="nil"
May 15 06:36:21 node7 k3s[1819]: E0515 06:36:21.555713    1819 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
May 15 06:36:21 node7 k3s[1819]: E0515 06:36:21.555778    1819 generic.go:238] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_max' to 524288"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/ipv4/conf/all/forwarding' to 1"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Starting distributed registry mirror at https://10.11.111.63:6443/v2 for registries [docker.io registry.k8s.io]"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=fatal msg="failed to start embedded registry: failed to load or generate p2p private key: error loading key from /var/lib/rancher/k3s/agent/containerd/peer.key: <nil>"
May 15 06:36:21 node7 systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ An ExecStart= process belonging to unit k3s-agent.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
May 15 06:36:21 node7 systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit k3s-agent.service has entered the 'failed' state with result 'exit-code'.
May 15 06:36:21 node7 systemd[1]: k3s-agent.service: Consumed 1.283s CPU time, 200.0M memory peak, 0B memory swap peak.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit k3s-agent.service completed and consumed the indicated resources.
May 15 06:36:26 node7 systemd[1]: k3s-agent.service: Scheduled restart job, restart counter is at 1.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ Automatic restarting of the unit k3s-agent.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
May 15 06:36:26 node7 systemd[1]: Starting k3s-agent.service - Lightweight Kubernetes...
░░ Subject: A start job for unit k3s-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit k3s-agent.service has begun execution.
░░ 
░░ The job identifier is 1173.
May 15 06:36:26 node7 sh[2387]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
May 15 06:36:27 node7 k3s[2397]: time="2024-05-15T06:36:27Z" level=info msg="Starting k3s agent v1.28.9+k3s1

ElectroshockGuy avatar May 15 '24 08:05 ElectroshockGuy

I am experiencing the same thing :( My guess is that the system reboots for some reason during key generation, so the generated key never gets a chance to be written to disk, which results in the errors above.

liyimeng avatar May 16 '24 09:05 liyimeng

During the startup process, the file /var/lib/rancher/k3s/agent/containerd/peer.key is generated but its content is empty

May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=fatal msg="failed to start embedded registry: failed to load or generate p2p private key: error loading key from /var/lib/rancher/k3s/agent/containerd/peer.key: <nil>"

This error is coming from https://github.com/rancher/dynamiclistener/blob/e590d58b896cc8dd33dde7cec80c52e23ec08189/cert/io.go#L89 - the message suggests that the file was created by a previous startup of k3s, but for some reason the file contents have been lost. Your best bet is probably to just delete the file from disk and let it be recreated on startup. You might be able to find other errors in the logs to suggest why the file has no contents or its contents are corrupted, but given that this node is also rebooting unexpectedly, I suspect that you may have lost data from your filesystem when the system crashed.

When I attempt to delete the /var/lib/rancher/k3s/agent/containerd/peer.key file and then restart the k3s-agent, the system immediately freezes and then reboots.

That sounds like a problem with your node; K3s shouldn't be capable of doing anything that would cause it to panic and reboot. You'll need to figure that out on your own.

brandond avatar May 16 '24 17:05 brandond

@brandond I agree with you! I managed to switch to an OpenRC system and tested the same k3s version, and everything works as expected. systemd seems to be playing the devil here. :(

liyimeng avatar May 17 '24 03:05 liyimeng

Strange: after rolling back to 1.28.6, it runs with no issues.

liyimeng avatar May 18 '24 09:05 liyimeng

I have found another potential cause. As I understand it, when running under systemd, the cgroup driver should be systemd; however, I found that k3s mistakenly selects cgroupfs. I am not sure whether this is the issue.

liyimeng avatar May 20 '24 06:05 liyimeng

I'm not aware of any defect in k3s that would cause it to use cgroupfs instead of systemd, when using the embedded containerd on a systemd-based OS. You're not trying to use docker or another user-provided container runtime, are you?

brandond avatar May 20 '24 19:05 brandond

No, I use Kairos from https://github.com/kairos-io/kairos/, which should have no other runtime available. In addition, I added some extra debug output and found:

WARN[0002] isRunningInUserNS=false, cgroup controller map[cpu:true cpuset:true hugetlb:true io:true memory:true misc:true pids:true rdma:true], INVOCATION_ID=

INVOCATION_ID is empty, so something is going wrong with systemd, which should set this value.

This is very likely a systemd issue in their distribution, so I will raise it loudly there. :D
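As a rough illustration of why an empty INVOCATION_ID matters: systemd sets this variable for the processes of a service unit, so a program can use it as a hint that it was started by systemd. The sketch below combines the three values from the debug output into a driver choice. `chooseCgroupDriver` and its exact rule (not in a user namespace, cpuset controller present, INVOCATION_ID non-empty) are assumptions made for this example, not a copy of the k3s implementation.

```go
package main

import (
	"fmt"
	"os"
)

// chooseCgroupDriver returns "systemd" only when the process does not run in
// a user namespace, the cpuset controller is available, and INVOCATION_ID is
// set. ASSUMPTION: this rule is illustrative and may differ from what k3s does.
func chooseCgroupDriver(inUserNS bool, controllers map[string]bool, invocationID string) string {
	if !inUserNS && controllers["cpuset"] && invocationID != "" {
		return "systemd"
	}
	return "cgroupfs"
}

func main() {
	// Controller map as printed in the debug output above.
	controllers := map[string]bool{
		"cpu": true, "cpuset": true, "hugetlb": true, "io": true,
		"memory": true, "misc": true, "pids": true, "rdma": true,
	}
	// Reads the real environment; empty when not started as a systemd unit.
	id := os.Getenv("INVOCATION_ID")
	fmt.Printf("INVOCATION_ID=%q -> %s\n", id, chooseCgroupDriver(false, controllers, id))
}
```

Under this assumed rule, the values in the debug output (all controllers present, not in a user namespace, INVOCATION_ID empty) would select cgroupfs, which matches the behavior reported above.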

liyimeng avatar May 21 '24 02:05 liyimeng