
k3s log fills up my disk in a short time

liyimeng opened this issue 3 years ago · 5 comments

Environmental Info:

K3s Version: 1.25.6

Node(s) CPU architecture, OS, and Version:

x86_64, Ubuntu 22.04

Cluster Configuration:

3 servers, 3 nodes

Describe the bug:

I had a freshly installed cluster running for 3-4 days when, suddenly, the disk on one of the masters was filled up by k3s-service.log, which keeps printing:

msg="Failed to test temporary data store connection: failed to dial endpoint http://127.0.0.1:2399 with maintenance client: context canceled"

Millions of lines of this text make k3s-service.log grow to hundreds of GB in a couple of hours.

Steps To Reproduce:

  • Installed K3s: install 1.25.6

Expected behavior:

Cluster nodes keep running stably.

Actual behavior:

One of the masters gets filled up with massive log output, which eventually kills the node.

Additional context / logs:

It keeps printing:

msg="Failed to test temporary data store connection: failed to dial endpoint http://127.0.0.1:2399 with maintenance client: context canceled"

liyimeng commented Mar 21 '23 09:03

You'll need to provide more than just the one repeating log message. Can you go back in the logs to just before that message started repeating, or perhaps just stop k3s, clean up the logs, and then start it again so that you can get the logs from the beginning of startup onwards?
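Roughly something like this (a sketch; adjust the service name and log path to match however you installed k3s):

```sh
# stop the service, clear the oversized log, and restart to capture a clean startup
rc-service k3s-service stop              # or: systemctl stop k3s
truncate -s 0 /var/log/k3s-service.log   # reclaim the disk space
rc-service k3s-service start             # or: systemctl start k3s
tail -f /var/log/k3s-service.log         # watch the logs from the beginning of startup
```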

You might also confirm that nothing else obvious has gone wrong with this host, such as running out of disk space.

brandond commented Mar 21 '23 19:03

@brandond Thanks for the attention! Yes, I know the log provides no clue here. By the time I saw the issue, the log file was 400GB+, so it was impossible to see the beginning of the log. I restarted the service to collect the logs again, but the problem was gone once I did, so I lost the chance to collect a meaningful log. Could this be something going wrong with the embedded etcd?

Btw, my friend said he experienced the same thing on 1.23.10; rebooting the node made the problem go away.

I will try to collect a meaningful log when it occurs again.

liyimeng commented Mar 22 '23 09:03

It is happening again. I observe that more than one k3s server instance is running on the node, even though I have stopped the k3s-service.

```
ps -ef | grep server | grep k3s
root     11974     1 99 17:54 ?        00:11:03 /sbin/k3s server
root     15326     1 99 16:15 ?        03:30:54 /sbin/k3s server
root     27884     1 47 16:14 ?        00:50:24 /sbin/k3s server
root     32143     1 99 17:50 ?        00:18:19 /sbin/k3s server
```

My system uses openrc to start the service. On a normal node, I have:

```
ps -ef | grep server | grep k3s
root     37587     1  0 13:49 ?        00:00:00 supervise-daemon k3s-service --start --stdout /var/log/k3s-service.log --stderr /var/log/k3s-service.log --pidfile /var/run/k3s-service.pid --respawn-delay 5 --respawn-max 0 /sbin/k3s -- server --disable servicelb --server https://kubernetes --node-external-ip 172.27.13.170 --protect-kernel-defaults=true --secrets-encryption=true --kube-apiserver-arg=audit-policy-file=/var/lib/rancher/k3s/server/audit.yaml --kube-apiserver-arg=audit-log-path=/var/lib/rancher/k3s/server/audit/audit.log --kube-apiserver-arg=audit-log-maxage=30 --kube-apiserver-arg=audit-log-maxbackup=10 --kube-apiserver-arg=audit-log-maxsize=100 --kube-apiserver-arg=request-timeout=300s --kube-apiserver-arg=service-account-lookup=true --kube-apiserver-arg=enable-admission-plugins=NodeRestriction,PodSecurity,NamespaceLifecycle,ServiceAccount --kube-apiserver-arg=feature-gates=MemoryQoS=true,PodSecurity=true --kube-controller-manager-arg=terminated-pod-gc-threshold=10 --kube-controller-manager-arg=use-service-account-credentials=true --kubelet-arg=streaming-connection-idle-timeout=5m --kubelet-arg=make-iptables-util-chains=true --node-label k3os.io/mode=local --node-label k3os.io/version=0404260
root     37588 37587 27 13:49 ?        01:09:30 /sbin/k3s server
```

For some reason, the k3s-service script does not actually kill the '/sbin/k3s server' processes. The leftover processes conflict with each other, racing to write to the log file, hence accumulating GBs of logs in a couple of minutes.
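For now I clear the strays by hand before restarting. A rough sketch (verify what the pattern matches before sending signals):

```sh
pgrep -af '/sbin/k3s server'   # list the leftover server processes with their command lines
pkill -f '/sbin/k3s server'    # terminate them (escalate to -9 only if they ignore SIGTERM)
```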

@brandond Is there any chance we can improve create_openrc_service_file() in install.sh to make it robust and prevent such a situation from happening? Something along the lines of the sketch below.
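A minimal sketch of the kind of guard I mean, assuming the generated script keeps using supervise-daemon; the stop_post hook and its match pattern are illustrative, not the upstream fix:

```sh
#!/sbin/openrc-run
# Sketch: supervise k3s as before, but reap any orphaned server
# processes on stop so a restart never races a leftover instance
# for the log file.

depend() {
    after network-online
}

supervisor=supervise-daemon
name=k3s-service
command="/sbin/k3s"
command_args="server"
output_log=/var/log/k3s-service.log
error_log=/var/log/k3s-service.log
pidfile=/var/run/k3s-service.pid
respawn_delay=5
respawn_max=0

stop_post() {
    # belt and braces: kill anything still matching the server command line
    pkill -f '^/sbin/k3s server' 2>/dev/null
    return 0
}
```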

liyimeng commented Mar 27 '23 10:03

@liyimeng is this still an issue for you? I see the open PR, but it's been some time without an update. Thanks!

caroline-suse-rancher commented Jan 05 '24 21:01

@caroline-suse-rancher Thanks for your attention! I have been using the solution in the PR to work around this problem, and so far so good. Not sure if it can help others.

liyimeng commented Jan 06 '24 20:01