microk8s.daemon-kubelite produces tons of error logs on all nodes
Summary
We have a 4-node MicroK8s HA cluster that has been running for 2 years. Recently we found that the "microk8s.daemon-kubelite" service on all nodes produces tons of error logs like this:
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.421789 205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.668481 205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.691204 205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.900976 205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.949677 205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
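The errors arrive several times per second. A quick way to gauge the rate on a node (assuming journald is capturing the service logs, as it does by default):

journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep -c "service account token has expired"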
I tried restarting this service by running systemctl restart snap.microk8s.daemon-kubelite, but it did not help. I also searched for this error message around the web but did not find anything helpful.
All pods seem to be running fine, and I am still able to update our deployments (though updates are much slower than before).
Can someone help me resolve this problem?
Cluster status:
root@svr02:~# microk8s.status
microk8s is running
high-availability: yes
datastore master nodes: 172.16.40.232:19001 172.16.40.231:19001 172.16.40.233:19001
datastore standby nodes: 172.16.218.180:19001
addons:
enabled:
dns # CoreDNS
ha-cluster # Configure high availability on the current node
ingress # Ingress controller for external access
metrics-server # K8s Metrics Server for API access to service metrics
prometheus # Prometheus operator for monitoring and logging
rbac # Role-Based Access Control for authorisation
storage # Storage class; allocates storage from host directory
microk8s inspect:
root@svr02:~# microk8s inspect
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy current linux distribution to the final report tarball
Copy openSSL information to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting juju
Inspect Juju
Inspecting kubeflow
Inspect Kubeflow
Inspecting dqlite
Inspect dqlite
Building the report tarball
Report tarball is at /var/snap/microk8s/4916/inspection-report-20240924_162747.tar.gz
Hello @PRNDA,
thank you for reporting your issue to us.
Could you please upload the inspection report that you created under /var/snap/microk8s/4916/inspection-report-20240924_162747.tar.gz? With this information we can better assist you in resolving the issue.
Thank you!
I created this inspection report yesterday, but I found some sensitive information in the logs, so I decided not to upload it here. Is there a way I can send it to you privately?
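In case it helps others in the same situation, a rough way to check a report for sensitive data before uploading it (the grep patterns here are only examples):

mkdir report
tar -xzf /var/snap/microk8s/4916/inspection-report-20240924_162747.tar.gz -C report
grep -rIl -e 'BEGIN .*PRIVATE KEY' -e 'token' report
# redact any flagged files, then re-pack:
tar -czf inspection-report-redacted.tar.gz -C report .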
Hi @PRNDA,
how would you prefer to share it? Would you be able to upload the inspection report somewhere we could pull it from?
Hi @louiseschmidtgen ,
I created a private repo here and uploaded the inspection file into it. Could you please accept my repo invitation and then download the inspection file from there?
Sorry for the inconvenience.
Hello @PRNDA ,
I have received your invitation and have access to the logs.
Thank you for sharing the inspection report, I will be having a look shortly.
Linking this issue as possibly related: https://github.com/canonical/microk8s/issues/4293
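In the meantime, one way to narrow down which client is presenting the expired tokens is to enable API server audit logging, since rejected requests are recorded together with the caller's source IP and user agent. A rough sketch, assuming the stock MicroK8s argument paths:

cat <<'EOF' > /var/snap/microk8s/current/args/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
EOF
echo '--audit-policy-file=/var/snap/microk8s/current/args/audit-policy.yaml' >> /var/snap/microk8s/current/args/kube-apiserver
echo '--audit-log-path=/var/snap/microk8s/common/var/log/apiserver-audit.log' >> /var/snap/microk8s/current/args/kube-apiserver
systemctl restart snap.microk8s.daemon-kubelite
# rejected requests show up with "code":401 along with sourceIPs and userAgent
grep '"code":401' /var/snap/microk8s/common/var/log/apiserver-audit.log

If that points at an in-cluster client such as the monitoring stack, restarting those pods (for example microk8s kubectl -n monitoring rollout restart statefulset prometheus-k8s; the name is assumed from the stock prometheus addon) may clear stale tokens, since some older clients read the mounted token only once at startup.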
Hello @PRNDA,
are you able to reproduce this issue on a more recent MicroK8s snap? You are currently running on v1.23 which is out of support.
With kind regards, Louise
I'm afraid I cannot; this is a production system, and I'm not allowed to upgrade it.
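(For completeness, the installed version and tracking channel can be confirmed with snap list microk8s, or snap info microk8s | grep tracking.)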
Have you tried deleting Calico-Node pods?
Will this interrupt the running pods?
Deleting the Calico-Node pods should not interrupt the execution of other pods, as Kubernetes will automatically re-schedule new Calico-Node pods to maintain network connectivity. However, there might be a temporary disruption in pod networking while the new Calico pods start.
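If you do try it, a gentler option than deleting all of the pods at once is a rolling restart of the DaemonSet, which replaces them one node at a time (assuming the stock MicroK8s setup, where calico-node is a DaemonSet in kube-system):

microk8s kubectl -n kube-system rollout restart daemonset/calico-node
microk8s kubectl -n kube-system rollout status daemonset/calico-node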
"There might be a temporary disruption in pod networking"
That's what I'm worried about. As this cluster runs several online systems, I don't want them to be affected.
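If I do end up trying this, I will probably go one pod at a time and wait for each replacement to become Ready before touching the next, something along these lines (a sketch; the k8s-app=calico-node label is the stock Calico one):

for pod in $(microk8s kubectl -n kube-system get pods -l k8s-app=calico-node -o name); do
  microk8s kubectl -n kube-system delete "$pod"
  microk8s kubectl -n kube-system wait --for=condition=Ready pod -l k8s-app=calico-node --timeout=120s
done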