
microk8s add-node fails (without error)

Open · AnotherStranger opened this issue 7 months ago · 1 comment

Summary

We tried to create a small microk8s cluster from two existing Ubuntu systems for testing purposes. There appears to be an issue where nodes cannot join the cluster: microk8s keeps crashing on the second node after it attempts to join. We had a similar problem with two other systems before and could resolve it by reinstalling a fresh Ubuntu Server image; however, the problem keeps reappearing.

What Should Happen Instead?

Microk8s nodes should be able to successfully join a cluster, and microk8s should not crash on the second node. Workloads should be able to start without pods getting stuck in the "ContainerCreating" state.

Reproduction Steps

  1. Take two existing Ubuntu PCs and install snapd and microk8s (channel 1.29/stable with classic confinement) on both.
  2. Try to join the nodes using microk8s add-node.
  3. Run the join command on the second node. It finishes successfully:
    microk8s join 192.168.0.100:25000/<redacted>
    Contacting cluster at 192.168.0.100
    Waiting for this node to finish joining the cluster. .. .. .. ..  
    Successfully joined the cluster.
    
  4. Notice that kubectl get no on the first node does not list the new node, indicating that the node has not actually joined the cluster:
    kubectl get no
    NAME   STATUS   ROLES    AGE   VERSION
    ws15   Ready    <none>   56m   v1.29.4
    
  5. Observe that microk8s keeps crashing on the second node; based on the journalctl logs, kubelite appears to be the culprit:
    Jul 23 12:28:58 ws14 microk8s.daemon-kubelite[68619]: W0723 12:28:58.605269   68619 reflector.go:539] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PodDisruptionBudget: Get "https://127.0.0.1:16443/apis/policy/v1/poddisruptionbudgets?limit=500&resourceVersion=0": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "10.152.183.1")
    Jul 23 12:28:58 ws14 microk8s.daemon-kubelite[68619]: E0723 12:28:58.605368   68619 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: Get "https://127.0.0.1:16443/apis/policy/v1/poddisruptionbudgets?limit=500&resourceVersion=0": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "10.152.183.1")
    Jul 23 12:28:58 ws14 microk8s.daemon-kubelite[68619]: E0723 12:28:58.748636   68619 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
    Jul 23 12:28:58 ws14 microk8s.daemon-kubelite[68619]: E0723 12:28:58.948778   68619 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
    Jul 23 12:28:59 ws14 microk8s.daemon-kubelite[68619]: E0723 12:28:59.149109   68619 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
    Jul 23 12:28:59 ws14 microk8s.daemon-kubelite[68619]: E0723 12:28:59.265682   68619 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
    
  6. On the master node, notice some kubelite errors as well, even though microk8s status returns just fine:
    ./inspection-report/snap.microk8s.daemon-kubelite/journal.log:Jul 23 11:41:55 ws15 microk8s.daemon-kubelite[647979]: W0723 11:41:55.632296  647979 logging.go:59] [core] [Channel #135 SubChannel #136] grpc: addrConn.createTransport failed to connect to {Addr: "unix:///var/snap/microk8s/6809/var/kubernetes/backend/kine.sock:12379", ServerName: "kine.sock:12379", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/snap/microk8s/6809/var/kubernetes/backend/kine.sock:12379: connect: connection refused"
    ./inspection-report/snap.microk8s.daemon-kubelite/journal.log:Jul 23 11:41:57 ws15 microk8s.daemon-kubelite[647979]: W0723 11:41:57.631401  647979 logging.go:59] [core] [Channel #72 SubChannel #73] grpc: addrConn.createTransport failed to connect to {Addr: "unix:///var/snap/microk8s/6809/var/kubernetes/backend/kine.sock:12379", ServerName: "kine.sock:12379", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/snap/microk8s/6809/var/kubernetes/backend/kine.sock:12379: connect: connection refused"
    
  7. When trying to start a workload on the master node, the pod gets stuck in the "ContainerCreating" state, and kubectl describe pod shows events indicating that the pod sandbox changed and will be killed and re-created:
    Events:
      Type    Reason          Age                   From     Message
      ----    ------          ----                  ----     -------
      Normal  SandboxChanged  3m41s (x50 over 13m)  kubelet  Pod sandbox changed, it will be killed and re-created.
    
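The "certificate signed by unknown authority" and "invalid bearer token" errors in step 5 suggest the second node may still be validating requests against its own pre-join CA rather than the cluster's. A quick way to check is to compare the cluster CA fingerprint on both nodes (a sketch; the certificate path below is the default location used by the microk8s snap and is an assumption for this setup):

```shell
# Print the SHA-256 fingerprint of the cluster CA certificate.
# Run on both nodes; after a successful join the fingerprints must match.
# CA_PATH is the microk8s snap's default certificate location (assumed);
# prefix with sudo if the file is only root-readable.
CA_PATH="${CA_PATH:-/var/snap/microk8s/current/certs/ca.crt}"
openssl x509 -in "$CA_PATH" -noout -fingerprint -sha256
```

If the fingerprints differ, the joining node never picked up the cluster CA, which would explain both the TLS failures and the rejected service-account tokens.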

Introspection Report

  • Start (fresh install of microk8s with snap remove --purge)
  • State after running microk8s join from the second node
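For reference, the "fresh install" baseline above was produced along these lines (a sketch reconstructed from the steps in this report, using the same channel and confinement as step 1):

```shell
# Remove microk8s together with all of its state, then reinstall from
# the channel used in the reproduction steps (1.29/stable, classic).
sudo snap remove --purge microk8s
sudo snap install microk8s --classic --channel=1.29/stable
# Wait until all services report ready before attempting a join.
sudo microk8s status --wait-ready
```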

Additional System Info

  • Setup: We created a small network using a router and configured both machines with static IP addresses. Both machines can access the internet via NAT.
  • uname -a:
    Linux ws15 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
    
  • Release:
    Distributor ID:	Ubuntu
    Description:	Ubuntu 22.04.4 LTS
    Release:	22.04
    Codename:	jammy
    
  • sudo apt show snapd:
    Package: snapd
    Version: 2.63+22.04
    Built-Using: apparmor (= 3.0.4-2ubuntu2.4), libcap2 (= 1:2.44-1ubuntu0.22.04.1), libseccomp (= 2.5.3-2ubuntu2)
    Priority: optional
    Section: devel
    Origin: Ubuntu
    Maintainer: Ubuntu Developers <[email protected]>
    Bugs: https://bugs.launchpad.net/ubuntu/+filebug
    Installed-Size: 104 MB
    Depends: adduser, apparmor (>= 2.10.95-0ubuntu2.2), ca-certificates, fuse3 (>= 3.10.5-1) | fuse, openssh-client, squashfs-tools, systemd, udev, default-dbus-session-bus | dbus-session-bus, libc6 (>= 2.34), libfuse3-3 (>= 3.2.3), liblzma5 (>= 5.1.1alpha+20120614), liblzo2-2 (>= 2.02), libudev1 (>= 183), zlib1g (>= 1:1.1.4)
    Recommends: gnupg
    Suggests: zenity | kdialog
    Conflicts: snap (<< 2013-11-29-1ubuntu1)
    Breaks: snap-confine (<< 2.23), snapd-xdg-open (<= 0.0.0), ubuntu-core-launcher (<< 2.22), ubuntu-snappy (<< 1.9), ubuntu-snappy-cli (<< 1.9)
    Replaces: snap-confine (<< 2.23), snapd-xdg-open (<= 0.0.0), ubuntu-core-launcher (<< 2.22), ubuntu-snappy (<< 1.9), ubuntu-snappy-cli (<< 1.9)
    Homepage: https://github.com/snapcore/snapd
    Task: server-minimal, ubuntu-desktop-minimal, ubuntu-desktop, cloud-image, ubuntu-desktop-raspi, ubuntu-wsl, server, ubuntu-server-raspi, kubuntu-desktop, xubuntu-core, xubuntu-desktop, lubuntu-desktop, ubuntustudio-desktop-core, ubuntustudio-desktop, ubuntukylin-desktop, ubuntu-mate-core, ubuntu-mate-desktop, ubuntu-budgie-desktop, ubuntu-budgie-desktop-raspi
    Download-Size: 25,9 MB
    APT-Manual-Installed: yes
    APT-Sources: http://de.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
    

Can you suggest a fix?

Unfortunately no.

Are you interested in contributing with a fix?

I will gladly help, if I can.

AnotherStranger · Jul 23 '24 11:07