Attempts to join workers into the cluster that are already members
I've migrated a small cluster from 5 old VMs to 10 new VMs by first adding the 10 new nodes and then removing the old ones. I originally intended to migrate data from the old nodes to the new ones, but later concluded that this wouldn't be possible due to network issues between the old and new nodes.
So what I did was remove the old controller nodes by removing their EtcdMembers, then their Node objects, and finally running k0s stop and k0s reset. I then re-ran k0sctl with the 5 old nodes removed (to update certificates). It updated the controller+worker nodes just fine, but then went on to attempt to join the remaining 7 worker nodes into the cluster, even though they are already members. Not only that, it eventually fails to do so while seemingly looking for /var/k0s/pki/admin.conf, which never appears (sudo -- test -e /var/k0s/pki/admin.conf in the logs).
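Roughly, the per-node removal looked like this (a sketch from memory; removing the EtcdMember objects was done separately via the API and isn't shown):

# from a machine with admin access: remove the old node object
kubectl delete node <old-node-name>
# on the old node itself: stop the service and wipe the k0s state
sudo k0s stop
sudo k0s reset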
So the end result is that I now have an operational cluster, but I'm unable to manage it with k0sctl, since it repeatedly attempts to join the workers into the cluster and fails.
So, I assume you see this for the workers:
log.Infof("%s: checking if worker %s has joined", p.leader, h.Metadata.Hostname)
After looking for admin.conf, k0sctl should fall back to using kubelet.conf for running kubectl:
// KubeconfigPath returns the path to a kubeconfig on the host
func (l *Linux) KubeconfigPath(h os.Host, dataDir string) string {
    linux := &os.Linux{}
    // if admin.conf exists, use that
    adminConfPath := path.Join(dataDir, "pki/admin.conf")
    if linux.FileExist(h, adminConfPath) {
        return adminConfPath
    }
    return path.Join(dataDir, "kubelet.conf")
}

// KubectlCmdf returns a command line in sprintf manner for running kubectl on the host using the kubeconfig from KubeconfigPath
func (l *Linux) KubectlCmdf(h os.Host, dataDir, s string, args ...interface{}) string {
    return fmt.Sprintf(`env "KUBECONFIG=%s" %s`, l.KubeconfigPath(h, dataDir), l.K0sCmdf(`kubectl %s`, fmt.Sprintf(s, args...)))
}
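For illustration (assuming the default data dir /var/lib/k0s, the k0s binary available as k0s, and no admin.conf present, as is normal on a worker), the command line this produces looks roughly like:

env "KUBECONFIG=/var/lib/k0s/kubelet.conf" k0s kubectl get node <hostname> -o json

(the exact k0s invocation comes from K0sCmdf, so treat the shape above as approximate).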
I don't think the problem is the missing admin.conf (its absence is expected on a worker node).
I think what happens is that k0sctl does not see the nodes report a Ready status and therefore thinks they are new nodes.
If you run something like this on the workers:
sudo env KUBECONFIG=/var/lib/k0s/kubelet.conf k0s --data-dir=/var/lib/k0s kubectl get node -l kubernetes.io/hostname=$(hostname) -o json
You should be able to see what the workers report as their status.
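To narrow the output down to just the Ready condition, something like this should work as well (same assumptions as above; the label selector returns a list, hence .items[0]):

sudo env KUBECONFIG=/var/lib/k0s/kubelet.conf k0s --data-dir=/var/lib/k0s kubectl get node -l kubernetes.io/hostname=$(hostname) -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'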
Logs would be helpful in diagnosing this.
Ok, the command you provided showed me that this kubelet.conf still uses the IP of one of the old controller+worker nodes. It surprises me that the node IPs are even used, given that this cluster has NLLB enabled.
Yes, switching the config to https://127.0.0.1:7443 gives the expected node JSON output.
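Concretely, the relevant line in the file looked like this (assuming the default data dir /var/lib/k0s, adjust for a custom data dir; the old address is paraphrased):

$ sudo grep 'server:' /var/lib/k0s/kubelet.conf
    server: https://<old-controller-ip>:6443

After pointing it at the node-local load balancer it reads server: https://127.0.0.1:7443 instead.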
Given that this kubelet.conf isn't actively used by the cluster itself for communication (that would have resulted in a broken cluster), I wonder: what is this file used for? Is it only used by k0s/k0sctl for cluster provisioning/node bootstrapping? And what would happen if the file disappeared?
https://github.com/k0sproject/k0s/blob/9d5247833e12278e19b7b5260820570175d851fe/pkg/component/worker/utils.go#L41
I guess that's the logic in k0s that creates the file, and from looking at one of the workers it seems that deleting kubelet.conf would not result in it being recreated, at least not without k0sctl providing a new join token.
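In other words, the behaviour I'm inferring is something like this (a hypothetical sketch for illustration only, not the actual k0s code; the function name and parameters are made up):

package worker

import (
    "fmt"
    "os"
)

// handleWorkerKubeconfig illustrates the "create once during bootstrap" pattern
// inferred above: the worker kubeconfig is only written when a join token is at
// hand, so a file that gets deleted later is never recreated on its own.
func handleWorkerKubeconfig(kubeletConfPath string, joinTokenKubeconfig []byte) error {
    if _, err := os.Stat(kubeletConfPath); err == nil {
        // the file already exists: bootstrap has already happened, leave it alone
        return nil
    }
    if len(joinTokenKubeconfig) == 0 {
        // no join token available: there is nothing to recreate the file from
        return fmt.Errorf("no kubeconfig at %s and no join token to bootstrap one", kubeletConfPath)
    }
    return os.WriteFile(kubeletConfPath, joinTokenKubeconfig, 0600)
}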
@kke luckily I still had the old VMs around to check: their kubelet.conf pointed to the first controller node of the original 3 controllers, from before I created the new controllers. Does that mean I can't reboot these machines while the single controller referenced in that config is not running?
I just manually fixed all nodes to point to a valid controller. k0sctl now correctly detects that the workers are already part of the cluster and doesn't touch them. However, it reinstalls the controllers on every repeated apply run, even though nothing changed in between.
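In case it helps others, the per-worker fix amounts to something like this (a sketch; it assumes the default data dir /var/lib/k0s, the NLLB listener on 127.0.0.1:7443 as the target, and the k0sworker service name, all of which may differ per setup):

sudo sed -i 's|server: https://.*|server: https://127.0.0.1:7443|' /var/lib/k0s/kubelet.conf
sudo systemctl restart k0sworker

Whether the service restart is strictly needed depends on whether anything reads the file at runtime (see the open question above).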