Control plane node won't start with a VIP and an FQDN API endpoint on Talos 1.0.4
Bug Report
Control plane node fails to come up on 1.0.4 when a VIP is set
Description
The node is configured with stage-1.txt. Log file attached.
After bootstrap, the machine fails to connect to the API via localhost:6443:
10.0.40.64: user: warning: [2022-05-03T15:36:45.129172524Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Secret/bootstrap-token-zl18ku: Get \"https://localhost:6443/api?timeout=32s\": dial tcp [::1]:6443: connect: connection refused"}
10.0.40.64: user: warning: [2022-05-03T15:37:32.479313524Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
However, the FQDN of the API is staging.zinn.ca.
config.txt is a dump of
talosctl -n 10.0.40.64 read /system/secrets/kubernetes/kube-controller-manager/kubeconfig > config.txt
Note the cluster definition:
clusters:
- name: staging
  cluster:
    server: https://localhost:6443/
I submit that it should actually be the FQDN of the API.
If I reboot the machine and let it come up with 1.0.1, it is correct and the node fully comes online. Rebooting with 1.0.4, it doesn't start up.
Logs
stage-1.txt log.txt config.txt
Environment
- Talos version: [talosctl version --nodes <problematic nodes>]
$ talosctl version -n 10.0.40.64
Client:
Tag: v1.0.4
SHA: f6696063
Built:
Go version: go1.17.7
OS/Arch: darwin/arm64
Server:
NODE: 10.0.40.64
Tag: v1.0.4
SHA: f6696063
Built:
Go version: go1.17.7
OS/Arch: linux/amd64
Enabled: RBAC
- Kubernetes version: [kubectl version --short] Not booted up
- Platform: Proxmox VM (4 GB RAM, 20 GB disk, 8 cores)
Rebooted with 1.0.1 and the node came up; however, /system/secrets/kubernetes/kube-controller-manager/kubeconfig still referred to localhost.
The fact that it refers to localhost is expected - control plane nodes talk directly to the local API server.
Looking at log.txt, it feels like everything on the Talos side worked correctly. I don't see any errors or anything that would point at the problem. The errors printed are expected until the control plane is up; Talos reconciles the state and retries all transient errors.
Talos wrote the static pod definitions for the control plane components to disk, and the kubelet was up, so it should have picked them up and started the pods. But it looks like the pods don't start for whatever reason.
To investigate further, we can do:
talosctl containers -k
talosctl logs kubelet
If the API server fails to start, its logs can be found manually under /var/log/pods/... on the node using talosctl ls and talosctl read, as sketched below.
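For example, a sketch of that exploration, assuming the node IP from this report; the pod directory names under /var/log/pods vary per cluster, so the placeholder must be replaced with the actual directory:
$ talosctl -n 10.0.40.64 ls /var/log/pods
$ talosctl -n 10.0.40.64 ls /var/log/pods/<kube-apiserver-pod-dir>
$ talosctl -n 10.0.40.64 read /var/log/pods/<kube-apiserver-pod-dir>/kube-apiserver/0.log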
That is strange. Just to double-check, are the files visible in /etc/kubernetes/manifests?
$ talosctl -n 10.0.40.64 ls /etc/kubernetes/manifests/
NODE         NAME
10.0.40.64   .
10.0.40.64   talos-kube-apiserver.yaml
10.0.40.64   talos-kube-controller-manager.yaml
10.0.40.64   talos-kube-scheduler.yaml
I have no good idea, unless it's something within the kubelet. My only guess is to try rolling the kubelet back to 1.23.5 (this can be done on the live node with talosctl edit mc, updating the kubelet image reference as sketched below).
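For reference, a minimal sketch of the fragment to change via talosctl edit mc; the image repository shown is an assumption based on the Talos 1.0 defaults, so check the existing value in the machine config first:
machine:
  kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.23.5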
Can you check the logs from /var/log/containers? There should be one for the apiserver if it ever started.
But it should have nothing to do with the control plane endpoint or the VIP. Static pods should work no matter what the API server connection state is.
Rolling the kubelet back to 1.23.5 unblocked it and the node came up. It was 1.23.6 previously.
Ok, interesting. I don't see anything in the changelog which would immediately suggest what the issue might be: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.23.md#changelog-since-v1235
Also, if I do talosctl reset -n 10.0.40.64, the process gets bogged down because the registry node watch fails since the VIP address has already been taken down. The machine needs to be manually rebooted.
So I changed the kubelet version to 1.23.5 in the config that I applied via apply-config, and it got hung up again. Running talosctl edit machineconfig and changing it back to 1.23.6 unblocked it, and the node came up.
I think the kubelet version might be a red herring; rather, it's something about the kubelet not picking up the static pod definitions, and restarting the kubelet fixes that.
talosctl reset is known to be weird when the node is not healthy, so it probably needs --graceful=false for now.
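For example, a sketch assuming the node IP from this report (--reboot is optional):
$ talosctl reset --graceful=false --reboot -n 10.0.40.64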
Agreed on the red herring. Could it be that the kubelet is getting started before the static pod definitions are fully written? Restarting it picks up the updated info, and everything works fine after that.
kubelet is supposed to watch that directory, so it shouldn't matter when it gets written, but you might have uncovered some bug.
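If a kubelet restart is the workaround here, a sketch of doing that on the live node with the standard service command:
$ talosctl -n 10.0.40.64 service kubelet restart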
@smira I've also hit this issue when using a VIP and Talos 1.0.4 through Sidero. The big difference is that I'm using Kubernetes 1.21.7, so I doubt the Kubernetes version is involved.
FYI, I'm using BIOS mode instead of UEFI as a workaround for the issue I described here.
After changing the machine config with talosctl edit machineconfig to remove the VIP and use the node's IP, followed by talosctl reset, the node was able to register and I got Cilium installed. Then some pods started running. However, the Cilium DaemonSet fails to start with the following error:
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s
level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s
level=fatal msg="Unable to initialize Kubernetes subsystem" error="unable to create k8s client: unable to create k8s client: Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" subsys=daemon
Checking the Kubernetes service and endpoints, they point to the node IP. What's strange is that kubectl from my machine works as expected, but the cluster-internal connection to the Kubernetes API seems to be broken.
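For reference, those checks can be done with standard kubectl commands; this is a sketch of the inspection described above, not necessarily the exact commands used:
$ kubectl get svc kubernetes -n default
$ kubectl get endpoints kubernetes -n default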
Please let's not mix all the issues together in a single ticket. The Cilium issue above is about using the cluster IP to access the API server, which most probably has nothing to do with Talos itself. It's CNI/kube-proxy land.
I thought it might be useful information, since after the node was able to come up there still seemed to be problems. For what it's worth, Talos' flannel also can't connect to the Kubernetes API.
My mistake, it was related to the kubeconfig kube-proxy is using, which points to the VIP, which never became available.
I've figured out what my problem was with the VIP not becoming available: I was trying to bind the VIP to eth2, while in fact it should have been bound to a different interface. I'll create a separate issue for this problem.
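For context, the VIP is bound per interface in the Talos machine config, so the interface name has to match the link that actually carries the VIP's subnet. A minimal sketch with hypothetical interface and address values:
machine:
  network:
    interfaces:
      - interface: eth2
        dhcp: true
        vip:
          ip: 10.0.40.100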