
Control plane node won't start with VIP and an FQDN API endpoint on Talos 1.0.4

Open szinn opened this issue 3 years ago • 20 comments

Bug Report

Control plane node fails to come up on 1.0.4 when the VIP is set

Description

Node is configured with stage-1.txt. Log file attached.

After the bootstrap, the machine fails to connect to the API via localhost:6443

10.0.40.64: user: warning: [2022-05-03T15:36:45.129172524Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Secret/bootstrap-token-zl18ku: Get \"https://localhost:6443/api?timeout=32s\": dial tcp [::1]:6443: connect: connection refused"}
10.0.40.64: user: warning: [2022-05-03T15:37:32.479313524Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}

However, the FQDN of the API is staging.zinn.ca.

config.txt is a dump of

talosctl -n 10.0.40.64 read /system/secrets/kubernetes/kube-controller-manager/kubeconfig > config.txt

Note the cluster definition:

clusters:
- name: staging
  cluster:
    server: https://localhost:6443/

I submit it should actually be the FQDN of the API.

If I reboot the machine and let it come up with 1.0.1, it is correct and the node fully comes online. Rebooting with 1.0.4, it doesn't start up.

Logs

stage-1.txt log.txt config.txt

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
$ talosctl version -n 10.0.40.64
Client:
	Tag:         v1.0.4
	SHA:         f6696063
	Built:
	Go version:  go1.17.7
	OS/Arch:     darwin/arm64
Server:
	NODE:        10.0.40.64
	Tag:         v1.0.4
	SHA:         f6696063
	Built:
	Go version:  go1.17.7
	OS/Arch:     linux/amd64
	Enabled:     RBAC
  • Kubernetes version: [kubectl version --short] Not booted up

  • Platform: Proxmox VM (4 GB RAM, 20 GB disk, 8 cores)

szinn avatar May 03 '22 15:05 szinn

Rebooted with 1.0.1 and the node came up - however, /system/secrets/kubernetes/kube-controller-manager/kubeconfig still referred to localhost

szinn avatar May 03 '22 15:05 szinn

The fact that it refers to localhost is expected - control plane nodes talk directly to the local API server.

smira avatar May 03 '22 17:05 smira

Looking at the log.txt, it feels like everything on the Talos side worked correctly. I don't see any errors or anything that would point at the problem. The errors printed are expected until the control plane is up; Talos reconciles the state and retries all transient errors.

Talos wrote down the static pod definitions to disk for the control plane components, and the kubelet was up, so it should have picked them up and started the pods. But it looks like the pods don't start for whatever reason.

In order to investigate further, we can do:

talosctl c -k
talosctl logs kubelet

If the API server fails to start, logs can be found manually in /var/log/pods/... on the node using talosctl ls and talosctl read
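
For example (node IP taken from above; the pod directory and container names are placeholders and will differ):

talosctl -n 10.0.40.64 ls /var/log/pods
talosctl -n 10.0.40.64 read /var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log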

smira avatar May 03 '22 17:05 smira

$ talosctl c -k -n 10.0.40.64
NODE   NAMESPACE   ID   IMAGE   PID   STATUS

kubelet.log

/var/log/pods is empty

szinn avatar May 03 '22 17:05 szinn

That is strange. Just to double-check, are the files visible in /etc/kubernetes/manifests?

smira avatar May 03 '22 18:05 smira

$ talosctl -n 10.0.40.64 ls /etc/kubernetes/manifests/
NODE         NAME
10.0.40.64   .
10.0.40.64   talos-kube-apiserver.yaml
10.0.40.64   talos-kube-controller-manager.yaml
10.0.40.64   talos-kube-scheduler.yaml

szinn avatar May 03 '22 18:05 szinn

I have no good idea, unless it's something within the kubelet. My only guess is trying to roll back kubelet to 1.23.5 (this can be done on the live node with talosctl edit mc and updating the kubelet image reference).
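
Something like this should do it (the image reference is an assumption based on the default kubelet image Talos uses, so adjust it to your setup):

talosctl -n 10.0.40.64 edit machineconfig

and then change the kubelet image, e.g.:

machine:
  kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.23.5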

smira avatar May 03 '22 18:05 smira

Can you check the logs from /var/log/containers? There should be one for the apiserver if it ever started.
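
For example (illustrative):

talosctl -n 10.0.40.64 ls /var/log/containers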

frezbo avatar May 03 '22 18:05 frezbo

But it should have nothing to do with the control plane endpoint or the VIP. Static pods should work no matter what the API server connection state is.

smira avatar May 03 '22 18:05 smira

Rolling the kubelet back to 1.23.5 unblocked it and the node came up. It was 1.23.6 previously.

szinn avatar May 03 '22 18:05 szinn

Ok, interesting. I don't see anything in the changelog which would immediately suggest what the issue might be: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.23.md#changelog-since-v1235

smira avatar May 03 '22 18:05 smira

Also, if I do talosctl reset -n 10.0.40.64, the process gets bogged down because the registry node watch fails since the VIP address has already been taken down. The machine needs to be manually rebooted.

So, I changed the kubelet version in the config that I applied via apply-config to 1.23.5 and it got hung up again. Using edit machineconfig to change it to 1.23.6 unblocked it and the node came up.

szinn avatar May 03 '22 18:05 szinn

I think the kubelet version might be a red herring; rather, it's something with the kubelet not picking up the static pod definitions, and restarting the kubelet fixes that.

talosctl reset is known to be weird when the node is not healthy, so it probably needs --graceful=false for now.
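
For example (node IP from above; illustrative):

talosctl reset -n 10.0.40.64 --graceful=false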

smira avatar May 03 '22 18:05 smira

Agreed on the red herring, and also that the kubelet is perhaps getting started before the static pod definitions are fully written? Restarting it picks up the updated info and everything works fine after that.
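
If anyone else wants to try just restarting the kubelet without touching the version, something along these lines should work (illustrative):

talosctl -n 10.0.40.64 service kubelet restart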

szinn avatar May 03 '22 18:05 szinn

kubelet is supposed to watch that directory, so it shouldn't matter when it gets written, but you might have uncovered some bug.

smira avatar May 03 '22 18:05 smira

@smira I've also hit this issue when using a VIP and Talos 1.0.4 through Sidero. The big difference is that I'm using Kubernetes 1.21.7, so I doubt the Kubernetes version is involved.

FYI, I'm using BIOS mode instead of UEFI as a workaround for the issue I described here.

davidspek avatar Jun 01 '22 19:06 davidspek

After changing the machineconfig with talosctl edit machineconfig to remove the VIP and use the node's IP, followed by talosctl reset, the node was able to register and I got Cilium installed. Then some pods started running. However, the Cilium DaemonSet fails to start with the following error:

level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s
level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s
level=fatal msg="Unable to initialize Kubernetes subsystem" error="unable to create k8s client: unable to create k8s client: Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" subsys=daemon

Checking the Kubernetes service and endpoint, they point to the node IP. What's strange is that kubectl from my machine works as expected, but the cluster-internal connection to the Kubernetes API seems to be broken.
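
The check referred to above can be done with something like (illustrative):

kubectl get service kubernetes -n default
kubectl get endpoints kubernetes -n default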

davidspek avatar Jun 01 '22 19:06 davidspek

Please let's not mix all the issues together in a single ticket. The Cilium issue above is about using the cluster IP to access the API server, which most probably has nothing to do with Talos itself. It's CNI/kube-proxy land.

smira avatar Jun 01 '22 19:06 smira

I thought it might be useful information since, after the node was able to come up, there still seem to be problems. For what it's worth, the Talos flannel also can't connect to the Kubernetes API.

My mistake, it was related to the kubeconfig that kube-proxy uses pointing to the VIP, which never became available.
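
One way to confirm where kube-proxy is pointing (assuming kube-proxy runs as the usual DaemonSet in kube-system) is to dump its definition and inspect which kubeconfig it mounts and what server that kubeconfig references, e.g.:

kubectl -n kube-system get daemonset kube-proxy -o yaml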

davidspek avatar Jun 01 '22 19:06 davidspek

I've figured out what my problem was with the VIP not becoming available. I was trying to bind the VIP to eth2, which turned out to be the wrong interface. I'll create a separate issue for this problem.
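
For reference, the VIP binding lives in the machine config under the interface it should be bound to; a minimal sketch (interface name and addresses are illustrative):

machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
        vip:
          ip: 10.0.40.100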

davidspek avatar Jun 01 '22 20:06 davidspek