talos
Upgrade Docs for 1.6 should mention change to kubeprism endpoint
Bug Report
Description
Because of the change in https://github.com/siderolabs/talos/commit/f70b47dddc2599a618c68d8b403d9b37c61f2b71, upgrading a worker node from a version before this commit to one after it, with default KubePrism and certSANs settings, leaves the node unable to come up fully. The upgraded worker now expects the API server certificate to be valid for 127.0.0.1. If the control plane nodes have not been upgraded yet, and their cluster.apiServer.certSANs does not explicitly contain 127.0.0.1, the certificate will not have 127.0.0.1 as a SAN, so the worker cannot communicate with the API server at all and is unable to do much.
The release notes for the first release containing this commit, and the upgrade docs for the 1.6 series, should at the very least mention this change so that people are aware of the issue and can work around it.
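For anyone hitting this before the control plane is upgraded, the workaround implied above could look roughly like the following machine config fragment. This is a minimal sketch, assuming it is merged into the control plane nodes' existing config (keeping any certSANs entries already present) and that the API server certificate is regenerated afterwards; I have not verified it beyond the field path named in this report.

```yaml
# Sketch of a control plane config patch (assumption: merged into the
# existing machine config, not used as a full replacement).
cluster:
  apiServer:
    certSANs:
      # Loopback address that upgraded workers now use to reach the
      # API server through KubePrism (see the log below: 127.0.0.1:7445).
      - 127.0.0.1
```

With this entry present, the API server certificate should list 127.0.0.1 as a SAN, which is what the x509 error in the logs reports as missing.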
Logs
10.16.2.201: {"ts":1708403433539.275,"caller":"cache/reflector.go:147","msg":"vendor/k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://127.0.0.1:7445/api/v1/services?limit=500&resourceVersion=0\": tls: failed to verify certificate: x509: certificate is valid for 10.16.2.10, 10.16.2.101, 172.17.0.1, not 127.0.0.1"}
Environment
- Talos version (pre-upgrade control-plane):
Client:
Tag: v1.6.4
SHA: 431bcada
Built:
Go version: go1.21.6 X:loopvar
OS/Arch: linux/amd64
Server:
NODE: 10.16.2.101
Tag: v1.6.0
SHA: eddd188c
Built:
Go version: go1.21.5 X:loopvar
OS/Arch: linux/arm64
Enabled: RBAC
- Talos version (post-upgrade worker that couldn't connect):
Client:
Tag: v1.6.4
SHA: 431bcada
Built:
Go version: go1.21.6 X:loopvar
OS/Arch: linux/amd64
Server:
NODE: 10.16.2.201
Tag: v1.6.4
SHA: 431bcada
Built:
Go version: go1.21.6 X:loopvar
OS/Arch: linux/arm64
Enabled: RBAC
- Kubernetes version:
Client Version: v1.28.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0
- Platform:
metal
This was an unfortunate change to make in a point release, but we had to make it.
Please follow the recommended upgrade sequence: control plane nodes first, then workers. This order always works as expected.
Actually, that is something that does need documentation. As far as I can tell, we do not explicitly document that Talos upgrades should be done on control plane nodes first. (We do for K8s upgrades.)
I'll add that.