Upgrade Docs for 1.6 should mention change to kubeprism endpoint

Open Pythoner6 opened this issue 1 year ago • 3 comments

Bug Report

Description

Because of the change in https://github.com/siderolabs/talos/commit/f70b47dddc2599a618c68d8b403d9b37c61f2b71, upgrading a worker node from a version before this commit to one after it, with default kubeprism and certSANs settings, leaves the node unable to come up fully. The upgraded worker now expects the API server certificate to be valid for 127.0.0.1. If the control plane nodes haven't been upgraded yet (and their cluster.apiServer.certSANs doesn't explicitly contain 127.0.0.1), the certificate will not have 127.0.0.1 as a SAN, so the worker node fails to communicate with the API server at all and can't do much.

The release notes for the first release containing this commit, and the upgrade docs for the 1.6 series, should at the very least mention this change so people are aware of the issue and can work around it.
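
One possible workaround, implied by the condition above rather than confirmed anywhere in this thread, is to add 127.0.0.1 explicitly to cluster.apiServer.certSANs on the control plane nodes that have not been upgraded yet. A minimal sketch of such a machine config patch follows. It assumes that setting the field makes the API server certificate include the address; how the patch is applied is left out.

```yaml
# Sketch of a machine config patch for the not-yet-upgraded control plane nodes.
# Assumption: listing 127.0.0.1 here makes the API server certificate carry it as a SAN,
# which is what the upgraded worker expects.
cluster:
  apiServer:
    certSANs:
      - 127.0.0.1
```

Upgrading the control plane nodes before the workers avoids the mismatch without any config change.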

Logs

10.16.2.201: {"ts":1708403433539.275,"caller":"cache/reflector.go:147","msg":"vendor/k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://127.0.0.1:7445/api/v1/services?limit=500&resourceVersion=0\": tls: failed to verify certificate: x509: certificate is valid for 10.16.2.10, 10.16.2.101, 172.17.0.1, not 127.0.0.1"}

Environment

  • Talos version (pre-upgrade control-plane):
Client:
	Tag:         v1.6.4
	SHA:         431bcada
	Built:       
	Go version:  go1.21.6 X:loopvar
	OS/Arch:     linux/amd64
Server:
	NODE:        10.16.2.101
	Tag:         v1.6.0
	SHA:         eddd188c
	Built:       
	Go version:  go1.21.5 X:loopvar
	OS/Arch:     linux/arm64
	Enabled:     RBAC
  • Talos version (post-upgrade worker that couldn't connect):
Client:
	Tag:         v1.6.4
	SHA:         431bcada
	Built:       
	Go version:  go1.21.6 X:loopvar
	OS/Arch:     linux/amd64
Server:
	NODE:        10.16.2.201
	Tag:         v1.6.4
	SHA:         431bcada
	Built:       
	Go version:  go1.21.6 X:loopvar
	OS/Arch:     linux/arm64
	Enabled:     RBAC
  • Kubernetes version:
Client Version: v1.28.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0
  • Platform: metal

Pythoner6 avatar Feb 20 '24 04:02 Pythoner6

This was an unfortunate change in the point release, but we had to make it.

Please follow the recommended upgrade sequence: control plane nodes first, then workers. This always works as expected.

smira avatar Feb 20 '24 09:02 smira

Actually, that is something that does need documentation - as far as I can tell, we do not document explicitly that Talos upgrades should be done on control plane nodes first. (We do for K8s upgrades.)

I'll add that.

steverfrancis avatar Feb 20 '24 15:02 steverfrancis