eks-anywhere
eks-anywhere copied to clipboard
upgrade kubernetes version of EKSA cluster for bare metal with 2 CP nodes (1 used + 1 idle) doesn't work
What happened: It's ok for me to upgrade a cluster with 4 CP nodes (3 used + 1 idle).
However, when I try to upgrade a cluster with 2 CP nodes (1 used + 1 idle), the upgrade stuck after the idle node is "Provisioned":
armada@admin-machine2:~/eksa/mgmt02$ kubectl get workflow -A -o wide
NAMESPACE NAME TEMPLATE STATE
eksa-system mgmt02-standalone2-control-plane-template-1710122858425-44xpk mgmt02-standalone2-control-plane-template-1710122858425-44xpk STATE_SUCCESS
armada@admin-machine2:~/eksa/mgmt02$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
eksa-control-02 Ready control-plane 5h16m v1.26.14 10.20.22.224 <none> Ubuntu 20.04.6 LTS 5.4.0-172-generic containerd://1.7.10
eksa-wk-650-01 Ready <none> 4h56m v1.26.14 10.20.22.227 <none> Ubuntu 20.04.6 LTS 5.4.0-172-generic containerd://1.7.10
armada@admin-machine2:~/eksa/mgmt02$ kubectl get machines.cluster.x-k8s.io -A -o wide
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
eksa-system mgmt02-standalone2-b94wm mgmt02-standalone2 eksa-control-02 tinkerbell://eksa-system/eksa-control-02 Running 4h54m v1.26.10-eks-1-26-21
eksa-system mgmt02-standalone2-md-0-dvxsx-ntvph mgmt02-standalone2 eksa-wk-650-01 tinkerbell://eksa-system/eksa-wk-650-01 Running 4h54m v1.26.10-eks-1-26-21
eksa-system mgmt02-standalone2-p8fwl mgmt02-standalone2 tinkerbell://eksa-system/eksa-main-control-01 Provisioned 45m v1.27.7-eks-1-27-15
An smoking issue after the idle node is "Provisioned" is that, when I run above "kubectl get ..." commands, I may see such error message:
...
E0311 07:45:31.895702 953673 memcache.go:265] couldn't get current server API group list: Get "https://10.20.22.222:6443/api?timeout=32s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- EKS Anywhere Release: v0.18.2
- EKS Distro Release: 1.26/1.27
@ygao-armada, ideally there shouldn't be a difference due to the number of CPs; am curious if the node is in idle stage due to no resource or some event, or something else ?
@ndeksa sorry, I should have used "spare" instead of "idle".