eks-anywhere
kind: Machine stuck in Provisioned state
What happened: As noted in https://github.com/aws/eks-anywhere/issues/2373, upgrading from older versions of EKS-A requires stepping through the versions sequentially to reach the latest version. As part of that process, I am attempting the following upgrade:
From: ubuntu-v1.21.2-kubernetes-1-21-eks-5-amd64 To: ubuntu-v1.21.5-eks-d-1-21-8-eks-a-6-amd64
The manifest for the bundle requires using CLI version 0.7.0: https://anywhere-assets.eks.amazonaws.com/releases/bundles/6/manifest.yaml
When attempting this, the eksctl anywhere upgrade -v=5 command keeps retrying indefinitely with:
2022-07-12T11:16:06.882-0700 V5 Error happened during retry {"error": "1 machine deployment replicas are unavailable", "retries": 3835}
2022-07-12T11:16:06.882-0700 V5 Sleeping before next retry {"time": "0s"}
I can see in the bootstrap cluster that one of the Machine resources is stuck in the Provisioned phase, with NodeHealthy=False:
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
my-cluster-md-0-5d4b7ddc6f-qblrz my-cluster vsphere://1226ae6f-2713-c953-d3d6-96646c849b47 Provisioned 45m v1.21.5-eks-1-21-8
my-cluster-md-0-7c4b58b67b-bdqsq my-cluster my-cluster-md-0-7c4b58b67b-bdqsq vsphere://12260891-56a2-8373-8008-c2eb6da7691d Running 57m v1.21.2-eks-1-21-5
And here are the Conditions on my-cluster-md-0-5d4b7ddc6f-qblrz:
Conditions:
Last Transition Time: 2022-07-12T17:32:26Z
Status: True
Type: Ready
Last Transition Time: 2022-07-12T17:31:38Z
Status: True
Type: BootstrapReady
Last Transition Time: 2022-07-12T17:32:26Z
Status: True
Type: InfrastructureReady
Last Transition Time: 2022-07-12T17:32:26Z
Reason: NodeProvisioning
Severity: Warning
Status: False
Type: NodeHealthy
The node itself is already in Running status from a Kubernetes perspective, and has deployed the DaemonSet pods. So Kubernetes thinks it's good to go, but eks-a doesn't.
And lastly, when I look at the capi-controller-manager pod, it shows:
E0712 18:03:51.762840 1 machine_controller_noderef.go:197] controller/machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "name"="my-cluster-md-0-5d4b7ddc6f-qblrz" "namespace"="eksa-system" "providerID"={} "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine" "node"="/my-cluster-md-0-5d4b7ddc6f-qblrz"
Which is odd because the Machine resource clearly does have a providerID set.
What you expected to happen:
I expect the provisioned Machine to go into Running status once it's provisioned.
How to reproduce it (as minimally and precisely as possible):
Attempt an upgrade between these two versions: From: ubuntu-v1.21.2-kubernetes-1-21-eks-5-amd64 To: ubuntu-v1.21.5-eks-d-1-21-8-eks-a-6-amd64
Using CLI version 0.7.0
Anything else we need to know?:
Environment:
- EKS Anywhere Release:
- EKS Distro Release: ^ From: ubuntu-v1.21.2-kubernetes-1-21-eks-5-amd64 To: ubuntu-v1.21.5-eks-d-1-21-8-eks-a-6-amd64
I found the problem!!
The providerID that's missing is on the kind: Node resource that's on the workload cluster - not the bootstrap cluster. So in this version upgrade something must be broken in that it is unable to write back to the workload cluster's Node resources to set the providerID.
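A minimal sketch of how this could be checked, assuming the output of `kubectl get nodes -o json` from the workload cluster is available as a string. The helper name `nodes_missing_provider_id` and the sample data are hypothetical, used only for illustration:

```python
import json
from typing import List


def nodes_missing_provider_id(node_list_json: str) -> List[str]:
    """Return the names of Nodes whose spec.providerID is absent or empty.

    Expects the JSON produced by `kubectl get nodes -o json`.
    """
    node_list = json.loads(node_list_json)
    missing = []
    for node in node_list.get("items", []):
        provider_id = node.get("spec", {}).get("providerID", "")
        if not provider_id:
            missing.append(node["metadata"]["name"])
    return missing


if __name__ == "__main__":
    # Hypothetical sample resembling the state described in this issue:
    # the new node has no providerID, the old one does.
    sample = json.dumps({
        "items": [
            {"metadata": {"name": "my-cluster-md-0-5d4b7ddc6f-qblrz"}, "spec": {}},
            {
                "metadata": {"name": "my-cluster-md-0-7c4b58b67b-bdqsq"},
                "spec": {"providerID": "vsphere://12260891-56a2-8373-8008-c2eb6da7691d"},
            },
        ]
    })
    print(nodes_missing_provider_id(sample))
```

Any Node name returned here would be a candidate for the "providerID is empty" error seen in the capi-controller-manager log.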
When I manually copied the kind: Machine providerID over to the corresponding kind: Node resource - it went from Provisioned -> Running!
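For reference, the manual workaround can be sketched roughly as follows. This only builds the two kubectl invocations rather than running them. The helper names and kubeconfig paths are hypothetical, and the sketch assumes the Node has the same name as the Machine (as the capi-controller-manager log above suggests) and that the Machine lives in the eksa-system namespace:

```python
import json
import shlex
from typing import List


def machine_provider_id_cmd(machine: str, kubeconfig: str,
                            namespace: str = "eksa-system") -> List[str]:
    """Build the kubectl command that reads spec.providerID from a CAPI Machine."""
    return [
        "kubectl", "--kubeconfig", kubeconfig,
        "get", "machines.cluster.x-k8s.io", machine,
        "-n", namespace,
        "-o", "jsonpath={.spec.providerID}",
    ]


def node_patch_cmd(node: str, provider_id: str, kubeconfig: str) -> List[str]:
    """Build the kubectl command that sets spec.providerID on a Node."""
    patch = json.dumps({"spec": {"providerID": provider_id}})
    return [
        "kubectl", "--kubeconfig", kubeconfig,
        "patch", "node", node,
        "--type", "merge",
        "-p", patch,
    ]


if __name__ == "__main__":
    name = "my-cluster-md-0-5d4b7ddc6f-qblrz"
    # Hypothetical kubeconfig paths: the Machine is read from the bootstrap
    # cluster, the Node is patched on the workload cluster.
    print(shlex.join(machine_provider_id_cmd(name, "bootstrap.kubeconfig")))
    print(shlex.join(node_patch_cmd(
        name,
        "vsphere://1226ae6f-2713-c953-d3d6-96646c849b47",
        "workload.kubeconfig",
    )))
```

Printing the commands first makes it easy to review them before running anything against a live cluster.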
Thanks for opening this issue @smarsh-tim! We will look into what could be causing this behavior