eks-anywhere kind: Machine stuck in Provisioned state

What happened: As noted in https://github.com/aws/eks-anywhere/issues/2373, upgrading from an older version of EKS-A requires stepping through the versions sequentially to reach the latest one. As part of that process, I am attempting this upgrade:

From: ubuntu-v1.21.2-kubernetes-1-21-eks-5-amd64 To: ubuntu-v1.21.5-eks-d-1-21-8-eks-a-6-amd64

The manifest for the bundle requires using CLI version 0.7.0: https://anywhere-assets.eks.amazonaws.com/releases/bundles/6/manifest.yaml

When attempting this, the eksctl anywhere upgrade -v=5 command keeps repeating these two lines:

2022-07-12T11:16:06.882-0700    V5      Error happened during retry     {"error": "1 machine deployment replicas are unavailable", "retries": 3835}
2022-07-12T11:16:06.882-0700    V5      Sleeping before next retry      {"time": "0s"}

I can see in the bootstrap cluster that the new Machine is stuck in the Provisioned phase, with NodeHealthy=False:

NAMESPACE     NAME                           CLUSTER   NODENAME                       PROVIDERID                                       PHASE         AGE   VERSION
my-cluster-md-0-5d4b7ddc6f-qblrz   my-cluster                                   vsphere://1226ae6f-2713-c953-d3d6-96646c849b47   Provisioned   45m   v1.21.5-eks-1-21-8
my-cluster-md-0-7c4b58b67b-bdqsq   my-cluster    my-cluster-md-0-7c4b58b67b-bdqsq   vsphere://12260891-56a2-8373-8008-c2eb6da7691d   Running       57m   v1.21.2-eks-1-21-5

And here are the Conditions on my-cluster-md-0-5d4b7ddc6f-qblrz:

  Conditions:
    Last Transition Time:  2022-07-12T17:32:26Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2022-07-12T17:31:38Z
    Status:                True
    Type:                  BootstrapReady
    Last Transition Time:  2022-07-12T17:32:26Z
    Status:                True
    Type:                  InfrastructureReady
    Last Transition Time:  2022-07-12T17:32:26Z
    Reason:                NodeProvisioning
    Severity:              Warning
    Status:                False
    Type:                  NodeHealthy
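
To pick the failing condition out of a longer list, here is a minimal sketch in Python. It assumes the Machine was dumped with kubectl get machine ... -o json and loaded into a dict. The function name failing_conditions and the sample data are illustrative only and are not part of EKS-A or Cluster API.

```python
def failing_conditions(machine: dict) -> list:
    """Return the status.conditions entries whose status is not "True"."""
    conditions = machine.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") != "True"]


# Sample data mirroring the conditions listed above (timestamps omitted).
sample_machine = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "BootstrapReady", "status": "True"},
            {"type": "InfrastructureReady", "status": "True"},
            {
                "type": "NodeHealthy",
                "status": "False",
                "reason": "NodeProvisioning",
                "severity": "Warning",
            },
        ]
    }
}

# Print type, reason and severity for every condition that is not True.
for cond in failing_conditions(sample_machine):
    print(cond["type"], cond.get("reason", ""), cond.get("severity", ""))
```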

The node itself is already in Running status from a Kubernetes perspective, and has deployed the DaemonSet pods. So Kubernetes thinks it's good to go, but eks-a doesn't.

And lastly, when I look at the capi-controller-manager pod, it shows:

E0712 18:03:51.762840       1 machine_controller_noderef.go:197] controller/machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "name"="my-cluster-md-0-5d4b7ddc6f-qblrz" "namespace"="eksa-system" "providerID"={} "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine" "node"="/my-cluster-md-0-5d4b7ddc6f-qblrz"

Which is odd because the Machine resource clearly does have a providerID set.

What you expected to happen:

I expect the provisioned Machine to go into Running status once it's provisioned.

How to reproduce it (as minimally and precisely as possible):

Attempt an upgrade between these two versions: From: ubuntu-v1.21.2-kubernetes-1-21-eks-5-amd64 To: ubuntu-v1.21.5-eks-d-1-21-8-eks-a-6-amd64

Using CLI version 0.7.0

Anything else we need to know?:

Environment:

EKS Anywhere Release:
EKS Distro Release: ^ From: ubuntu-v1.21.2-kubernetes-1-21-eks-5-amd64 To: ubuntu-v1.21.5-eks-d-1-21-8-eks-a-6-amd64

Jul 12 '22 18:07 smarsh-tim

I found the problem!!

The providerID that's missing is on the kind: Node resource in the workload cluster, not on the Machine in the bootstrap cluster. So something in this version upgrade must be broken such that the providerID never gets written back to the workload cluster's Node resources.
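
A rough way to spot this mismatch, sketched in Python. It assumes the Machines (from the bootstrap cluster) and the Nodes (from the workload cluster) were exported with kubectl get ... -o json and loaded as lists of dicts, and that each Node has the same name as its Machine, as in the output above. The function name and the sample data are illustrative. In particular, the sample assumes the older node has its providerID set, which I did not paste above.

```python
def nodes_missing_provider_id(machines: list, nodes: list) -> list:
    """Return (name, providerID) for Machines whose same-named Node has no providerID."""
    node_ids = {
        n["metadata"]["name"]: n.get("spec", {}).get("providerID", "")
        for n in nodes
    }
    missing = []
    for m in machines:
        name = m["metadata"]["name"]
        provider_id = m.get("spec", {}).get("providerID", "")
        # Only report Machines that have an ID while the matching Node has none.
        if provider_id and name in node_ids and not node_ids[name]:
            missing.append((name, provider_id))
    return missing


# Machines as listed in the bootstrap cluster.
machines = [
    {"metadata": {"name": "my-cluster-md-0-5d4b7ddc6f-qblrz"},
     "spec": {"providerID": "vsphere://1226ae6f-2713-c953-d3d6-96646c849b47"}},
    {"metadata": {"name": "my-cluster-md-0-7c4b58b67b-bdqsq"},
     "spec": {"providerID": "vsphere://12260891-56a2-8373-8008-c2eb6da7691d"}},
]
# Nodes in the workload cluster (assumed state: only the new node lacks the ID).
nodes = [
    {"metadata": {"name": "my-cluster-md-0-5d4b7ddc6f-qblrz"}, "spec": {}},
    {"metadata": {"name": "my-cluster-md-0-7c4b58b67b-bdqsq"},
     "spec": {"providerID": "vsphere://12260891-56a2-8373-8008-c2eb6da7691d"}},
]

for name, provider_id in nodes_missing_provider_id(machines, nodes):
    print(name, provider_id)
```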

When I manually copied the kind: Machine providerID over to the corresponding kind: Node resource - it went from Provisioned -> Running!
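
For anyone wanting to script the same workaround, here is a hedged sketch in Python that only builds the kubectl patch command rather than running it. The helper name node_patch_command and the kubeconfig path workload.kubeconfig are hypothetical. The patch simply sets spec.providerID on the Node to the value taken from the Machine.

```python
import json
import shlex


def node_patch_command(node_name: str, provider_id: str, kubeconfig: str) -> list:
    """Build a kubectl command that sets spec.providerID on a workload-cluster Node."""
    patch = json.dumps({"spec": {"providerID": provider_id}})
    return [
        "kubectl", "--kubeconfig", kubeconfig,
        "patch", "node", node_name,
        "--type", "merge",
        "-p", patch,
    ]


cmd = node_patch_command(
    "my-cluster-md-0-5d4b7ddc6f-qblrz",
    "vsphere://1226ae6f-2713-c953-d3d6-96646c849b47",
    "workload.kubeconfig",  # hypothetical path to the workload cluster kubeconfig
)
# Print a copy-pasteable shell command; review it before running.
print(shlex.join(cmd))
```

Printing the command instead of executing it leaves room to double check the node name and ID before touching the cluster.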

Jul 12 '22 18:07 smarsh-tim

Thanks for opening this issue @smarsh-tim! We will look into what could be causing this behavior

Jul 12 '22 21:07 taneyland
