
[EKS] Error creating clusters with the quick start guide

Open soodr opened this issue 2 years ago • 7 comments

What steps did you take and what happened:

Created the management cluster using eksctl:

eksctl create cluster --profile aaa --name bbb

All the AWS provider steps, from initialization to generating the workload cluster, executed successfully. However, the workload cluster is stuck in the Provisioning state.

kubectl get clusters

NAME              PHASE          AGE    VERSION
capi-quickstart   Provisioning   179m

Environment:

Ubuntu VM on WSL on Windows 10: Linux HW0KK13 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Using AWS EKS provider for hosting clusters.

kubectl get kubeadmcontrolplane

NAME                            CLUSTER           INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE    VERSION
capi-quickstart-control-plane   capi-quickstart                                                                                   139m   v1.23.3

clusterctl version:

clusterctl version: &version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.3", GitCommit:"31146bd17a220ef6214c4c7a21f1aa57380b6b1f", GitTreeState:"clean", BuildDate:"2022-03-08T18:52:05Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}


kubectl logs -n capa-system capa-controller-manager-589ff74bd4-gtnst manager

I0414 23:25:49.696474 1 awsmachine_controller.go:432] "msg"="Reconciling AWSMachine"
I0414 23:25:49.696845 1 awsmachine_controller.go:448] "msg"="Cluster infrastructure is not ready yet"
I0414 23:25:49.697384 1 awsmachine_controller.go:432] "msg"="Reconciling AWSMachine"
I0414 23:25:49.697408 1 awsmachine_controller.go:448] "msg"="Cluster infrastructure is not ready yet"
E0414 23:25:49.734323 1 controller.go:317] controller/awscluster "msg"="Reconciler error" "error"="failed to create new vpc: failed to create vpc: RequestExpired: Request has expired.\n\tstatus code: 400, request id: 49625214-edd8-47d3-bf31-64e29ba9cd0a" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster"
I0414 23:25:49.734681 1 awscluster_controller.go:245] controller/awscluster "msg"="Reconciling AWSCluster" "cluster"="capi-quickstart" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster"


kubectl logs -n capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager-58db4b5555-x9pw9 manager

I0414 21:24:07.942724 1 request.go:665] Waited for 1.02897964s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/apiextensions.k8s.io/v1?timeout=32s
I0414 21:24:08.597219 1 logr.go:249] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="localhost:8080"
I0414 21:24:08.598891 1 logr.go:249] controller-runtime/builder "msg"="skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"}
I0414 21:24:08.599455 1 logr.go:249] controller-runtime/builder "msg"="Registering a validating webhook" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"} "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfig"
I0414 21:24:08.599594 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfig"
I0414 21:24:08.599773 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/convert"
I0414 21:24:08.599824 1 logr.go:249] controller-runtime/builder "msg"="Conversion webhook enabled" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"}
I0414 21:24:08.599846 1 logr.go:249] controller-runtime/builder "msg"="skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"}
I0414 21:24:08.599885 1 logr.go:249] controller-runtime/builder "msg"="Registering a validating webhook" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"} "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfigtemplate"
I0414 21:24:08.599973 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfigtemplate"
I0414 21:24:08.600097 1 logr.go:249] controller-runtime/builder "msg"="Conversion webhook enabled" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"}
I0414 21:24:08.600233 1 logr.go:249] setup "msg"="starting manager" "version"="v1.1.3"
I0414 21:24:08.600307 1 server.go:214] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0414 21:24:08.600434 1 internal.go:362] "msg"="Starting server" "addr"={"IP":"::","Port":9440,"Zone":""} "kind"="health probe"
I0414 21:24:08.600500 1 internal.go:362] "msg"="Starting server" "addr"={"IP":"127.0.0.1","Port":8080,"Zone":""} "kind"="metrics" "path"="/metrics"
I0414 21:24:08.600506 1 logr.go:249] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0414 21:24:08.600609 1 logr.go:249] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=9443
I0414 21:24:08.600667 1 leaderelection.go:248] attempting to acquire leader lease capi-kubeadm-bootstrap-system/kubeadm-bootstrap-manager-leader-election-capi...
I0414 21:24:08.600853 1 logr.go:249] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0414 21:24:08.618438 1 leaderelection.go:258] successfully acquired lease capi-kubeadm-bootstrap-system/kubeadm-bootstrap-manager-leader-election-capi
I0414 21:24:08.618654 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: *v1beta1.KubeadmConfig"
I0414 21:24:08.618676 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: *v1beta1.Machine"
I0414 21:24:08.618695 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: *v1beta1.Cluster"
I0414 21:24:08.618708 1 controller.go:186] controller/kubeadmconfig "msg"="Starting Controller" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig"
I0414 21:24:08.719961 1 controller.go:220] controller/kubeadmconfig "msg"="Starting workers" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "worker count"=10
I0414 21:40:26.996210 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="capi-quickstart-md-0-b9ddd959b-kqcmd" "namespace"="default" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="84528"
I0414 21:40:27.081743 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="capi-quickstart-md-0-b9ddd959b-kqcmd" "namespace"="default" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="84545"


kubectl get nodes

NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-11-254.us-east-2.compute.internal   Ready    <none>   12h   v1.21.5-eks-9017834
ip-192-168-36-148.us-east-2.compute.internal   Ready    <none>   12h   v1.21.5-eks-9017834


clusterctl describe cluster capi-quickstart

NAME                                                                READY  SEVERITY  REASON                           SINCE  MESSAGE
Cluster/capi-quickstart                                             False  Warning   VpcReconciliationFailed          3m21s  0 of 7 completed
├─ClusterInfrastructure - AWSCluster/capi-quickstart                False  Warning   VpcReconciliationFailed          3m21s  0 of 7 completed
├─ControlPlane - KubeadmControlPlane/capi-quickstart-control-plane
└─Workers
  └─MachineDeployment/capi-quickstart-md-0                          False  Warning   WaitingForAvailableMachines      3m21s  Minimum availability requires 3 replicas, current 0 available
    └─3 Machines...                                                 False  Info      WaitingForClusterInfrastructure  3m21s  See capi-quickstart-md-0-b9ddd959b-md84g, capi-quickstart-md-0-b9ddd959b-q7dm4, ...


kubectl get cluster capi-quickstart -o yaml

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cluster.x-k8s.io/v1beta1","kind":"Cluster","metadata":{"annotations":{},"name":"capi-quickstart","namespace":"default"},"spec":{"clusterNetwork":{"pods":{"cidrBlocks":["192.168.0.0/16"]}},"controlPlaneRef":{"apiVersion":"controlplane.cluster.x-k8s.io/v1beta1","kind":"KubeadmControlPlane","name":"capi-quickstart-control-plane"},"infrastructureRef":{"apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","kind":"AWSCluster","name":"capi-quickstart"}}}
  creationTimestamp: "2022-04-15T14:32:18Z"
  finalizers:
  - cluster.cluster.x-k8s.io
  generation: 1
  name: capi-quickstart
  namespace: default
  resourceVersion: "5630"
  uid: a7223ded-6a1d-43b3-9708-441c66386fe3
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneEndpoint:
    host: ""
    port: 0
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: capi-quickstart-control-plane
    namespace: default
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: capi-quickstart
    namespace: default
status:
  conditions:
  - lastTransitionTime: "2022-04-15T14:32:20Z"
    message: 0 of 7 completed
    reason: VpcReconciliationFailed
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2022-04-15T14:32:20Z"
    message: Waiting for control plane provider to indicate the control plane has been initialized
    reason: WaitingForControlPlaneProviderInitialized
    severity: Info
    status: "False"
    type: ControlPlaneInitialized
  - lastTransitionTime: "2022-04-15T14:32:20Z"
    reason: WaitingForControlPlane
    severity: Info
    status: "False"
    type: ControlPlaneReady
  - lastTransitionTime: "2022-04-15T14:32:20Z"
    message: 0 of 7 completed
    reason: VpcReconciliationFailed
    severity: Warning
    status: "False"
    type: InfrastructureReady
  observedGeneration: 1
  phase: Provisioning

What did you expect to happen: The workload cluster gets created.

Anything else you would like to add:

Cluster-api version:

Kubernetes version: (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.9-eks-0d102a7", GitCommit:"eb09fc479c1b2bfcc35c47416efb36f1b9052d58", GitTreeState:"clean", BuildDate:"2022-02-17T16:36:28Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1

Environment:

OS (e.g. from /etc/os-release):

NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

/kind bug

cluster-info-dump.txt

soodr avatar Apr 15 '22 16:04 soodr

@soodr: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 15 '22 16:04 k8s-ci-robot

Looks like the pod is failing to make SDK calls.

Have you already attached the required permissions to the control plane instances (assuming CAPA controllers are on the EKS control-plane nodes)? https://cluster-api-aws.sigs.k8s.io/topics/eks/prerequisites.html
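One quick way to check which credentials the controller actually has is to decode the bootstrap secret. A minimal sketch, assuming the default secret-based setup that clusterctl init creates (with the profile stored under the credentials key):

kubectl -n capa-system get secret capa-manager-bootstrap-credentials -o jsonpath='{.data.credentials}' | base64 -d

If that shows expired temporary credentials, it would explain the RequestExpired errors.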

sedefsavas avatar Apr 15 '22 16:04 sedefsavas

Yes, I followed the steps here: https://cluster-api.sigs.k8s.io/user/quick-start.html, specifically the command

clusterawsadm bootstrap iam create-cloudformation-stack
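
and, from the same guide, the credential encoding and provider initialization steps (quoting the quick start; these assume a working default AWS profile):

export AWS_B64ENCODED_CREDENTIALS=$(clusterawsadm bootstrap credentials encode-as-profile)
clusterctl init --infrastructure aws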

soodr avatar Apr 15 '22 17:04 soodr

@soodr I had this issue; I solved it by updating my base64-encoded credentials in the secret capa-manager-bootstrap-credentials (mine had expired, as we use SSO/MFA).
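
Roughly like this (a sketch, assuming clusterawsadm can read the refreshed profile from your environment and that the secret stores the profile under the default credentials key):

# re-encode the refreshed AWS profile the same way clusterctl init does
export AWS_B64ENCODED_CREDENTIALS=$(clusterawsadm bootstrap credentials encode-as-profile)
# write it back into the bootstrap secret
kubectl -n capa-system patch secret capa-manager-bootstrap-credentials \
  --type merge -p "{\"data\":{\"credentials\":\"${AWS_B64ENCODED_CREDENTIALS}\"}}"
# restart the controller so it picks up the new secret
kubectl -n capa-system rollout restart deployment capa-controller-manager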

chrism417 avatar May 24 '22 16:05 chrism417

Hello! We are also using a "non-static-creds" way to connect to AWS (a user does an ADFS login to a first AWS account, and a role is assumed in a second AWS account to create the infrastructure), and I guess we have the same kind of problem with Cluster API:

I am interested in the "clean" way to use Cluster API in such environments (aside from manually updating the CAPA credentials when the session expires). Is there such a way, or do we have to use "static" AWS access_key/secret_key pairs for Cluster API to work without issues?

yogeek avatar Jun 24 '22 16:06 yogeek

I came across this issue even though I have static credentials.

  • KinD v0.12.0 on MacOS to run management cluster
  • clusterctl 1.2.0
  • clusterawsadm 1.4.1

digihunch avatar Aug 02 '22 02:08 digihunch

I came across this issue even though I have static credentials.

  • KinD v0.12.0 on MacOS to run management cluster
  • clusterctl 1.2.0
  • clusterawsadm 1.4.1

Do you see the same error as in this issue? Because this looks like a temporary-credential issue:

E0414 23:25:49.734323 1 controller.go:317] controller/awscluster "msg"="Reconciler error" "error"="failed to create new vpc: failed to create vpc: RequestExpired: Request has expired.\n\tstatus code: 400, request id: 49625214-edd8-47d3-bf31-64e29ba9cd0a" "name"="capi-quickstart" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AWSCluster"

sedefsavas avatar Aug 11 '22 18:08 sedefsavas

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 09 '22 19:11 k8s-triage-robot

Hey @sedefsavas, I refreshed my credentials, but even after that I see this error. Do we need to reinitialize the AWS infra via clusterctl? It seems the capa-controller isn't picking up the new credentials. Could you share how we could ascertain this?
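
For example, would a restart like this be enough to force a re-read of the mounted secret (assuming the default capa-system install)?

kubectl -n capa-system rollout restart deployment capa-controller-manager
kubectl -n capa-system logs deployment/capa-controller-manager -c manager --tail=50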

mjnovice avatar Nov 09 '22 19:11 mjnovice

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 09 '22 20:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 08 '23 20:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 08 '23 20:01 k8s-ci-robot