[Bug] Karpenter v0.32.4 does not work when deployed via eksctl

Open iankouls-aws opened this issue 1 year ago • 10 comments

Summary: Karpenter deployment is successful but it fails to create new nodes

What were you trying to accomplish?

An EKS cluster was created with eksctl version 0.167.0 from the following manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: do-eks-yaml-karpenter
  version: "1.28"
  region: us-west-2
  tags:
    karpenter.sh/discovery: do-eks-yaml-karpenter

iam:
  withOIDC: true

addons:
  - name: aws-ebs-csi-driver
    version: v1.26.0-eksbuild.1
    wellKnownPolicies:
      ebsCSIController: true

karpenter:
  version: 'v0.32.4'
  createServiceAccount: true
  #  defaultInstanceProfile: 'KarpenterInstanceProfile'
  withSpotInterruptionQueue: true

managedNodeGroups:
  - name: c5-xl-do-eks-karpenter-ng
    instanceType: c5.xlarge
    instancePrefix: c5-xl
    privateNetworking: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 10
    volumeSize: 300
    iam:
      withAddonPolicies:
        cloudWatch: true
        ebs: true

eksctl create cluster -f ./eks-karpenter.yaml

The cluster creation finishes successfully. See logs below.

Next, apply a NodePool and an EC2NodeClass, then create a deployment that requires a GPU. The pod enters the Pending state, and Karpenter is expected to add a GPU node to the cluster.

What happened?

No nodes get added to the cluster. Karpenter pods are in the Running state. Karpenter pod logs show errors:

[karpenter-84bf6fff97-v5v2k] {"level":"ERROR","time":"2024-01-06T09:15:56.457Z","logger":"controller","message":"Reconciler error","commit":"fdf67d0","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"fe2de351-d378-4d82-aff7-556160f4d128","error":"creating instance profile, getting instance profile "do-eks-yaml-karpenter_4067990795380418201", AccessDenied: User: arn:aws:sts::<account_id>:assumed-role/eksctl-do-eks-yaml-karpenter-iamservice-role/1704531887056119458 is not authorized to perform: iam:GetInstanceProfile on resource: instance profile do-eks-yaml-karpenter_4067990795380418201 because no identity-based policy allows the iam:GetInstanceProfile action\n\tstatus code: 403, request id: f3a80d84-31cc-44ad-a6a4-91b4d3e56de3"}

How to reproduce it?

  1. Create cluster using the manifest shared above and command eksctl create cluster -f ./eks-karpenter.yaml
  2. Create NodePool and EC2NodeClass by cloning project https://github.com/aws-samples/aws-do-eks, and executing script Container-Root/eks/deployment/karpenter/provisioner-deploy-v1beta1.sh
  3. Create a deployment with requests and limits of 1 nvidia.com/gpu by running the script https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/horizontal-pod-autoscaler/hpa-example/run.sh
  4. Tail karpenter pod logs: kubectl -n karpenter logs -f $(kubectl -n karpenter get pod | grep karpenter | head -n 1 | cut -d ' ' -f 1)

Logs

Cluster creation log:

eksctl create cluster -f /aws-do-eks/Container-Root/eks/conf/eksctl/yaml/eks-karpenter.yaml

2024-01-06 05:42:46 [ℹ]  eksctl version 0.167.0
2024-01-06 05:42:46 [ℹ]  using region us-west-2
2024-01-06 05:42:47 [ℹ]  setting availability zones to [us-west-2b us-west-2a us-west-2c]
2024-01-06 05:42:47 [ℹ]  subnets for us-west-2b - public:192.168.0.0/19 private:192.168.96.0/19
2024-01-06 05:42:47 [ℹ]  subnets for us-west-2a - public:192.168.32.0/19 private:192.168.128.0/19
2024-01-06 05:42:47 [ℹ]  subnets for us-west-2c - public:192.168.64.0/19 private:192.168.160.0/19
2024-01-06 05:42:47 [ℹ]  nodegroup "c5-xl-do-eks-karpenter-ng" will use "" [AmazonLinux2/1.28]
2024-01-06 05:42:47 [ℹ]  using Kubernetes version 1.28
2024-01-06 05:42:47 [ℹ]  creating EKS cluster "do-eks-yaml-karpenter" in "us-west-2" region with managed nodes
2024-01-06 05:42:47 [ℹ]  1 nodegroup (c5-xl-do-eks-karpenter-ng) was included (based on the include/exclude rules)
2024-01-06 05:42:47 [ℹ]  will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
2024-01-06 05:42:47 [ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
2024-01-06 05:42:47 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=do-eks-yaml-karpenter'
2024-01-06 05:42:47 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "do-eks-yaml-karpenter" in "us-west-2"
2024-01-06 05:42:47 [ℹ]  CloudWatch logging will not be enabled for cluster "do-eks-yaml-karpenter" in "us-west-2"
2024-01-06 05:42:47 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=do-eks-yaml-karpenter'
2024-01-06 05:42:47 [ℹ]  
2 sequential tasks: { create cluster control plane "do-eks-yaml-karpenter", 
    2 sequential sub-tasks: { 
        5 sequential sub-tasks: { 
            wait for control plane to become ready,
            associate IAM OIDC provider,
            2 sequential sub-tasks: { 
                create IAM role for serviceaccount "kube-system/aws-node",
                create serviceaccount "kube-system/aws-node",
            },
            restart daemonset "kube-system/aws-node",
            1 task: { create addons },
        },
        create managed nodegroup "c5-xl-do-eks-karpenter-ng",
    } 
}
2024-01-06 05:42:47 [ℹ]  building cluster stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:42:47 [ℹ]  deploying stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:43:17 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:43:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:44:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:45:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:46:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:47:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:48:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:49:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:50:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:51:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:52:47 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-cluster"
2024-01-06 05:54:48 [ℹ]  building iamserviceaccount stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-kube-system-aws-node"
2024-01-06 05:54:48 [ℹ]  deploying stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-kube-system-aws-node"
2024-01-06 05:54:48 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-kube-system-aws-node"
2024-01-06 05:55:19 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-kube-system-aws-node"
2024-01-06 05:55:19 [ℹ]  serviceaccount "kube-system/aws-node" already exists
2024-01-06 05:55:19 [ℹ]  updated serviceaccount "kube-system/aws-node"
2024-01-06 05:55:19 [ℹ]  daemonset "kube-system/aws-node" restarted
2024-01-06 05:55:19 [ℹ]  building managed nodegroup stack "eksctl-do-eks-yaml-karpenter-nodegroup-c5-xl-do-eks-karpenter-ng"
2024-01-06 05:55:19 [ℹ]  deploying stack "eksctl-do-eks-yaml-karpenter-nodegroup-c5-xl-do-eks-karpenter-ng"
2024-01-06 05:55:19 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-nodegroup-c5-xl-do-eks-karpenter-ng"
2024-01-06 05:55:49 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-nodegroup-c5-xl-do-eks-karpenter-ng"
2024-01-06 05:56:45 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-nodegroup-c5-xl-do-eks-karpenter-ng"
2024-01-06 05:58:03 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-nodegroup-c5-xl-do-eks-karpenter-ng"
2024-01-06 05:58:03 [ℹ]  waiting for the control plane to become ready
2024-01-06 05:58:04 [✔]  saved kubeconfig as "/root/.kube/config"
2024-01-06 05:58:04 [ℹ]  no tasks
2024-01-06 05:58:04 [✔]  all EKS cluster resources for "do-eks-yaml-karpenter" have been created
2024-01-06 05:58:04 [ℹ]  creating role using provided well known policies
2024-01-06 05:58:05 [ℹ]  deploying stack "eksctl-do-eks-yaml-karpenter-addon-aws-ebs-csi-driver"
2024-01-06 05:58:05 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-addon-aws-ebs-csi-driver"
2024-01-06 05:58:35 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-addon-aws-ebs-csi-driver"
2024-01-06 05:59:34 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-addon-aws-ebs-csi-driver"
2024-01-06 05:59:34 [ℹ]  creating addon
2024-01-06 06:00:23 [ℹ]  addon "aws-ebs-csi-driver" active
2024-01-06 06:00:24 [ℹ]  1 task: { create karpenter for stack "do-eks-yaml-karpenter" }
2024-01-06 06:00:24 [ℹ]  building nodegroup stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:00:24 [ℹ]  deploying stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:00:24 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:00:54 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:01:44 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:02:16 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:02:52 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:04:04 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-karpenter"
2024-01-06 06:04:04 [ℹ]  1 task: { create IAM role for serviceaccount "karpenter/karpenter" }
2024-01-06 06:04:04 [ℹ]  1 task: { create IAM role for serviceaccount "karpenter/karpenter" }
2024-01-06 06:04:04 [ℹ]  building iamserviceaccount stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-karpenter-karpenter"
2024-01-06 06:04:04 [ℹ]  deploying stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-karpenter-karpenter"
2024-01-06 06:04:04 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-karpenter-karpenter"
2024-01-06 06:04:34 [ℹ]  waiting for CloudFormation stack "eksctl-do-eks-yaml-karpenter-addon-iamserviceaccount-karpenter-karpenter"
2024-01-06 06:04:34 [ℹ]  adding identity "arn:aws:iam::159553542841:role/eksctl-KarpenterNodeRole-do-eks-yaml-karpenter" to auth ConfigMap
2024-01-06 06:04:34 [ℹ]  adding Karpenter to cluster do-eks-yaml-karpenter
E0106 06:04:35.661520    1614 memcache.go:206] couldn't get resource list for karpenter.k8s.aws/v1alpha1: the server could not find the requested resource
E0106 06:04:35.732564    1614 memcache.go:206] couldn't get resource list for karpenter.k8s.aws/v1beta1: the server could not find the requested resource
E0106 06:04:35.821871    1614 memcache.go:206] couldn't get resource list for karpenter.sh/v1beta1: the server could not find the requested resource
2024-01-06 06:04:50 [ℹ]  kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2024-01-06 06:04:50 [✔]  EKS cluster "do-eks-yaml-karpenter" in "us-west-2" region is ready

Sat Jan  6 06:04:50 UTC 2024
Done creating cluster using /aws-do-eks/Container-Root/eks/conf/eksctl/yaml/eks-karpenter.yaml
/aws-do-eks/Container-Root/eks

Karpenter pod log:

[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:07.220Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:08.221Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:09.221Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:10.222Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:11.222Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:12.222Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:13.223Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"DEBUG","time":"2024-01-06T09:51:14.224Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"fdf67d0"}
[karpenter-84bf6fff97-v5v2k] {"level":"ERROR","time":"2024-01-06T09:51:14.749Z","logger":"controller","message":"Reconciler error","commit":"fdf67d0","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"4020d4ad-afda-4687-9f05-6ed98b8506f3","error":"creating instance profile, getting instance profile \"do-eks-yaml-karpenter_4067990795380418201\", AccessDenied: User: arn:aws:sts::159553542841:assumed-role/eksctl-do-eks-yaml-karpenter-iamservice-role/1704531887056119458 is not authorized to perform: iam:GetInstanceProfile on resource: instance profile do-eks-yaml-karpenter_4067990795380418201 because no identity-based policy allows the iam:GetInstanceProfile action\n\tstatus code: 403, request id: 63aaf857-2b29-45d2-a60e-e6e1fe9889a2"}
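When the log is long, it can help to pull out just the IAM actions that were denied. The following is a minimal sketch, not part of the original report: `denied_actions` is a hypothetical helper name, and the sample line is a fragment of the error quoted in this issue.

```shell
# Fragment of the AccessDenied message quoted in this issue
log_line='AccessDenied: User: arn:aws:sts::<account_id>:assumed-role/eksctl-do-eks-yaml-karpenter-iamservice-role/1704531887056119458 is not authorized to perform: iam:GetInstanceProfile on resource: instance profile do-eks-yaml-karpenter_4067990795380418201'

# Hypothetical helper: read log lines on stdin and print each distinct
# "service:Action" that follows "not authorized to perform:"
denied_actions() {
  grep -o 'not authorized to perform: [A-Za-z0-9]*:[A-Za-z0-9]*' \
    | sed 's/.*perform: //' \
    | sort -u
}

printf '%s\n' "$log_line" | denied_actions
```

Against a live cluster, the same helper could be fed from the log command in the reproduction steps, for example `kubectl -n karpenter logs <karpenter-pod> | denied_actions`.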

Anything else we need to know?

To build the container image for the deployment, execute the following commands from the cloned project directory:

cd Container-Root/eks/deployment/horizontal-pod-autoscaler/hpa-example
./build.sh
./push.sh

Older versions of Karpenter (e.g. 0.29.0) used with Provisioner and AWSNodeTemplate work as expected. In this case the v1alpha5 API is used: https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/karpenter/provisioner-deploy-v1alpha5.sh

Karpenter also works as expected when the cluster is created without Karpenter and Karpenter v0.32.4 is then deployed by following the instructions here: https://karpenter.sh/v0.32/getting-started/getting-started-with-karpenter/#4-install-karpenter

It appears that eksctl lacks support for the versions of Karpenter that use the v1beta1 API.

Versions

$ eksctl info
eksctl version: 0.167.0
kubectl version: v1.28.2
OS: linux

iankouls-aws avatar Jan 06 '24 10:01 iankouls-aws

Hello iankouls-aws :wave: Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

github-actions[bot] avatar Jan 06 '24 10:01 github-actions[bot]

https://github.com/eksctl-io/eksctl/blob/9575570b554610129382d0a181645a3806cea98f/pkg/cfn/builder/karpenter.go#L151-L169

Looks like the controller policy is missing the AllowPassingInstanceRole defined here:

https://github.com/aws/karpenter-provider-aws/blob/daeb5da355fce14f718f51c4956ca8f9319103dd/website/content/en/docs/getting-started/getting-started-with-karpenter/cloudformation.yaml#L186-L196

yuxiang-zhang avatar Jan 25 '24 22:01 yuxiang-zhang

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 25 '24 01:02 github-actions[bot]

Hi @yuxiang-zhang. Can you please assign this issue to me?

ibnjunaid avatar Feb 27 '24 05:02 ibnjunaid

I fixed this issue in my cluster by manually adding these permissions to the eksctl-KarpenterControllerPolicy-CLUSTERNAME policy:

  • iam:GetInstanceProfile
  • iam:CreateInstanceProfile
  • iam:TagInstanceProfile
  • iam:AddRoleToInstanceProfile

These are apparently missing when Karpenter is configured by eksctl.
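One possible way to script this workaround is sketched below. It does not edit the managed policy named above. Instead it writes the four actions into a separate inline policy document for the role. The role name is taken from the AccessDenied message in this issue, while the policy name `KarpenterInstanceProfileAccess`, the file name, and `"Resource": "*"` are assumptions for illustration. The upstream Karpenter CloudFormation template scopes these actions more tightly, so narrow the statement before using it anywhere that matters.

```shell
# Assumption: role name copied from the AccessDenied message in this issue;
# substitute the role used by your own Karpenter service account.
ROLE_NAME="eksctl-do-eks-yaml-karpenter-iamservice-role"
# Hypothetical names chosen for this sketch
POLICY_NAME="KarpenterInstanceProfileAccess"
POLICY_FILE="karpenter-instance-profile-policy.json"

# Write a policy document granting the four actions listed above.
# "Resource": "*" is deliberately broad and only meant as a starting point.
cat > "$POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowInstanceProfileActions",
      "Effect": "Allow",
      "Action": [
        "iam:GetInstanceProfile",
        "iam:CreateInstanceProfile",
        "iam:TagInstanceProfile",
        "iam:AddRoleToInstanceProfile"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# Print the AWS CLI command for review rather than running it here,
# since it needs credentials and changes IAM state.
echo "aws iam put-role-policy --role-name $ROLE_NAME --policy-name $POLICY_NAME --policy-document file://$POLICY_FILE"
```

Printing the command first leaves room to review the generated JSON before it is attached to the role.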

pstast avatar Mar 28 '24 14:03 pstast

Thanks @pstast for the policies. I am yet to raise a PR for the issue.

ibnjunaid avatar Mar 28 '24 15:03 ibnjunaid

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 28 '24 01:04 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar May 04 '24 01:05 github-actions[bot]

I ran into this issue with 0.36.2 as well, and @pstast's recommended fix resolved it for me.

siennathesane avatar May 23 '24 02:05 siennathesane

Still there with karpenter 0.37.0 and eksctl 0.183.0

piotrblasiak avatar Jun 18 '24 18:06 piotrblasiak