
[Bug] kube-proxy image version 1.27 causing the kube-proxy to fail

Open artemisia480 opened this issue 1 year ago • 12 comments

What were you trying to accomplish?

Trying to deploy a new cluster, version 1.27, using eksctl. I am running the command: eksctl create cluster...

What happened?

I get the following error and the nodes for the cluster never come up. Looking at the logs inside the node, I see this error: ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests v1.27.1-minimal-eksbuild.1]

How to reproduce it?

I am using a YAML file to deploy this, so I am not sure how you would reproduce it. But if you look at the AWS documentation here: https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html the image is meant to be eksbuild.2 and not eksbuild.1, and if you look at the eksctl code here: https://github.com/eksctl-io/eksctl/blob/c27d2e80f50aceb78c35c60b713f8e9267611dde/pkg/addons/default/kube_proxy.go#L150C1-L151 it is only calling eksbuild.1 and not eksbuild.2.
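To make the tag comparison above concrete, here is a minimal sketch that splits the image reference from the error message into its parts and pulls out the `eksbuild` number. The names `EksImage` and `parse_eks_image` and the regular expression are illustrative assumptions, not part of eksctl or any AWS tooling; the build number it reports still has to be compared by hand against the AWS documentation page.

```python
import re
from typing import NamedTuple


class EksImage(NamedTuple):
    account: str
    region: str
    repository: str
    tag: str
    eksbuild: int


# Assumed shape of a private ECR reference:
# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_REF = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_eks_image(ref: str) -> EksImage:
    """Split an ECR image reference and extract the trailing eksbuild number."""
    m = _ECR_REF.match(ref)
    if m is None:
        raise ValueError(f"not an ECR image reference: {ref!r}")
    build = re.search(r"eksbuild\.(\d+)$", m["tag"])
    if build is None:
        raise ValueError(f"tag has no eksbuild suffix: {m['tag']!r}")
    return EksImage(
        m["account"], m["region"], m["repository"], m["tag"], int(build.group(1))
    )


# The reference taken verbatim from the ErrImagePull message.
image = parse_eks_image(
    "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1"
)
print(image.region, image.repository, image.tag, image.eksbuild)
```

Running the same function over the image reported by a working cluster makes it easy to see whether the two differ in region, repository, or build number.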

Logs

ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests v1.27.1-minimal-eksbuild.1]

Anything else we need to know?

Versions 1.27

$ eksctl info

artemisia480 avatar Aug 21 '23 14:08 artemisia480

Hello artemisia480 :wave: Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

github-actions[bot] avatar Aug 21 '23 14:08 github-actions[bot]

Thanks @artemisia480 same problem here.

yoplait avatar Aug 21 '23 17:08 yoplait

> I am running the command: eksctl create cluster...

@artemisia480 did you run any commands after eksctl create cluster, or did you try to update the image?

> and if you look at the eksctl code here: https://github.com/eksctl-io/eksctl/blob/c27d2e80f50aceb78c35c60b713f8e9267611dde/pkg/addons/default/kube_proxy.go#L150C1-L151 it is only calling eksbuild.1 and not 2.

That codepath is not used in eksctl create cluster.

cPu1 avatar Aug 22 '23 08:08 cPu1

I'm unable to reproduce this. I got the same image tag (602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1) on a new cluster and it was pulled successfully.

Can you share your config file?

cPu1 avatar Aug 22 '23 08:08 cPu1

@cPu1, the code doesn't use it? Are you sure? The AWS documentation says to use eksbuild.2, and this clearly pulls eksbuild.1. Here is my YAML file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ami-testing-cluster2
  version: "1.27"
  region: us-east-1

vpc:
  clusterEndpoints:
    publicAccess: true
    privateAccess: false

managedNodeGroups:
  - name: ami-testing2
    ami:  <custom ami>
    amiFamily: AmazonLinux2
    instanceType: m6i.large
    volumeSize: 20
    disableIMDSv1: false
    ssh:
      allow: true
      publicKeyPath: ~/.ssh/id_rsa.pub
    overrideBootstrapCommand: |
      #!/bin/bash
      eks_register.sh ami-testing-cluster2
    iam:
      withAddonPolicies:
        externalDNS: true
        ebs: true
        autoScaler: true
        cloudWatch: false
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

artemisia480 avatar Aug 22 '23 11:08 artemisia480

Could this be an issue in a specific region? @artemisia480 do you have any clusters in other regions to confirm this?

a-hilaly avatar Aug 23 '23 13:08 a-hilaly

@a-hilaly not sure why it would be region specific? But I can test a different region just to see.

artemisia480 avatar Aug 24 '23 10:08 artemisia480

@artemisia480 not really sure, but if it's a pull issue, maybe the image is not available in every region. Or are we using ECR public here? I'll try to replicate the same bug locally and update here.

a-hilaly avatar Aug 24 '23 11:08 a-hilaly

@artemisia480 I haven't been able to reproduce your issue through 4/5 creations in different regions... maybe this is an issue with the custom AMI?

a-hilaly avatar Aug 24 '23 19:08 a-hilaly

@a-hilaly thanks for testing that! I am starting to think it is the custom AMI after all, though I am not sure what exactly. I had the following flags in the AMI for 1.26, which I have now removed for 1.27:

KUBELET_EKS_ARGS=--node-ip=192.168.22.222
--pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1
--cloud-provider aws
--config /etc/kubernetes/kubelet.json
--kubeconfig /etc/kubernetes/kubeconfig
--container-runtime remote
--container-runtime-endpoint unix:///var/run/containerd/containerd.sock

I also added the flag: --seccomp-default=unconfined.

But having no luck.

artemisia480 avatar Aug 25 '23 10:08 artemisia480

Do you run any extra commands after creating the cluster? Any DaemonSet updates?

a-hilaly avatar Sep 05 '23 16:09 a-hilaly

@artemisia480 I got a similar error when I added a containerd node group to an EKS 1.23 cluster. The containerd nodes could not pull an ECR image and reported the pull failure shown below, but the dockerd nodes in the same cluster could pull the exact same image. My test cluster was in a VPC that did not have an ECR endpoint, in case that is relevant.

There seems to be something extra that containerd nodes need. @a-hilaly any idea what that might be?

Pulling image "XXXXXX.dkr.ecr.ap-southeast-2.amazonaws.com/mycontainer:1.0.1"
Warning Failed 8s (x3 over 47s) kubelet Failed to pull image "XXXXXX.dkr.ecr.ap-southeast-2.amazonaws.com/mycontainer:1.0.1": rpc error: code = NotFound desc = failed to pull and unpack image "XXXXXXX.dkr.ecr.ap-southeast-2.amazonaws.com/mycontainer:1.0.1": failed to copy: httpReadSeeker: failed open: could not fetch content descriptor sha256:d713dedd5b37c3ffea46d23c7933cc173c7755c789eab3bc60ea374cb5af740f (application/vnd.docker.distribution.manifest.v1+json) from remote: not found
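
The useful details in this event are buried at the end of a long line. As a rough sketch, the digest and manifest media type can be pulled out with a regular expression; the function name `parse_descriptor` and the pattern are illustrative assumptions based only on the message above, and kubelet or containerd may word other failures differently. The media type here is the Docker schema 1 manifest type; whether that is why the containerd nodes fail while the dockerd nodes succeed is not established in this thread.

```python
import re
from typing import Optional, Tuple

# Assumed wording, copied from the event above:
# "could not fetch content descriptor <digest> (<media type>) from remote"
_DESCRIPTOR = re.compile(
    r"content descriptor (?P<digest>sha256:[0-9a-f]+) \((?P<media_type>[^)]+)\)"
)


def parse_descriptor(message: str) -> Optional[Tuple[str, str]]:
    """Return (digest, media_type) from a pull error, or None if absent."""
    m = _DESCRIPTOR.search(message)
    if m is None:
        return None
    return m["digest"], m["media_type"]


event = (
    "failed to copy: httpReadSeeker: failed open: could not fetch content "
    "descriptor sha256:d713dedd5b37c3ffea46d23c7933cc173c7755c789eab3bc60ea374cb5af740f "
    "(application/vnd.docker.distribution.manifest.v1+json) from remote: not found"
)
result = parse_descriptor(event)
print(result)
```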

whereisaaron avatar Sep 09 '23 14:09 whereisaaron