amazon-eks-ami

Kubelet doesn't start on nodes

Open wh1sssss opened this issue 4 years ago • 5 comments

Hello folks,

I need help with the containerd runtime on worker nodes running kubelet version 1.21.

Note the image I am using, taken from:

https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html

I have created my cluster with the file below:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: eks-xxx
  region: us-east-1
  version: "1.21"
  tags:
    managed: eksctl
    owner: eks-xxx

vpc:
  id: "vpc-xxxx"
  securityGroup: "sg-xxx"
  subnets:
    private:
        us-east-1a:
          id: "subnet-xxxx"
        us-east-1b:
          id: "subnet-xxxx"
        us-east-1c:
          id: "subnet-xxxx"
    public:
        us-east-1a:
          id: "subnet-yyyyy"
        us-east-1b:
          id: "subnet-yyyyy"
        us-east-1c:
          id: "subnet-yyyyy"

nodeGroups:
  - name: gp01
    volumeSize: 50
    volumeType: gp2
    ami: ami-0193ebf9573ebc9f7
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh eks-xxx --container-runtime containerd
    privateNetworking: true
    minSize: 1
    maxSize: 2
    instancesDistribution:
        maxPrice: 0.095000
        instanceTypes:
          - "t4g.xlarge"
          - "t3.xlarge"
          - "t2.xlarge"
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotInstancePools: 4
    securityGroups:
        withShared: true
        withLocal: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::983054731403:policy/eks-xxx-xxx
      withAddonPolicies:
        ALBIngress: true
        AutoScaler: true
        CertManager: true
        CloudWatch: true
        EBS: true
        EFS: true
        ExternalDNS: true
        ImageBuilder: true # ECR
        XRay: true
    ssh:
      publicKeyPath: ./keys/eks-xxx

cloudWatch:
  clusterLogging:
    enableTypes: ["audit", "authenticator"]

Everything went well until the node group was created: the nodes are created but do not join the cluster. After logging in to a node, I found the following kubelet logs.

Command to see logs:

sudo cat /var/log/messages

Logs:

Sep  9 20:39:33 ip-10-53-2-218 systemd: Started Session c320 of user root.
Sep  9 20:39:33 ip-10-53-2-218 containerd: time="2021-09-09T20:39:33.807930482Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{},XXX_unrecognized:[],}"
Sep  9 20:39:33 ip-10-53-2-218 containerd: time="2021-09-09T20:39:33.821960041Z" level=info msg="ImageUpdate event &ImageUpdate{Name:sha256:106a8e54d5eb3f70fcd1ed46255bdf232b3f169e89e68e13e4e67b25f59c1315,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
Sep  9 20:39:33 ip-10-53-2-218 containerd: time="2021-09-09T20:39:33.822587955Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1: resolving      |#033[32m#033[0m--------------------------------------|
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: elapsed: 0.1 s                                                         total:   0.0 B (0.0 B/s)
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1:            resolved       |#033[32m++++++++++++++++++++++++++++++++++++++#033[0m|
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: index-sha256:1cb4ab85a3480446f9243178395e6bee7350f0d71296daeb6a9fdd221e23aea6:    done           |#033[32m++++++++++++++++++++++++++++++++++++++#033[0m|
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: manifest-sha256:234b8785dd78afc0fbb27edad009e7eb253e5685fb7387d4f0145f65c00873ac: done           |#033[32m++++++++++++++++++++++++++++++++++++++#033[0m|
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: layer-sha256:41d8806bd3d23e1ffb7e9825fa56a0c2e851dfeeb405477ab1d6bc3a34bc0da2:    done           |#033[32m++++++++++++++++++++++++++++++++++++++#033[0m|
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: config-sha256:106a8e54d5eb3f70fcd1ed46255bdf232b3f169e89e68e13e4e67b25f59c1315:   done           |#033[32m++++++++++++++++++++++++++++++++++++++#033[0m|
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: elapsed: 0.2 s                                                                    total:   0.0 B (0.0 B/s)
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: unpacking linux/amd64 sha256:1cb4ab85a3480446f9243178395e6bee7350f0d71296daeb6a9fdd221e23aea6...
Sep  9 20:39:33 ip-10-53-2-218 pull-sandbox-image.sh: done
Sep  9 20:39:33 ip-10-53-2-218 systemd: Started pull sandbox image defined in containerd config.toml.
Sep  9 20:39:33 ip-10-53-2-218 systemd: Failed to load environment files: No such file or directory
Sep  9 20:39:33 ip-10-53-2-218 systemd: kubelet.service failed to run 'start-pre' task: No such file or directory
Sep  9 20:39:33 ip-10-53-2-218 systemd: Failed to start Kubernetes Kubelet.
Sep  9 20:39:33 ip-10-53-2-218 systemd: Unit kubelet.service entered failed state.
Sep  9 20:39:33 ip-10-53-2-218 systemd: kubelet.service failed.
Sep  9 20:39:33 ip-10-53-2-218 systemd: Removed slice User Slice of root.
Sep  9 20:39:38 ip-10-53-2-218 systemd: kubelet.service holdoff time over, scheduling restart.
Sep  9 20:39:38 ip-10-53-2-218 systemd: Stopped Kubernetes Kubelet.
Sep  9 20:39:38 ip-10-53-2-218 systemd: Starting pull sandbox image defined in containerd config.toml...
Sep  9 20:39:39 ip-10-53-2-218 systemd: Created slice User Slice of root.

I tried creating an empty containerd-config.toml file as a test, but that produced other errors.

Thanks all.

wh1sssss avatar Sep 09 '21 21:09 wh1sssss

Hi,

Could you check the status of sandbox-image, containerd, and kubelet? Essentially:

  • systemctl status sandbox-image -l
  • systemctl status containerd -l
  • systemctl status kubelet -l
  • Also check if there are any errors in /var/log/cloud-init-output.log

Thanks,

ravisinha0506 avatar Sep 10 '21 17:09 ravisinha0506
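
The checks requested above can be gathered in one pass. The following is a minimal sketch, not part of the AMI: `collect_node_diagnostics` is a hypothetical helper name, and the `error|fail` pattern is an assumed filter for the cloud-init log. It relies only on the `systemctl status` calls and the log path named in the comment above.

```shell
#!/usr/bin/env bash
# Sketch: collect the three service statuses and the cloud-init errors in one report.
# collect_node_diagnostics is a hypothetical helper, not something shipped with the AMI.

collect_node_diagnostics() {
  # Optional first argument overrides the log path (assumption, handy for dry runs).
  local log="${1:-/var/log/cloud-init-output.log}"
  local unit

  for unit in sandbox-image containerd kubelet; do
    echo "=== systemctl status ${unit} ==="
    # -l avoids truncated lines; --no-pager keeps the call non-interactive.
    # A failed unit makes systemctl exit non-zero, so the status is ignored here.
    systemctl status "${unit}" -l --no-pager 2>&1 || true
  done

  echo "=== possible errors in ${log} ==="
  # Case-insensitive search with line numbers; no match is not treated as a failure.
  grep -inE 'error|fail' "${log}" 2>&1 || true
}
```

Run it from a root shell on the node (the thread uses `sudo` for the same commands), for example `collect_node_diagnostics > /tmp/node-diagnostics.txt`, and attach the resulting file to the issue.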

Hello @ravisinha0506, thanks for the answer!

systemctl status sandbox-image -l

$ sudo systemctl status sandbox-image -l
● sandbox-image.service - pull sandbox image defined in containerd config.toml
   Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
   Active: activating (start) since sex 2021-09-10 18:22:01 UTC; 247ms ago
 Main PID: 8511 (bash)
    Tasks: 2
   Memory: 12.0M
   CGroup: /system.slice/sandbox-image.service
           ├─8511 bash /etc/eks/containerd/pull-sandbox-image.sh
           └─8545 /usr/bin/python2 -s /usr/bin/aws ecr get-login-password --region us-east-1

set 10 18:22:01 ip-10-53-2-43.ec2.internal systemd[1]: Starting pull sandbox image defined in containerd config.toml...

systemctl status containerd -l

$ sudo systemctl status containerd -l
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since sex 2021-09-10 18:16:19 UTC; 6min ago
     Docs: https://containerd.io
 Main PID: 3010 (containerd)
    Tasks: 12
   Memory: 70.1M
   CGroup: /system.slice/containerd.service
           └─3010 /usr/bin/containerd

set 10 18:22:27 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:27.202994749Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
set 10 18:22:33 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:33.262510241Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{},XXX_unrecognized:[],}"
set 10 18:22:33 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:33.273228096Z" level=info msg="ImageUpdate event &ImageUpdate{Name:sha256:106a8e54d5eb3f70fcd1ed46255bdf232b3f169e89e68e13e4e67b25f59c1315,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
set 10 18:22:33 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:33.273757170Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
set 10 18:22:39 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:39.888685985Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{},XXX_unrecognized:[],}"
set 10 18:22:39 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:39.895833426Z" level=info msg="ImageUpdate event &ImageUpdate{Name:sha256:106a8e54d5eb3f70fcd1ed46255bdf232b3f169e89e68e13e4e67b25f59c1315,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
set 10 18:22:39 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:39.896594971Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
set 10 18:22:47 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:47.481357165Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{},XXX_unrecognized:[],}"
set 10 18:22:47 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:47.486029158Z" level=info msg="ImageUpdate event &ImageUpdate{Name:sha256:106a8e54d5eb3f70fcd1ed46255bdf232b3f169e89e68e13e4e67b25f59c1315,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"
set 10 18:22:47 ip-10-53-2-43.ec2.internal containerd[3010]: time="2021-09-10T18:22:47.487458911Z" level=info msg="ImageUpdate event &ImageUpdate{Name:602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.1-eksbuild.1,Labels:map[string]string{io.cri-containerd.image: managed,},XXX_unrecognized:[],}"

systemctl status kubelet -l

$ sudo systemctl status kubelet -l
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-eksctl.al2.conf, 10-kubelet-args.conf
   Active: activating (auto-restart) (Result: resources) since sex 2021-09-10 18:24:54 UTC; 2s ago
     Docs: https://github.com/kubernetes/kubernetes

set 10 18:24:54 ip-10-53-2-43.ec2.internal systemd[1]: Failed to load environment files: No such file or directory
set 10 18:24:54 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed to run 'start-pre' task: No such file or directory
set 10 18:24:54 ip-10-53-2-43.ec2.internal systemd[1]: Failed to start Kubernetes Kubelet.
set 10 18:24:54 ip-10-53-2-43.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
set 10 18:24:54 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed.

cat /var/log/cloud-init-output.log

$ sudo cat /var/log/cloud-init-output.log
Cloud-init v. 19.3-44.amzn2 running 'init-local' at Fri, 10 Sep 2021 18:16:01 +0000. Up 12.41 seconds.
Cloud-init v. 19.3-44.amzn2 running 'init' at Fri, 10 Sep 2021 18:16:03 +0000. Up 14.18 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: |  eth0  | True |          10.53.2.43         | 255.255.255.0 | global | 0a:33:ea:bb:98:59 |
ci-info: |  eth0  | True | fe80::833:eaff:febb:9859/64 |       .       |  link  | 0a:33:ea:bb:98:59 |
ci-info: |   lo   | True |          127.0.0.1          |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |           ::1/128           |       .       |  host  |         .         |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: +++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: | Route |   Destination   |  Gateway  |     Genmask     | Interface | Flags |
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: |   0   |     0.0.0.0     | 10.53.2.1 |     0.0.0.0     |    eth0   |   UG  |
ci-info: |   1   |    10.53.2.0    |  0.0.0.0  |  255.255.255.0  |    eth0   |   U   |
ci-info: |   2   | 169.254.169.254 |  0.0.0.0  | 255.255.255.255 |    eth0   |   UH  |
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   9   |  fe80::/64  |    ::   |    eth0   |   U   |
ci-info: |   11  |    local    |    ::   |    eth0   |   U   |
ci-info: |   12  |  multicast  |    ::   |    eth0   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 19.3-44.amzn2 running 'modules:config' at Fri, 10 Sep 2021 18:16:05 +0000. Up 16.35 seconds.
Loaded plugins: priorities, update-motd, versionlock
No packages needed for security; 0 packages available
No packages marked for update
Cloud-init v. 19.3-44.amzn2 running 'modules:final' at Fri, 10 Sep 2021 18:16:12 +0000. Up 23.79 seconds.
Created symlink from /etc/systemd/system/multi-user.target.wants/containerd.service to /usr/lib/systemd/system/containerd.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/sandbox-image.service to /etc/systemd/system/sandbox-image.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
Job for kubelet.service failed because a configured resource limit was exceeded. See "systemctl status kubelet.service" and "journalctl -xe" for details.
Exited with error on line 447
Sep 10 18:16:24 cloud-init[2779]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
Sep 10 18:16:24 cloud-init[2779]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 10 18:16:24 cloud-init[2779]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 19.3-44.amzn2 finished at Fri, 10 Sep 2021 18:16:24 +0000. Datasource DataSourceEc2.  Up 35.27 seconds

sudo journalctl -u kubelet.service

$ sudo journalctl -u kubelet.service
-- Logs begin at sex 2021-09-10 18:15:50 UTC, end at sex 2021-09-10 18:35:30 UTC. --
set 10 18:16:24 ip-10-53-2-43.ec2.internal systemd[1]: Failed to load environment files: No such file or directory
set 10 18:16:24 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed to run 'start-pre' task: No such file or directory
set 10 18:16:24 ip-10-53-2-43.ec2.internal systemd[1]: Failed to start Kubernetes Kubelet.
set 10 18:16:24 ip-10-53-2-43.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
set 10 18:16:24 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed.
set 10 18:16:29 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
set 10 18:16:29 ip-10-53-2-43.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
set 10 18:16:30 ip-10-53-2-43.ec2.internal systemd[1]: Failed to load environment files: No such file or directory
set 10 18:16:30 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed to run 'start-pre' task: No such file or directory
set 10 18:16:30 ip-10-53-2-43.ec2.internal systemd[1]: Failed to start Kubernetes Kubelet.
set 10 18:16:30 ip-10-53-2-43.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
set 10 18:16:30 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed.
set 10 18:16:35 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
set 10 18:16:35 ip-10-53-2-43.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
set 10 18:16:36 ip-10-53-2-43.ec2.internal systemd[1]: Failed to load environment files: No such file or directory
set 10 18:16:36 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed to run 'start-pre' task: No such file or directory
set 10 18:16:36 ip-10-53-2-43.ec2.internal systemd[1]: Failed to start Kubernetes Kubelet.
set 10 18:16:36 ip-10-53-2-43.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
set 10 18:16:36 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed.
set 10 18:16:42 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
set 10 18:16:42 ip-10-53-2-43.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
set 10 18:16:43 ip-10-53-2-43.ec2.internal systemd[1]: Failed to load environment files: No such file or directory
set 10 18:16:43 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed to run 'start-pre' task: No such file or directory
set 10 18:16:43 ip-10-53-2-43.ec2.internal systemd[1]: Failed to start Kubernetes Kubelet.
set 10 18:16:43 ip-10-53-2-43.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
set 10 18:16:43 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed.
set 10 18:16:48 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
set 10 18:16:48 ip-10-53-2-43.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
set 10 18:16:49 ip-10-53-2-43.ec2.internal systemd[1]: Failed to load environment files: No such file or directory
set 10 18:16:49 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed to run 'start-pre' task: No such file or directory
set 10 18:16:49 ip-10-53-2-43.ec2.internal systemd[1]: Failed to start Kubernetes Kubelet.
set 10 18:16:49 ip-10-53-2-43.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
set 10 18:16:49 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed.

I want to use the x86 image. I asked for help in the eksctl issues, but following the steps they gave me, eksctl uses the arm64 image by default. If you want more information, see:

https://github.com/weaveworks/eksctl/issues/4206

Thanks again!!!!!

wh1sssss avatar Sep 10 '21 18:09 wh1sssss

@wh1sssss, is this still an issue? It looks like kubelet is not coming up due to this:

set 10 18:16:24 ip-10-53-2-43.ec2.internal systemd[1]: Failed to load environment files: No such file or directory
set 10 18:16:24 ip-10-53-2-43.ec2.internal systemd[1]: kubelet.service failed to run 'start-pre' task: No such file or directory

If this is still an active issue, could you share the EKS AMI on which this is happening?

ravisinha0506 avatar Nov 23 '21 21:11 ravisinha0506
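
systemd reports "Failed to load environment files" when an `EnvironmentFile=` path in the unit or one of its drop-ins does not exist and is not marked optional with a leading `-`. One way to find which path is missing is sketched below. `find_missing_env_files` is a hypothetical helper name; it reads unit text on stdin and does not handle wildcard paths.

```shell
#!/usr/bin/env bash
# Sketch: print every required EnvironmentFile= path that does not exist.
# find_missing_env_files is a hypothetical helper; it expects unit-file text on stdin.

find_missing_env_files() {
  local line path
  while IFS= read -r line; do
    case "${line}" in
      EnvironmentFile=*) path="${line#EnvironmentFile=}" ;;
      *) continue ;;
    esac
    case "${path}" in
      # A leading "-" marks the file as optional, so its absence is not an error.
      -*) continue ;;
      # An empty assignment only resets the list of environment files.
      "") continue ;;
    esac
    [ -e "${path}" ] || echo "${path}"
  done
}

# Usage on the node (systemctl cat prints the unit together with its drop-ins):
#   systemctl cat kubelet | find_missing_env_files
```

Any path it prints is a file the kubelet unit requires but that was never written, which points at whichever bootstrap step was expected to create it.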

@ravisinha0506 Hello, it is still an issue!

I use the image ami-0193ebf9573ebc9f7.

I need to use amd64 EC2 instances with the containerd CRI on EKS 1.21, but I can't get it working, so for now I use amd64 with dockershim.

Thanks for the answer.

wh1sssss avatar Nov 24 '21 14:11 wh1sssss

@wh1sssss you have 3 different instance classes in your node group, which might be causing your issues. I'd drop the t2, as it's not a Nitro-backed instance type, and then for testing purposes use either a t3 (x86) or a t4g (arm).

stevehipwell avatar Nov 29 '21 16:11 stevehipwell
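
The mix of architectures described above can be spotted before the node group is created. The sketch below is a rough check only: `guess_instance_arch` and `check_nodegroup_arch` are hypothetical helpers, and the rule "a `g` right after the generation digit means Graviton" is a naming-convention assumption that does not cover every family (for example `mac2`). It also assumes the AMI is x86_64, as the `unpacking linux/amd64` log line earlier in the thread suggests.

```shell
#!/usr/bin/env bash
# Sketch: flag instance types whose architecture differs from the AMI's.
# Both helpers are hypothetical and rely on a naming heuristic, not on the EC2 API.

guess_instance_arch() {
  # Keep only the family part, e.g. "t4g" from "t4g.xlarge".
  local family="${1%%.*}"
  case "${family}" in
    a1)        echo arm64 ;;   # first-generation Graviton family
    *[0-9]g*)  echo arm64 ;;   # e.g. t4g, m6gd, c7gn
    *)         echo x86_64 ;;
  esac
}

check_nodegroup_arch() {
  # First argument: AMI architecture; remaining arguments: instance types.
  local want="$1"; shift
  local itype arch status=0
  for itype in "$@"; do
    arch="$(guess_instance_arch "${itype}")"
    if [ "${arch}" != "${want}" ]; then
      echo "mismatch: ${itype} looks like ${arch}, AMI is ${want}"
      status=1
    fi
  done
  return "${status}"
}

# Usage with the instance types from the config in this thread:
#   check_nodegroup_arch x86_64 t4g.xlarge t3.xlarge t2.xlarge
```

For an authoritative answer instead of a guess, `aws ec2 describe-instance-types --instance-types t4g.xlarge t3.xlarge t2.xlarge --query 'InstanceTypes[].[InstanceType,ProcessorInfo.SupportedArchitectures]'` returns the architectures each type supports.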