amazon-eks-ami

Node not joining cluster: kubelet failure, missing /etc/kubernetes/pki/ca.crt

Airwise opened this issue 4 years ago • 2 comments

What happened: Deployed an EKS cluster with 3 worker nodes via CloudFormation. 2 nodes joined the cluster, 1 did not. Restarting the instance had no effect. Terminating the instance once and letting the ASG spin up a new one had no effect. Terminating that subsequent instance resulted in a new instance that did register with the cluster.

What you expected to happen: Expected all 3 original nodes to join the cluster.

How to reproduce it (as minimally and precisely as possible): Unfortunately, as described, this appears to be sporadic. Our CloudFormation template is pretty standard. Excerpt:

  ###
  # This Stack Provisions the EKS Cluster
  ###
  EKSCluster:
    Type: AWS::CloudFormation::Stack
    Properties:
      Parameters:
        MasterStackName: !Ref MasterStackName
      TemplateURL: './EKS-Cluster.yml'
      TimeoutInMinutes: '15'

  ###
  # This Stack Provisions the EKS Worker Nodes
  ###
  EKSNodes:
    Type: AWS::CloudFormation::Stack
    DependsOn: EKSCluster
    Properties:
      Parameters:
        MasterStackName: !Ref MasterStackName
        NodeImageId: !Ref NodeImageId
        NodeInstanceType: !Ref NodeInstanceType
        NodeGroupName: !Ref NodeGroupName
        EKSClusterName: !Sub '${EKSCluster.Outputs.EKSClusterName}'
        KeyName: !Ref KeyName
        SQSInterfaceKMSKeyArn: !Ref SQSInterfaceKMSKeyArn
        NodeAutoScalingGroupMinSize: !Ref NodeAutoScalingGroupMinSize
        NodeAutoScalingGroupMaxSize: !Ref NodeAutoScalingGroupMaxSize
        EKSClusterControlPlaneSecurityGroup: !Sub '${EKSCluster.Outputs.EKSClusterControlPlaneSecurityGroup}'
        R53SubDomain: !Ref R53SubDomain
        R53RootDomain: !Ref R53RootDomain
        VpcId: !Ref VpcId
        EnvType: !Ref EnvType
        UseKeystoreBucket: !Ref UseKeystoreBucket
        KeystoreAccount: !Ref KeystoreAccount
        ClassB: !Ref ClassB
      TemplateURL: './EKS-Nodes.yml'
      TimeoutInMinutes: '15'

The cluster looks like:

  Cluster:
    Type: "AWS::EKS::Cluster"
    Properties:
      Version: "1.13"
      RoleArn: !GetAtt ClusterRole.Arn
      ResourcesVpcConfig:
        SecurityGroupIds:
          - !Ref ClusterControlPlaneSecurityGroup
        SubnetIds:
          - Fn::ImportValue:
              !Sub "${MasterStackName}-SubnetAPrivate"
          - Fn::ImportValue:
              !Sub "${MasterStackName}-SubnetBPrivate"
          - Fn::ImportValue:
              !Sub "${MasterStackName}-SubnetCPrivate"
          - Fn::ImportValue:
              !Sub "${MasterStackName}-SubnetAPublic"
          - Fn::ImportValue:
              !Sub "${MasterStackName}-SubnetBPublic"
          - Fn::ImportValue:
              !Sub "${MasterStackName}-SubnetCPublic"

The node group and launch configuration look like:

  NodeGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      DesiredCapacity: !If [ IsProdEnv, 5, 3 ]
      LaunchConfigurationName: !Ref NodeLaunchConfig
      MinSize: !Ref NodeAutoScalingGroupMinSize
      MaxSize: !Ref NodeAutoScalingGroupMaxSize
      VPCZoneIdentifier:
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetAPrivate"
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetBPrivate"
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetCPrivate"
      Tags:
      - Key: Name
        Value: !Sub "${EKSClusterName}-${NodeGroupName}-Node"
        PropagateAtLaunch: 'true'
      - Key: !Sub 'kubernetes.io/cluster/${EKSClusterName}'
        Value: 'owned'
        PropagateAtLaunch: 'true'
      - Key: k8s.io/cluster-autoscaler/enabled
        Value: ''
        PropagateAtLaunch: 'true'
      - Key: !Sub 'k8s.io/cluster-autoscaler/${EKSClusterName}'
        Value: ''
        PropagateAtLaunch: 'true'
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MaxBatchSize: 1
        MinInstancesInService: !If [ IsProdEnv, 5, 3 ]
        PauseTime: PT5M

  NodeLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      AssociatePublicIpAddress: 'false'
      IamInstanceProfile: !Ref NodeInstanceProfile
      ImageId: !Ref NodeImageId
      InstanceType: !Ref NodeInstanceType
      KeyName: !Ref KeyName
      SecurityGroups:
        - !Ref NodeSecurityGroup
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            VolumeSize: !Ref NodeVolumeSize
            VolumeType: gp2
            DeleteOnTermination: true
      UserData:
        Fn::Base64:
          !Sub |
            Content-Type: multipart/mixed; boundary="==BOUNDARY=="
            MIME-Version: 1.0
            --==BOUNDARY==
            Content-Type: text/cloud-boothook; charset="us-ascii"

            # Set the proxy hostname and port
            PROXY="proxy.${R53SubDomain}.${R53RootDomain}:3128"

            # Create the docker systemd directory
            mkdir -p /etc/systemd/system/docker.service.d

            # Configure yum to use the proxy
            cat << EOF >> /etc/yum.conf
            proxy=http://$PROXY
            EOF

            # Set the proxy for future processes, and use as an include file
            cat << EOF >> /etc/environment
            http_proxy=http://$PROXY
            https_proxy=http://$PROXY
            HTTP_PROXY=http://$PROXY
            HTTPS_PROXY=http://$PROXY
            no_proxy=10.${ClassB}.0.0/16,localhost,127.0.0.1,169.254.169.254,.internal
            NO_PROXY=10.${ClassB}.0.0/16,localhost,127.0.0.1,169.254.169.254,.internal
            EOF

            # Configure docker with the proxy
            cat << EOF >> /etc/systemd/system/docker.service.d/proxy.conf
            [Service]
            EnvironmentFile=/etc/environment
            EOF


            # Configure the kubelet with the proxy
            cat << EOF >> /etc/systemd/system/kubelet.service.d/proxy.conf
            [Service]
            EnvironmentFile=/etc/environment
            EOF

            --==BOUNDARY==
            Content-Type: text/x-shellscript; charset="us-ascii"

            #!/bin/bash
            set -o xtrace

            # Set the proxy variables before running the bootstrap.sh script
            set -a
            source /etc/environment

            /etc/eks/bootstrap.sh ${EKSClusterName} ${BootstrapArguments}
            /opt/aws/bin/cfn-signal --exit-code $? \
                      --stack  ${AWS::StackName} \
                      --resource NodeGroup  \
                      --region ${AWS::Region}

            --==BOUNDARY==--
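
One thing worth checking on a node that fails like this is whether the boothook drop-ins above were actually written; note that the boothook creates the docker drop-in directory explicitly but assumes /etc/systemd/system/kubelet.service.d already exists. A quick sanity-check sketch:

  # confirm the proxy drop-ins written by the boothook exist and are picked up
  cat /etc/systemd/system/docker.service.d/proxy.conf
  cat /etc/systemd/system/kubelet.service.d/proxy.conf
  systemctl cat docker kubelet    # drop-ins are listed under each unit file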

So far this has happened on 1 out of 4 attempts to stand up a cluster with more than 1 node.

Anything else we need to know?: The kubelet logs show:

kubelet[12504]: F0327 21:30:29.616103   12504 server.go:244] unable to load client CA file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no such file or directory)
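
For context, /etc/kubernetes/pki/ca.crt is normally written by /etc/eks/bootstrap.sh from the cluster's certificate authority data, so its absence usually means the bootstrap script never ran or exited early. A minimal check, with the cluster name as a placeholder:

  # on the node: the CA file should exist once bootstrap.sh has run
  ls -l /etc/kubernetes/pki/ca.crt

  # the same CA can be pulled from the control plane for comparison
  aws eks describe-cluster --name my-cluster \
    --query 'cluster.certificateAuthority.data' --output text | base64 -d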

This may be a red herring, but I also noticed that the problem node had a cloud-init process that failed to complete. It was still stuck in the final module hours after the node came up, when I went in to debug.
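
For anyone hitting the same thing, the stuck stage can be inspected on the node with something like:

  cloud-init status --long
  journalctl -u cloud-init.service -u cloud-final.service --no-pager | tail -n 100
  # per-stage output also lands in:
  cat /var/log/cloud-init-output.log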

I've also posted about a similar issue here: https://forums.aws.amazon.com/thread.jspa?messageID=937703&#937703

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): t2.medium
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.8
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.13
  • AMI Version: amazon-eks-node-1.13-v20190701 (ami-0f2e8e5663e16b436)
  • Kernel (e.g. uname -a): Linux 4.14.128-112.105.amzn2.x86_64
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-052b172b0a9552df4"
BUILD_TIME="Mon Jul  1 21:38:37 UTC 2019"
BUILD_KERNEL="4.14.123-111.109.amzn2.x86_64"
ARCH="x86_64"

Airwise commented on Mar 28 '20

I faced a similar issue.

I can confirm that it fails when cloud-init fails.

journalctl -u cloud-final.service | cat

# ... a custom bootstrap script successfully finished 
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Ncat: Connection refused.
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Apr 16 03:23:28 cloud-init[4607]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Apr 16 03:23:28 cloud-init[4607]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Apr 16 03:23:28 cloud-init[4607]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]:
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: #############################################################
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: -----BEGIN SSH HOST KEY FINGERPRINTS-----
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: 256 SHA256:dmw4QLBRS...ccLadpNXQs no comment (ECDSA)
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: 256 SHA256:dnt+0Uby3...JUZ844 no comment (ED25519)
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: 2048 SHA256:zhvc...zB16peU no comment (RSA)
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Cloud-init v. 19.3-43.amzn2 finished at Fri, 16 Apr 2021 03:23:28 +0000. Datasource DataSourceEc2.  Up 129.42 seconds
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: cloud-final.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: Failed to start Execute cloud user/final scripts.
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: Unit cloud-final.service entered failed state.
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: cloud-final.service failed.

AMI based on amazon-eks-node-1.19-v2021032

I also found this issue where people recommend disabling the firewall, which makes me think it is somehow network-related. I also see the Ncat: Connection refused error in the log above, but all of these are just guesses.
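
One rough way to rule the network in or out is to test reachability of the cluster API endpoint from the node; the lookup below is only a sketch and the cluster name is a placeholder:

  ENDPOINT=$(aws eks describe-cluster --name my-cluster \
    --query 'cluster.endpoint' --output text)
  nc -vz "${ENDPOINT#https://}" 443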

Update: I think I found my issue. There was a mistake in /etc/eksctl/kubelet-config.json where one of its parameters was configured in quotes (a string) when kubelet expected an int32. How did I get the missing /etc/kubernetes/pki/ca.crt error in the first place? Because I copied the wrong launch command from /etc/systemd/system/kubelet.service while trying to debug the issue by starting kubelet manually.
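
To illustrate the kind of mismatch (maxPods is just a hypothetical example of an int32 kubelet field, not necessarily the one that was wrong here), the JSON type of a field can be checked with jq on the node:

  # expect "number"; a quoted value reports "string" and kubelet will reject the config
  jq '.maxPods | type' /etc/eksctl/kubelet-config.json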

So, after all, it was all my fault.

KIVagant commented on Apr 16 '21

I'm facing the same problem using Terraform to deploy my EKS cluster. Because of it, new nodes can't register with the cluster, and I'm currently trying to understand why we are hitting this problem only now.

One thing I did just to let the whole user data script run was to remove the -e flag from the #!/bin/bash line, as some people recommend when facing issues like this in user data (a rough sketch of the change is below). Nodes can now join the cluster, but I'm pretty sure this is not the best approach.
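
For clarity, the change being described is roughly this, assuming the usual #!/bin/bash -xe header in the user data; dropping -e only stops bash from aborting on the first failing command, so /etc/eks/bootstrap.sh and cfn-signal still run, but the underlying error is masked rather than fixed:

  # original header: abort the user data script on the first error
  #!/bin/bash -xe

  # workaround header: keep going past errors so bootstrap.sh still runs
  #!/bin/bash -x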

orona-yuca commented on Dec 04 '21