amazon-eks-ami
Node not joining cluster: kubelet failure - missing /etc/kubernetes/pki/ca.crt
What happened: Deployed an EKS cluster with 3 worker nodes via CloudFormation. 2 nodes joined the cluster, 1 did not. Restarting the instance had no effect. Terminating the instance and letting the ASG spin up a new one had no effect. Terminating that subsequent instance resulted in a new instance that did register with the cluster.
What you expected to happen: Expected all 3 original nodes to join cluster.
How to reproduce it (as minimally and precisely as possible): Unfortunately, as described, this appears to be sporadic. Our CloudFormation template is pretty standard. Excerpt:
###
# This Stack Provisions the EKS Cluster
###
EKSCluster:
  Type: AWS::CloudFormation::Stack
  Properties:
    Parameters:
      MasterStackName: !Ref MasterStackName
    TemplateURL: './EKS-Cluster.yml'
    TimeoutInMinutes: '15'
###
# This Stack Provisions the EKS Worker Nodes
###
EKSNodes:
  Type: AWS::CloudFormation::Stack
  DependsOn: EKSCluster
  Properties:
    Parameters:
      MasterStackName: !Ref MasterStackName
      NodeImageId: !Ref NodeImageId
      NodeInstanceType: !Ref NodeInstanceType
      NodeGroupName: !Ref NodeGroupName
      EKSClusterName: !Sub '${EKSCluster.Outputs.EKSClusterName}'
      KeyName: !Ref KeyName
      SQSInterfaceKMSKeyArn: !Ref SQSInterfaceKMSKeyArn
      NodeAutoScalingGroupMinSize: !Ref NodeAutoScalingGroupMinSize
      NodeAutoScalingGroupMaxSize: !Ref NodeAutoScalingGroupMaxSize
      EKSClusterControlPlaneSecurityGroup: !Sub '${EKSCluster.Outputs.EKSClusterControlPlaneSecurityGroup}'
      R53SubDomain: !Ref R53SubDomain
      R53RootDomain: !Ref R53RootDomain
      VpcId: !Ref VpcId
      EnvType: !Ref EnvType
      UseKeystoreBucket: !Ref UseKeystoreBucket
      KeystoreAccount: !Ref KeystoreAccount
      ClassB: !Ref ClassB
    TemplateURL: './EKS-Nodes.yml'
    TimeoutInMinutes: '15'
Cluster looks like:
Cluster:
  Type: "AWS::EKS::Cluster"
  Properties:
    Version: "1.13"
    RoleArn: !GetAtt ClusterRole.Arn
    ResourcesVpcConfig:
      SecurityGroupIds:
        - !Ref ClusterControlPlaneSecurityGroup
      SubnetIds:
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetAPrivate"
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetBPrivate"
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetCPrivate"
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetAPublic"
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetBPublic"
        - Fn::ImportValue:
            !Sub "${MasterStackName}-SubnetCPublic"
Node group and launch configuration look like:
NodeGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    DesiredCapacity: !If [ IsProdEnv, 5, 3 ]
    LaunchConfigurationName: !Ref NodeLaunchConfig
    MinSize: !Ref NodeAutoScalingGroupMinSize
    MaxSize: !Ref NodeAutoScalingGroupMaxSize
    VPCZoneIdentifier:
      - Fn::ImportValue:
          !Sub "${MasterStackName}-SubnetAPrivate"
      - Fn::ImportValue:
          !Sub "${MasterStackName}-SubnetBPrivate"
      - Fn::ImportValue:
          !Sub "${MasterStackName}-SubnetCPrivate"
    Tags:
      - Key: Name
        Value: !Sub "${EKSClusterName}-${NodeGroupName}-Node"
        PropagateAtLaunch: 'true'
      - Key: !Sub 'kubernetes.io/cluster/${EKSClusterName}'
        Value: 'owned'
        PropagateAtLaunch: 'true'
      - Key: k8s.io/cluster-autoscaler/enabled
        Value: ''
        PropagateAtLaunch: 'true'
      - Key: !Sub 'k8s.io/cluster-autoscaler/${EKSClusterName}'
        Value: ''
        PropagateAtLaunch: 'true'
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MaxBatchSize: 1
      MinInstancesInService: !If [ IsProdEnv, 5, 3 ]
      PauseTime: PT5M
NodeLaunchConfig:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    AssociatePublicIpAddress: 'false'
    IamInstanceProfile: !Ref NodeInstanceProfile
    ImageId: !Ref NodeImageId
    InstanceType: !Ref NodeInstanceType
    KeyName: !Ref KeyName
    SecurityGroups:
      - !Ref NodeSecurityGroup
    BlockDeviceMappings:
      - DeviceName: /dev/xvda
        Ebs:
          VolumeSize: !Ref NodeVolumeSize
          VolumeType: gp2
          DeleteOnTermination: true
    UserData:
      Fn::Base64:
        !Sub |
          Content-Type: multipart/mixed; boundary="==BOUNDARY=="
          MIME-Version: 1.0

          --==BOUNDARY==
          Content-Type: text/cloud-boothook; charset="us-ascii"

          # Set the proxy hostname and port
          PROXY="proxy.${R53SubDomain}.${R53RootDomain}:3128"
          # Create the docker systemd directory
          mkdir -p /etc/systemd/system/docker.service.d
          # Configure yum to use the proxy
          cat << EOF >> /etc/yum.conf
          proxy=http://$PROXY
          EOF
          # Set the proxy for future processes, and use as an include file
          cat << EOF >> /etc/environment
          http_proxy=http://$PROXY
          https_proxy=http://$PROXY
          HTTP_PROXY=http://$PROXY
          HTTPS_PROXY=http://$PROXY
          no_proxy=10.${ClassB}.0.0/16,localhost,127.0.0.1,169.254.169.254,.internal
          NO_PROXY=10.${ClassB}.0.0/16,localhost,127.0.0.1,169.254.169.254,.internal
          EOF
          # Configure docker with the proxy
          cat << EOF >> /etc/systemd/system/docker.service.d/proxy.conf
          [Service]
          EnvironmentFile=/etc/environment
          EOF
          # Configure the kubelet with the proxy
          cat << EOF >> /etc/systemd/system/kubelet.service.d/proxy.conf
          [Service]
          EnvironmentFile=/etc/environment
          EOF

          --==BOUNDARY==
          Content-Type: text/x-shellscript; charset="us-ascii"

          #!/bin/bash
          set -o xtrace
          # Set the proxy variables before running the bootstrap.sh script
          set -a
          source /etc/environment
          /etc/eks/bootstrap.sh ${EKSClusterName} ${BootstrapArguments}
          /opt/aws/bin/cfn-signal --exit-code $? \
            --stack ${AWS::StackName} \
            --resource NodeGroup \
            --region ${AWS::Region}

          --==BOUNDARY==--
So far this has happened in 1 out of 4 attempts to stand up clusters with more than one node.
Anything else we need to know?: kubelet logs show:
kubelet[12504]: F0327 21:30:29.616103 12504 server.go:244] unable to load client CA file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no such file or directory)
May be a false flag, but I also noticed that the problem node had a cloud-init process that failed to complete; it was still running, stuck in the final module, hours after the node came up, when I went in to debug.
I've also posted about a similar issue here: https://forums.aws.amazon.com/thread.jspa?messageID=937703
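For reference, these are the checks that show whether the user data ever finished on a stuck node (just a sketch; on the stock EKS AMI it is bootstrap.sh that decodes the cluster CA into /etc/kubernetes/pki/ca.crt, so if cloud-init never completed, that file never appears):

# Did cloud-init and the user data actually finish?
cloud-init status --long
sudo journalctl -u cloud-final.service --no-pager | tail -n 50
sudo tail -n 100 /var/log/cloud-init-output.log

# Did bootstrap.sh get far enough to write the CA and kubeconfig?
ls -l /etc/kubernetes/pki/ca.crt /var/lib/kubelet/kubeconfig
sudo journalctl -u kubelet --no-pager | tail -n 30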
Environment:
- AWS Region: us-east-1
- Instance Type(s): t2.medium
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.8
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.13
- AMI Version: amazon-eks-node-1.13-v20190701 (ami-0f2e8e5663e16b436)
- Kernel (e.g. uname -a): Linux 4.14.128-112.105.amzn2.x86_64
- Release information (run cat /etc/eks/release on a node):
  BASE_AMI_ID="ami-052b172b0a9552df4"
  BUILD_TIME="Mon Jul 1 21:38:37 UTC 2019"
  BUILD_KERNEL="4.14.123-111.109.amzn2.x86_64"
  ARCH="x86_64"
Faced a similar issue. Can confirm that it fails when cloud-init fails.
journalctl -u cloud-final.service |cat
# ... a custom bootstrap script successfully finished
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Ncat: Connection refused.
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Apr 16 03:23:28 cloud-init[4607]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Apr 16 03:23:28 cloud-init[4607]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Apr 16 03:23:28 cloud-init[4607]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]:
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: #############################################################
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: -----BEGIN SSH HOST KEY FINGERPRINTS-----
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: 256 SHA256:dmw4QLBRS...ccLadpNXQs no comment (ECDSA)
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: 256 SHA256:dnt+0Uby3...JUZ844 no comment (ED25519)
Apr 16 03:23:28 host-name-eks-1111 ec2[8894]: 2048 SHA256:zhvc...zB16peU no comment (RSA)
Apr 16 03:23:28 host-name-eks-1111 cloud-init[4607]: Cloud-init v. 19.3-43.amzn2 finished at Fri, 16 Apr 2021 03:23:28 +0000. Datasource DataSourceEc2. Up 129.42 seconds
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: cloud-final.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: Failed to start Execute cloud user/final scripts.
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: Unit cloud-final.service entered failed state.
Apr 16 03:23:28 host-name-eks-1111 systemd[1]: cloud-final.service failed.
AMI based on amazon-eks-node-1.19-v2021032
I also found this issue where people recommend disabling the firewall, which makes me think that it is somehow network-related. Plus I see the Ncat: Connection refused error in the log above, but all of these are just random guesses.
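One way to narrow it down is to re-run the part that cloud-init reported as failing and watch where it stops (a sketch; the script path comes straight from the util.py warning above, and the host/port are whatever your custom script was trying to reach):

# Re-run the failed user-data script with tracing enabled
sudo bash -x /var/lib/cloud/instance/scripts/runcmd

# If it really is network-related, probe the endpoint Ncat was refusing
nc -vz <host> <port>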
Update:
I guess I found my issue. There was a mistake in /etc/eksctl/kubelet-config.json where one of its parameters was configured in quotes (string) when kubelet expected an int32. How did I get the missing /etc/kubernetes/pki/ca.crt error in the first place? It is because I copied the wrong launch command from /etc/systemd/system/kubelet.service when I was trying to debug the issue by starting it manually.
So, after all, it was all my fault.
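If anyone wants to check their node for the same class of mistake, a quick sanity check looks like this (a sketch; maxPods is only a hypothetical example of a field kubelet expects as an int32, and it assumes jq is installed on the node):

# Wrong: "maxPods": "110"      Right: "maxPods": 110
# Print the JSON type of the suspect field; it should be "number", not "string"
jq '.maxPods | type' /etc/eksctl/kubelet-config.json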
I'm facing the same problem using Terraform to deploy my EKS cluster. Because of that, new nodes can't register with the cluster, and I'm currently trying to understand why we are facing this problem only now.
One thing that I did just to get the whole user data script to run was to remove the -e flag from the #!/bin/bash line, as some people recommend when facing issues like this in the user data. And right now nodes can join the cluster, but I'm pretty sure that this is not the best approach.
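If you would rather keep the script strict, a possible middle ground (just a sketch; the cluster name is a placeholder for whatever arguments you already pass to bootstrap.sh) is to let the one step that might fail report its status instead of silently killing the rest of the user data:

#!/bin/bash
set -euo pipefail

CLUSTER_NAME="my-cluster"   # placeholder: use your real cluster name / bootstrap arguments

# Running bootstrap.sh inside an if-condition means set -e does not abort the
# script when it fails, but the failure is still visible and handled explicitly.
if ! /etc/eks/bootstrap.sh "${CLUSTER_NAME}"; then
  echo "bootstrap.sh failed; see /var/log/cloud-init-output.log" >&2
  # Optionally shut down so the ASG replaces the broken node:
  # shutdown -h now
  exit 1
fi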