AL2023 - PrivateDNSName regression
What happened:
With the new AL2023 NodeConfig system, private DNS names appear to cause problems for new nodes (previously reported in #1263 and fixed in #1264). Our VPC uses a DHCP options set to specify a custom domain name, and this prevents nodes from joining the cluster.
What you expected to happen:
Nodes can join the cluster successfully after launch
How to reproduce it (as minimally and precisely as possible):
#1263 gives great reproduction steps
For me, just launching new AL2023 nodes in a VPC with DHCP that sets a domain name causes these logs from kubelet:
"Attempting to register node" node="ip-172-16-0-100.domain.com"
"Unable to register node with API server" err="nodes \"ip-172-16-0-100.domain.com\" is forbidden: node \"ip-172-16-0-100.ec2.internal\" is not allowed to modify node \"ip-172-16-0-100.domain.com\"" node="ip-172-16-0-100.domain.com"
"Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-172-16-0-100.domain.com\" not found"
Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-172-16-0-100.domain.com" is forbidden: User "system:node:ip-172-16-0-100.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
Anything else we need to know?:
Erroneously reported this here: https://github.com/aws/karpenter-provider-aws/issues/5793
Other similar issues: https://github.com/awslabs/amazon-eks-ami/issues/1376 https://github.com/awslabs/amazon-eks-ami/issues/1457
Environment:
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
- AMI Version: ami-0552b3e5085247f36 amazon-eks-node-al2023-x86_64-standard-1.29-v20240227
- Kernel (e.g. uname -a): 6.1.77-99.164.amzn2023.x86_64
- Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0a56ce835d6f72c8e"
BUILD_TIME="Tue Feb 27 23:51:40 UTC 2024"
BUILD_KERNEL="6.1.77-99.164.amzn2023.x86_64"
ARCH="x86_64"
I use Karpenter to launch nodes. Is there a way to patch the user data with a blend of bash and the new NodeConfig?
# doesn't actually work
userData: |
  apiVersion: node.eks.aws/v1alpha1
  kind: NodeConfig
  spec:
    kubelet:
      flags:
        - --hostname-override=$(aws ec2 describe-instances --instance-ids $(imds /latest/meta-data/instance-id) --query 'Reservations[].Instances[].PrivateDnsName' --output text)
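On the question of blending bash with NodeConfig: the snippet above fails because NodeConfig is plain YAML, so the $(...) is never evaluated by a shell. As far as I understand the nodeadm docs, user data can instead be a MIME multi-part document with one application/node.eks.aws part and one text/x-shellscript part. Here is a rough sketch of a generator for that layout. The render_user_data helper, the BOUNDARY value, the --node-labels flag, and the echo line are all placeholders of mine, and I have not verified the ordering between the script part and kubelet startup.

```shell
#!/usr/bin/env bash
# Sketch: print MIME multi-part user data with a NodeConfig part and a
# shell-script part. render_user_data is a hypothetical helper; the
# boundary, the kubelet flag, and the script body are illustrative only.
render_user_data() {
  local boundary="${1:-BOUNDARY}"
  cat <<EOF
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="${boundary}"

--${boundary}
Content-Type: application/node.eks.aws

apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  kubelet:
    flags:
      - --node-labels=example=true

--${boundary}
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
echo "shell commands go here, outside the NodeConfig document"

--${boundary}--
EOF
}
```

Calling render_user_data with no arguments prints the document with the default boundary. Anything that needs shell evaluation belongs in the script part, while static kubelet settings stay in the NodeConfig part.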
Sorry about this. From the beginning, we intended to change the naming convention for nodes in AL2023 to use instance IDs instead of the PrivateDnsName. This had some downstream effects and didn't ultimately make the cut (though we intend to make it opt-in soon). I'll get a PR up to address this.
Now that the fix for this issue has been merged, how long before we can expect to see it released? We're itching to get AL2023 nodes running in our EKS cluster.
btw, I ran into this same issue and found that if you set the hostname like this:
# IMDSv2: request a short-lived session token, then read the region and the IP-based hostname
TOKEN=$(curl --request PUT "http://169.254.169.254/latest/api/token" --header "X-aws-ec2-metadata-token-ttl-seconds: 10")
REGION=$(curl http://169.254.169.254/latest/meta-data/placement/region --header "X-aws-ec2-metadata-token: $TOKEN")
IP_BASED_NAME=$(curl http://169.254.169.254/latest/meta-data/hostname --header "X-aws-ec2-metadata-token: $TOKEN" | cut -f1 -d".")
# Note: in us-east-1 the private DNS suffix is ec2.internal, not $REGION.compute.internal
hostnamectl set-hostname --static "$IP_BASED_NAME.$REGION.compute.internal"
in your user data, you should be able to get your instance working in the meantime, so you can test before the patch is out.
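One caveat with the workaround above: it hard-codes the compute.internal suffix, but the logs in the original report show ec2.internal, which is what us-east-1 uses. A minimal sketch of a helper that covers both cases, assuming the default IP-based naming (not resource-based naming). The private_dns_name function is a hypothetical name of mine, not part of any AWS tooling.

```shell
#!/usr/bin/env bash
# Sketch: build the default IP-based private DNS name for an instance.
# private_dns_name is a hypothetical helper; it assumes IP-based naming.
private_dns_name() {
  local ip="$1" region="$2"
  # 172.16.0.100 becomes ip-172-16-0-100
  local host="ip-${ip//./-}"
  if [ "$region" = "us-east-1" ]; then
    # us-east-1 uses the ec2.internal suffix
    echo "${host}.ec2.internal"
  else
    echo "${host}.${region}.compute.internal"
  fi
}
```

For example, private_dns_name 172.16.0.100 us-east-1 gives ip-172-16-0-100.ec2.internal, which matches the node identity in the kubelet errors from the report. The result could then be passed to hostnamectl set-hostname --static in place of the hard-coded suffix.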
The fix will be released in the next AMI release: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240315
Confirmed this fix is working with the following image:
ami_id = ami-07acdbd513e154aa8
image_name = amazon-eks-node-al2023-x86_64-standard-1.27-v20240315