amazon-eks-ami
AWS EKS - remote error: tls: internal error - CSR pending
What happened: We have an EKS cluster deployed with managed nodes. When we try to run kubectl logs or kubectl exec, we get Error from server: error dialing backend: remote error: tls: internal error. In the admin console, all the Nodes and Workloads show as ready. When I run kubectl get csr, all requests show as Pending. I then described one of the CSRs and the details look correct. Please refer to the output below:
Name: csr-zz882
Labels: <none>
Annotations: <none>
CreationTimestamp: Sat, 13 Feb 2021 15:03:31 +0000
Requesting User: system:node:ip-192-168-33-152.ec2.internal
Signer: kubernetes.io/kubelet-serving
Status: Pending
Subject:
Common Name: system:node:ip-192-168-33-152.ec2.internal
Serial Number:
Organization: system:nodes
Subject Alternative Names:
DNS Names: ec2-3-239-231-25.compute-1.amazonaws.com
ip-192-168-33-152.ec2.internal
IP Addresses: 192.168.33.152
3.239.231.25
Events: <none>
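For anyone triaging the same symptom, this is roughly how the state above was surfaced (a minimal sketch; POD_NAME, NAMESPACE, and the CSR name are placeholders):
# kubectl logs / exec fail while dialing the kubelet
kubectl logs POD_NAME -n NAMESPACE
kubectl exec -it POD_NAME -n NAMESPACE -- sh
# List the certificate signing requests and inspect one stuck in Pending
kubectl get csr
kubectl describe csr csr-zz882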
Anything else we need to know?: This issue appeared suddenly. Our guess is that it started after scaling.
Environment:
- AWS Region: North Virginia
- Instance Type(s): m5.large
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.18
- AMI Version: AL2_x86_64
- Kernel (e.g. uname -a): Linux ip-192-168-33-152.ec2.internal 4.14.214-160.339.amzn2.x86_64 #1 SMP Sun Jan 10 05:53:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
  BASE_AMI_ID="ami-002eb42333992c419"
  BUILD_TIME="Mon Feb 8 20:17:23 UTC 2021"
  BUILD_KERNEL="4.14.214-160.339.amzn2.x86_64"
  ARCH="x86_64"
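The same details can be collected in one pass; a minimal sketch, assuming the cluster is named CLUSTER_NAME and you have shell access to an affected node:
# From a machine with the AWS CLI configured
aws eks describe-cluster --name CLUSTER_NAME --query cluster.platformVersion
aws eks describe-cluster --name CLUSTER_NAME --query cluster.version
# On the affected node itself
uname -a
cat /etc/eks/release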
In order to debug this issue, we will need the cluster ARN. I recommend creating a support case with AWS and providing relevant details there.
To anyone with a similar issue: be aware that AWS will charge you for the support case and may still fail to diagnose or help in any way.
Any update on this? We have experienced this three times now, each time having to delete and recreate the cluster. AWS support couldn't reproduce it on their side, charged us for the support case they never solved, and then asked us to reproduce it for them, giving the following response:
AWS Support:
Also, I've tested it in my cluster by scaling the worker nodes from the eks console but in my case the node was launched successfully.
Therefore, please check once again if you can reproduce this issue, if so please share the steps and the logs/outputs that I've requested in my previous correspondence and I'll investigate this further.
In our case, AWS terminated a node (without notifying or requesting it):
W0708 15:12:35.439299 1 aws.go:1730] the instance i-04c7a**** is terminated
I0708 15:12:35.439314 1 node_lifecycle_controller.go:156] deleting node since it is no longer present in cloud provider: ip-********.eu-west-1.compute.internal
The node that came back up started with the TLS issue, brought down parts of our system, and now the cluster is unhealthy again.
CSRs from our nodes have the following auto-approve config:
# Approve renewal CSRs for the group "system:nodes"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auto-approve-renewals-for-nodes
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:certificates.k8s.io:certificatesigningrequests:selfnodeclient
  apiGroup: rbac.authorization.k8s.io
but they remain in the Pending state.
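One thing worth checking when requests stay Pending despite an auto-approve binding is which signer the stuck CSRs actually use; a minimal sketch (the column paths assume the certificates.k8s.io/v1 schema):
# Show each CSR's signer and requestor
kubectl get csr -o custom-columns=NAME:.metadata.name,SIGNER:.spec.signerName,REQUESTOR:.spec.username
# Note: the selfnodeclient ClusterRole in the binding above covers kubelet *client*
# certificate renewals; the Pending requests in this thread use the
# kubernetes.io/kubelet-serving signer, which EKS normally approves from the control plane.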
Just ran into this issue (or something remarkably similar) in my EKS cluster.
Is this an issue with the AMI, or is this a problem in the control plane?
Damn EKS, it has been having these stupid problems for years.
Check that there are no duplicated values in mapRoles and mapUsers in the aws-auth ConfigMap; this was my case.
@Dr4il Thanks for your solution. How does it relate to TLS error? Could you provide some more information about how you found it?
Not exactly the same problem (we got stuck with an "Approved" but not "Issued" status instead of "Pending"), but maybe it can help someone.
In our case, it was just bad timing. In the user data we apply some changes to the containerd config and then restart it. Sometimes the restart happens just after the CSR is created but before the actual certificate gets issued and downloaded by kubelet (a rather small window of a couple of seconds). Restarting containerd seems to cause kubelet to restart as well and create another CSR, and for some reason EKS doesn't issue that second CSR (or any new CSR for that exact node) for about 10 minutes. That causes the "tls: internal error" on some new nodes for roughly 10 minutes.
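A minimal sketch of one way to avoid that race in user data, assuming the kubelet writes its serving certificate to the default /var/lib/kubelet/pki/kubelet-server-current.pem (the path and timeout are assumptions, not something confirmed in this thread):
# Apply the containerd config changes first, then wait for the kubelet serving
# certificate to be issued before restarting containerd (which also restarts kubelet).
timeout=120
until [ -s /var/lib/kubelet/pki/kubelet-server-current.pem ] || [ "$timeout" -le 0 ]; do
  sleep 2
  timeout=$((timeout - 2))
done
systemctl restart containerd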
Thanks @Dr4il for the idea about aws-auth configmap.
@rtripat, to reproduce the issue, here is what I did:
- set a different username for the same rolearn in mapRoles of the aws-auth configmap:
...
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
- "groups":
  - "system:masters"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "test"
...
- recycle one node of the node group associated with INSTANCE_ROLE_ARN
- deploy a pod to that new node and verify the issue
output of kubectl logs POD_NAME -n NAMESPACE:
Error from server: Get "https://X.X.X.X:10250/containerLogs/...": remote error: tls: internal error
output of kubectl get csr -n NAMESPACE:
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-vgrfh 4m42s kubernetes.io/kubelet-serving test <none> Pending
To work around the issue, I put the same username for the same rolearn into the aws-auth configmap:
...
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
- "groups":
  - "system:masters"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
...
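For reference, the mapping can be edited in place or managed with eksctl; a minimal sketch (CLUSTER_NAME, REGION, and INSTANCE_ROLE_ARN are placeholders):
# Edit the ConfigMap directly
kubectl -n kube-system edit configmap aws-auth
# Or manage the entry with eksctl so the username and groups stay consistent
eksctl create iamidentitymapping \
  --cluster CLUSTER_NAME \
  --region REGION \
  --arn INSTANCE_ROLE_ARN \
  --username 'system:node:{{EC2PrivateDNSName}}' \
  --group system:bootstrappers \
  --group system:nodes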
Currently I'm also facing this issue.
I had the same issue. In my case, the reason was the custom AMI I switched to; it probably did not support v1.22.
In our case it came down to the "Hostname type" setting on the subnet the nodes are created in. With the same cluster and all the same configs, when the subnet had the setting "Resource name" and nodes got names like "i-0977c7690f78d6d5f.eu-central-1.compute.internal", they were not able to join the cluster properly and hit this error. With the setting changed to "IP name", so nodes get names like "ip-10-1-35-198.eu-central-1.compute.internal", they worked just fine.
You can even see in the kubectl get csr output below traces of both name types attempting to join the cluster, where the second type succeeded and the first did not:
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-kl722 16m kubernetes.io/kubelet-serving system:node:i-0977c7690f78d6d5f.eu-central-1.compute.internal <none> Approved
csr-mbbvp 50s kubernetes.io/kubelet-serving system:node:ip-10-1-35-198.eu-central-1.compute.internal <none> Approved,Issued
csr-r2n4z 16m kubernetes.io/kubelet-serving system:node:i-0f3e6bf012164f037.eu-central-1.compute.internal <none> Approved
csr-tr9b4 57s kubernetes.io/kubelet-serving system:node:ip-10-1-37-252.eu-central-1.compute.internal <none> Approved,Issued
It looks like the AWS-managed control plane does some pattern matching on the certificate DNS names when issuing node certificates, and when the name doesn't match, it just fails.
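A minimal sketch of checking and switching that subnet setting with the AWS CLI (the subnet ID is a placeholder; the option names below are the ones I believe the EC2 API uses for the launch-time hostname type):
# Inspect the hostname type the subnet assigns at launch
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].PrivateDnsNameOptionsOnLaunch'
# Switch back to IP-based names (ip-10-1-35-198...) instead of resource names (i-0977c...)
aws ec2 modify-subnet-attribute --subnet-id subnet-0123456789abcdef0 \
  --private-dns-hostname-type-on-launch ip-name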
In my case, when I ran kubectl get csr, all of them were Pending, so I manually approved one of them like this:
kubectl certificate approve csr-kkz2t
Then this certificate, csr-kkz2t, became Approved,Issued, and kubectl logs and kubectl exec started working.
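If many kubelet-serving CSRs are stuck, they can be approved in bulk with a one-liner along these lines (a sketch; make sure the Pending requests really are legitimate node CSRs before approving them):
# Approve every CSR currently in the Pending state
kubectl get csr --no-headers | awk '$NF == "Pending" {print $1}' | xargs -r kubectl certificate approve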
The subnet "Hostname type" issue described above was our case as well. I was pulling my hair out trying to figure out what suddenly went wrong with our EKS module (we didn't parameterize the DNS name type in our network module), and it turned out someone had updated it to "Resource name". Changing it back to "IP name" fixed everything.
v1.22.9-eks-810597c has the same issue if the subnet's Hostname type value is Resource name.
I got the same issue after updating from EKS 1.24 to 1.25. I fixed it by approving the CSRs manually. Can they be approved automatically?
My issue was resolved after updating the AMI of the EKS cluster node group.
Use the following documentation to get correct AMI for nodegroup: https://docs.aws.amazon.com/eks/latest/userguide/eks-linux-ami-versions.html
This also might be useful for some people: https://docs.aws.amazon.com/eks/latest/userguide/cert-signing.html
We faced this issue as well.
It's related to a lack of IAM permissions for either the cluster or the node_group(s). Also make sure that you have the correct KMS key ARN specified in your cluster IAM policy (if applicable).
In my case, the issue was with the aws-auth ConfigMap. My nodes entry was as follows, which is incorrect:
- groups:
  - system:masters
  rolearn: arn:aws:iam::ACCOUNT_ID:role/IAM_ROLE
  username: XXX
It should look like this:
- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::ACCOUNT_ID:role/IAM_ROLE
  username: system:node:{{EC2PrivateDNSName}}
If you are encountering this issue, there may be some problems with node RBAC in Kubernetes. It's worth checking.
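A quick way to double-check that mapping on a live cluster (nothing cluster-specific here; it just dumps the ConfigMap):
# Verify that the node role maps to username system:node:{{EC2PrivateDNSName}}
# with the groups system:bootstrappers and system:nodes
kubectl -n kube-system get configmap aws-auth -o yaml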