
AWS EKS - remote error: tls: internal error - CSR pending

ebeyonds opened this issue 4 years ago · 14 comments

What happened: We have an EKS cluster deployed with managed nodes. When we try to run kubectl logs or kubectl exec, it fails with Error from server: error dialing backend: remote error: tls: internal error. In the admin console, all the nodes show as Ready and the workloads are ready. I then ran kubectl get csr and it showed all requests as Pending. I described one of the CSRs and the details appear correct. Please refer to the output below:

Name:               csr-zz882
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Sat, 13 Feb 2021 15:03:31 +0000
Requesting User:    system:node:ip-192-168-33-152.ec2.internal
Signer:             kubernetes.io/kubelet-serving
Status:             Pending
Subject:
  Common Name:    system:node:ip-192-168-33-152.ec2.internal
  Serial Number:  
  Organization:   system:nodes
Subject Alternative Names:
         DNS Names:     ec2-3-239-231-25.compute-1.amazonaws.com
                        ip-192-168-33-152.ec2.internal
         IP Addresses:  192.168.33.152
                        3.239.231.25
Events:  <none>
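For reference, the commands that surface the symptom look roughly like this (the pod name and namespace are just placeholders):

kubectl get csr                      # every kubelet-serving CSR shows CONDITION=Pending
kubectl describe csr csr-zz882       # produces the output above
kubectl logs POD_NAME -n NAMESPACE   # fails with: error dialing backend: remote error: tls: internal error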

Anything else we need to know?: This issue appeared suddenly. Our guess is that it started after scaling.

Environment:

  • AWS Region: North Virginia
  • Instance Type(s): m5.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.18
  • AMI Version: AL2_x86_64
  • Kernel (e.g. uname -a):Linux ip-192-168-33-152.ec2.internal 4.14.214-160.339.amzn2.x86_64 #1 SMP Sun Jan 10 05:53:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-002eb42333992c419"
BUILD_TIME="Mon Feb  8 20:17:23 UTC 2021"
BUILD_KERNEL="4.14.214-160.339.amzn2.x86_64"
ARCH="x86_64"

ebeyonds avatar Feb 13 '21 17:02 ebeyonds

In order to debug this issue, we will need the cluster ARN. I recommend creating a support case with AWS and providing relevant details there.

rtripat avatar Feb 15 '21 18:02 rtripat

To anyone with a similar issue, be aware that AWS will charge you for support cases even if they fail to diagnose or help in any way.

Any update on this? We have experienced this three times now, each time having to delete and recreate the cluster. AWS support couldn't reproduce it on their side, charged us for the support case they never solved, and then asked us to reproduce it for them, giving the following response:

AWS Support:

Also, I've tested it in my cluster by scaling the worker nodes from the eks console but in my case the node was launched successfully.

Therefore, please check once again if you can reproduce this issue, if so please share the steps and the logs/outputs that I've requested in my previous correspondence and I'll investigate this further.

In our case, AWS terminated a node (without notifying or requesting it):

W0708 15:12:35.439299 1 aws.go:1730] the instance i-04c7a**** is terminated
I0708 15:12:35.439314 1 node_lifecycle_controller.go:156] deleting node since it is no longer present in cloud provider: ip-********.eu-west-1.compute.internal

The node that came back up started with TLS issue, brought down parts of our system and now the cluster is again unhealthy.

CSRs from nodes have the following auto-approve config:

# Approve renewal CSRs for the group "system:nodes"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auto-approve-renewals-for-nodes
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:certificates.k8s.io:certificatesigningrequests:selfnodeclient
  apiGroup: rbac.authorization.k8s.io

but remain in the pending state.
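One note worth hedging: the binding above targets the selfnodeclient ClusterRole, which only covers a node renewing its own client certificate, while the stuck CSRs here use the kubernetes.io/kubelet-serving signer, so this binding would not auto-approve them. Listing pending CSRs by signer makes the distinction visible (field selectors on spec.signerName should work on recent Kubernetes versions, but verify on yours):

kubectl get csr --field-selector spec.signerName=kubernetes.io/kubelet-serving
# compare with the client-certificate CSRs the binding above actually covers:
kubectl get csr --field-selector spec.signerName=kubernetes.io/kube-apiserver-client-kubelet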

DiarmuidKelly avatar Jul 09 '21 10:07 DiarmuidKelly

Just ran into this issue (or something remarkably similar) in my EKS cluster.

Is this an issue with the AMI, or is this a problem in the control plane?

gabegorelick avatar Aug 16 '21 18:08 gabegorelick

Damn EKS for years it is having stupid problems

arash-bizcover avatar Nov 01 '21 01:11 arash-bizcover

Check that there are no duplicated values in mapRoles and mapUsers in the aws-auth configmap; that was my case.
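For anyone else checking, a quick way to eyeball the map (nothing cluster-specific assumed here):

kubectl get configmap aws-auth -n kube-system -o yaml
# look for the same rolearn (or user ARN) appearing more than once in mapRoles/mapUsers,
# especially with different usernames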

Dr4il avatar Dec 07 '21 13:12 Dr4il

@Dr4il Thanks for your solution. How does it relate to TLS error? Could you provide some more information about how you found it?

kishor-u avatar Feb 23 '22 10:02 kishor-u

Not exactly the same problem (instead of "pending" we got stuck with "approved" but not "issued" status) but maybe it can help someone.

In our case, it was just bad timing. In the user data we apply some changes to the containerd config and then restart it. Sometimes the restart happens just after the CSR is created but before the actual certificate gets issued and downloaded by kubelet (a rather small window of a couple of seconds). Restarting containerd also causes kubelet to restart and create another CSR, and for some reason EKS doesn't issue that second CSR (or any new CSR for that exact node) for about 10 minutes. That causes the "tls: internal error" on some new nodes for about 10 minutes.
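A possible mitigation for this race (a rough sketch assuming the stock AL2 EKS AMI; the kubelet cert path below is the upstream kubelet default and may differ on other images): wait in user data for kubelet's serving certificate to actually land on disk before restarting containerd, so the restart cannot land between CSR creation and issuance.

# wait (up to ~5 minutes) for kubelet's serving certificate before bouncing containerd
for _ in $(seq 1 60); do
  [ -s /var/lib/kubelet/pki/kubelet-server-current.pem ] && break
  sleep 5
done
systemctl restart containerd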

stszap avatar Feb 24 '22 10:02 stszap

Thanks @Dr4il for the idea about aws-auth configmap.

@rtripat to reproduce the issue here is what I did:

  • set a different username for the same rolearn in mapRoles of the aws-auth configmap:
...
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
- "groups":
  - "system:masters"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "test"
...
  • recycle one node of the node group associated with INSTANCE_ROLE_ARN

  • deploy a pod to that new node and verify the issue

output of kubectl logs POD_NAME -n NAMESPACE:

Error from server: Get "https://X.X.X.X:10250/containerLogs/...": remote error: tls: internal error

output of kubectl get csr -n NAMESPACE:

NAME        AGE     SIGNERNAME                      REQUESTOR                                                  REQUESTEDDURATION   CONDITION
csr-vgrfh   4m42s   kubernetes.io/kubelet-serving   test                                                       <none>              Pending

To work around the issue, I put the same username for the same rolearn into the aws-auth configmap:

...
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
- "groups":
  - "system:masters"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
...
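After patching aws-auth and recycling the node again, the new CSR should be auto-approved. This is roughly how to verify it (sorting by creation time so the newest CSR is listed last; pod name and namespace are placeholders):

kubectl get csr --sort-by=.metadata.creationTimestamp | tail -n 3   # newest entry should show Approved,Issued
kubectl logs POD_NAME -n NAMESPACE                                  # should work again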

ghost avatar Apr 21 '22 09:04 ghost

Currently I'm also facing this issue.

kiruyoyo avatar Apr 21 '22 16:04 kiruyoyo

I had the same issue. In my case the reason was the custom AMI I had switched to; it probably did not support Kubernetes v1.22.
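A quick sanity check for this case (nothing special assumed, just the default kubectl output columns) is to compare each node's kubelet version against the control plane:

kubectl version            # shows the control plane (server) version
kubectl get nodes -o wide  # VERSION column shows each node's kubelet version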

trallnag avatar May 03 '22 07:05 trallnag

In our case the cause was a difference in the "Hostname type" setting of the subnet the nodes are created in. With the same cluster and all the same configs, when the subnet had the setting set to "Resource name" and the nodes got names like "i-0977c7690f78d6d5f.eu-central-1.compute.internal", they were not able to join the cluster properly and hit this error. With the setting changed to "IP name", so the nodes get names like "ip-10-1-35-198.eu-central-1.compute.internal", they worked just fine.

You can even see traces of both name types trying to join the cluster in the output of kubectl get csr below; the second kind succeeded while the first did not:

NAME        AGE   SIGNERNAME                      REQUESTOR                                                       REQUESTEDDURATION   CONDITION
csr-kl722   16m   kubernetes.io/kubelet-serving   system:node:i-0977c7690f78d6d5f.eu-central-1.compute.internal   <none>              Approved
csr-mbbvp   50s   kubernetes.io/kubelet-serving   system:node:ip-10-1-35-198.eu-central-1.compute.internal        <none>              Approved,Issued
csr-r2n4z   16m   kubernetes.io/kubelet-serving   system:node:i-0f3e6bf012164f037.eu-central-1.compute.internal   <none>              Approved
csr-tr9b4   57s   kubernetes.io/kubelet-serving   system:node:ip-10-1-37-252.eu-central-1.compute.internal        <none>              Approved,Issued

It looks like there is some pattern matching on the certificate DNS names in the AWS-managed control plane when issuing node certificates, and when the name doesn't match it simply fails.
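For reference, this is roughly how the subnet setting can be inspected and flipped back with the AWS CLI (the subnet ID is a placeholder; check the docs for your CLI version):

aws ec2 describe-subnets --subnet-ids subnet-XXXXXXXX \
  --query 'Subnets[0].PrivateDnsNameOptionsOnLaunch'
aws ec2 modify-subnet-attribute --subnet-id subnet-XXXXXXXX \
  --private-dns-hostname-type-on-launch ip-name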

Owersun avatar May 11 '22 13:05 Owersun

In my case, when I ran

kubectl get csr

all of them were Pending, and I manually approved one of them like this:

kubectl certificate approve csr-kkz2t

Then, this certificate csr-kkz2t became Approved,Issued, and kubectl logs and exec started working.
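If there are many stuck CSRs, something along these lines approves all the Pending ones in one go (it parses the default kubectl output, so treat it as a rough sketch, and review what you approve since these certificates grant node identities):

kubectl get csr | awk '$NF=="Pending" {print $1}' | xargs -r kubectl certificate approve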

deenMuhammad avatar May 18 '22 16:05 deenMuhammad

(quoting @Owersun's comment above about the subnet "Hostname type" setting and the two name patterns visible in kubectl get csr)

This was our case. I was pulling my hair out trying to figure out what had suddenly gone wrong with our EKS module (we didn't parameterize the DNS hostname type in our network module), and it turned out someone had changed the hostname type to Resource name. Changing it back to IP name fixed everything.

ntpbnh15 avatar May 26 '22 05:05 ntpbnh15

AMI v1.22.9-eks-810597c has the same issue if the subnet's Hostname type value is set to Resource name.

jumping avatar Jun 28 '22 09:06 jumping

I got the same issue after updating from EKS 1.24 to 1.25. I fixed it by approving the CSRs manually. Can they be approved automatically?

myvodx avatar Mar 03 '23 03:03 myvodx

My issue was resolved after updating the AMI of the EKS cluster node group.

Use the following documentation to get the correct AMI for the node group: https://docs.aws.amazon.com/eks/latest/userguide/eks-linux-ami-versions.html

This might also be useful for some people: https://docs.aws.amazon.com/eks/latest/userguide/cert-signing.html
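For a managed node group, the AMI update itself can be kicked off along these lines (cluster and node group names are placeholders):

aws eks update-nodegroup-version --cluster-name CLUSTER_NAME --nodegroup-name NODEGROUP_NAME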

Shanawar99 avatar Mar 10 '23 22:03 Shanawar99

We faced this issue as well.

It's related to a lack of IAM permissions for either the cluster or the node group(s). Also make sure that you have the correct KMS key ARN specified in your cluster IAM policy (if applicable).
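For example, to check which role a node group uses and which policies are attached to it (names below are placeholders; the exact policies required depend on your setup):

aws eks describe-nodegroup --cluster-name CLUSTER_NAME --nodegroup-name NODEGROUP_NAME --query 'nodegroup.nodeRole'
aws iam list-attached-role-policies --role-name NODE_INSTANCE_ROLE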

ikarlashov avatar Mar 15 '23 15:03 ikarlashov

In my case, the issue was with the aws-auth configmap. My nodes entry was as follows, which is incorrect:

    - groups:
      - system:masters
      rolearn: arn:aws:iam::ACCOUNT_ID:role/IAM_ROLE
      username: XXX

It should look like this:

    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::ACCOUNT_ID:role/IAM_ROLE
      username: system:node:{{EC2PrivateDNSName}}

If you are encountering this issue, there may be some problems with node RBAC in Kubernetes. It's worth checking.
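A quick way to spot this kind of mismatch (the custom columns below are just an example) is to check which username the stuck CSRs were requested with; for kubelet-serving CSRs it should be system:node:<private-dns-name>, not an arbitrary username:

kubectl get csr -o custom-columns=NAME:.metadata.name,SIGNER:.spec.signerName,REQUESTOR:.spec.username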

bhargav2427 avatar Nov 09 '23 04:11 bhargav2427