aws-load-balancer-controller
TLS handshake error
Describe the bug
I get the following errors in the aws-load-balancer-controller pods:
http: TLS handshake error from 172.16.128.63:37094: EOF
http: TLS handshake error from 172.16.116.101:58040: EOF
This happens even if there are no ALB Ingresses (nor NLB Services) handled by the controller. It occurs on an AWS EKS cluster that is fairly large (~100 nodes, ~13k pods). I created two clusters with the same configuration but with no load deployed, and I do not see these errors there.
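As context for "no ALB Ingresses nor NLB Services handled by the controller", a quick way to double-check that from kubectl is sketched below (no controller-specific assumptions, just a listing of Ingresses and LoadBalancer-type Services):

# List all Ingresses and all Services of type LoadBalancer across namespaces
kubectl get ingress --all-namespaces
kubectl get svc --all-namespaces --field-selector spec.type=LoadBalancer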
Steps to reproduce
Install the AWS Load Balancer Controller via Helm chart ver. 1.4.6 with the following values (an example install command follows the values):
resources:
  requests:
    cpu: 100m
    memory: 1Gi
  limits:
    cpu: 200m
    memory: 1Gi
nodeSelector:
  type: privileged
tolerations:
  - key: privileged
    operator: Exists
    effect: NoExecute
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
replicaCount: 2
podDisruptionBudget:
  minAvailable: 1
region: us-east-1
vpcId: <REDACTED>
image:
  repository: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-load-balancer-controller
clusterName: <REDACTED>
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: <REDACTED>
  name: aws-load-balancer-controller
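For completeness, a hedged sketch of how these values could be applied; the eks chart repository, the release name, the target namespace, and the values.yaml file name are assumptions rather than details taken from the report:

# Add the upstream chart repository and install/upgrade the controller with the values above
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  --namespace kube-system \
  --version 1.4.6 \
  -f values.yaml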
Expected outcome
No TLS handshake errors.
Environment
- AWS Load Balancer controller version - 2.4.5 (Helm chart version 1.4.6)
- Kubernetes version - EKS 1.23
@pmichna, are these IPs 172.16.128.63 and 172.16.116.101 assigned to your API servers? The webhook ports require TLS 1.3 and are meant to be accessed only by the API servers.
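As a side note, one way to see how the webhook endpoint responds to a TLS 1.3 handshake is to probe it from somewhere with pod-network reachability (e.g. a worker node or a debug pod). This is only a sketch: 9443 is the controller's default webhook-bind-port, and the kube-system namespace and chart label are assumptions based on a default Helm install.

# Resolve a controller pod IP, then attempt a TLS 1.3 handshake against the webhook port
POD_IP=$(kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-load-balancer-controller \
  -o jsonpath='{.items[0].status.podIP}')
openssl s_client -connect "${POD_IP}:9443" -tls1_3 </dev/null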
@kishorj These IPs are used by ENIs that were not provisioned by us; we suspect they are provisioned by EKS. We can see them in the AWS Console with status "in-use". However, none of the pods and none of the nodes have these IPs. I checked that with:
kubectl get pods --all-namespaces -o wide | grep "172\.16\.128\.63"
kubectl get nodes -o wide | grep "172\.16\.128\.63"
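To attribute such an address to a concrete ENI owner, it should also be possible to query EC2 directly. A sketch follows; the filter name is standard, but the exact Description text AWS puts on EKS-managed ENIs may vary:

# Look up the ENI holding the mystery private IP and inspect who requested/manages it
aws ec2 describe-network-interfaces \
  --filters "Name=addresses.private-ip-address,Values=172.16.128.63" \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Description:Description,Status:Status,RequesterManaged:RequesterManaged}' \
  --output table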
When searching for these IPs in EKS logs we can see this:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| @timestamp | @message |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 2022-12-21 13:37:37.000 | E1221 13:37:37.291530 11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.52.193:33774: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc2e47ee990), encoder:(*versioning.codec)(0xc2554b4280), buf:(*bytes.Buffer)(0xc02372a240)}) |
| 2022-12-21 13:35:34.000 | E1221 13:35:34.407518 11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.38.52:42914: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc300ab6f18), encoder:(*versioning.codec)(0xc192e8ea00), buf:(*bytes.Buffer)(0xc19c9be300)}) |
| 2022-12-21 12:47:32.000 | E1221 12:47:32.871496 11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.20.145:53350: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc3434812f0), encoder:(*versioning.codec)(0xc1f6dd2820), buf:(*bytes.Buffer)(0xc2c74a06c0)}) |
| 2022-12-21 12:16:04.000 | E1221 12:16:04.619496 11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.33.240:40910: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc340ce3068), encoder:(*versioning.codec)(0xc226284460), buf:(*bytes.Buffer)(0xc01d39e1b0)}) |
| 2022-12-21 12:05:35.000 | E1221 12:05:35.879502 11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.65.188:54982: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc2be9850e0), encoder:(*versioning.codec)(0xc0890d8b40), buf:(*bytes.Buffer)(0xc190f4b680)}) |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I have 7 Ingresses set up (4 pointing to an external ALB and 3 to an internal ALB) and I'm seeing similar behavior. I've gotten thousands of these errors over the last 15 days (the furthest back my logs go).
Besides the EOF errors there's a huge number of "read tcp ... connection reset by peer" messages mixed in around the same time.
I'm also using Helm chart version 1.4.6. I'm pretty sure this was happening before that too, because I upgraded from 1.4.5 within the last 15 days and the error messages occurred then as well.
We are on v1.4.8 and started seeing this recently too. Multiple clusters are affected. We run two replicas of the LBC, with not much load and not many pods in these clusters: one internet-facing ALB and one internal ALB.
We are running EKS v1.25 with aws-load-balancer-controller v1.5.1 (deployed via Helm) and the errors are still there; however, they are logged at info level, not error.
For the most part, they don't seem to affect the controller's ability to register targets, but it still looks "scary" to see TLS handshake errors in the logs:
{"level":"info","ts":"2023-04-25T10:34:09Z","msg":"registered targets","arn":"arn:aws:elasticloadbalancing:eu-west-1:$AWS_ACCOUNT_ID:targetgroup/$TG_NAME/$TG_ID"}
2023/04/25 10:34:00 http: TLS handshake error from $KUBE_API_SERVER_IP:49208: EOF
{"level":"info","ts":"2023-04-25T10:34:32Z","msg":"registered targets","arn":"arn:aws:elasticloadbalancing:eu-west-1:$AWS_ACCOUNT_ID:targetgroup/$TG_NAME/$TG_ID"}
2023/04/25 10:35:00 http: TLS handshake error from $KUBE_API_SERVER_IP:42670: EOF
What's strangest is that, on the Kube API side of things, the logs do not show or indicate any communication issue.
I'd expect either the Kube API server or kube-controller-manager logs to produce error messages when they fail to communicate with the aws-load-balancer-controller's webhook service.
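If EKS control plane logging is enabled, one way to search the API server logs for the relevant address from the CLI looks roughly like the sketch below. The log group name follows the usual /aws/eks/<cluster>/cluster convention, the cluster name and IP are placeholders, and GNU date is assumed:

# Query the last hour of control plane logs for messages mentioning the IP from the error
QUERY_ID=$(aws logs start-query \
  --log-group-name "/aws/eks/<CLUSTER_NAME>/cluster" \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @message | filter @message like "<IP_FROM_THE_ERROR>" | sort @timestamp desc' \
  --output text --query queryId)
sleep 10
aws logs get-query-results --query-id "$QUERY_ID"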
Same here. EKS 1.24, AWS LBC 2.5.3
2023/06/27 07:54:00 http: TLS handshake error from 10.112.7.197:49090: EOF
10.112.7.197 is the IP of the EKS API server.
Also seeing the same intermittently.
The aws-load-balancer-controller pod is deployed in a private subnet with a NAT gateway, the security groups allow all ports from the worker nodes to the control plane, and the NACLs have no restrictions.
Restarting the load balancer controller pod addressed the issue; the previous pod was 23 days old.
Hope this helps someone.
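For anyone wanting to try the same workaround, a minimal sketch (the kube-system namespace and the deployment name are assumptions based on a default Helm install):

# Restart the controller pods by rolling the deployment, then wait for them to come back
kubectl -n kube-system rollout restart deployment/aws-load-balancer-controller
kubectl -n kube-system rollout status deployment/aws-load-balancer-controller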
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Also seeing this on two distinct EKS 1.27 clusters, both with many Ingresses. The IP addresses belong to the AWS EKS managed control plane.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
same issue on our end with 1.27
We're seeing this issue too, although Ingresses continue to be created.
Looking at the IPs specified, I suspect they belong to nodes that no longer exist.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Still present.
We have that too. Still an issue. Version: aws-load-balancer-controller:v2.8.1
/remove-lifecycle stale
Still present.
Hi @MartinEmrich!
It would be interesting for us to know which version of the aws-lb-controller you are currently running that causes this issue on your side.
Thanks in advance!
Daniel
@awesomeinsight Currently running 2.8.3.
(Update to 2.9.3 underway with the next EKS update)
Seeing the same issue here. It was happening in version 2.4.5 and is still happening now in 2.10.1.
2024/12/06 11:52:17 http: TLS handshake error from 172.31.214.156:54968: EOF
2024/12/06 12:12:40 http: TLS handshake error from 172.31.214.156:37102: EOF
2024/12/06 12:12:40 http: TLS handshake error from 172.31.214.156:37116: read tcp 172.31.230.150:9443->172.31.214.156:37116: read: connection reset by peer
Not sure what is causing those log entries, but no visible impact has been noticed so far.
The IP address belongs to the EKS control plane. I guess AWS regularly does autoscaling there and replaces the processes that make the calls to the aws-load-balancer-controller. As the control plane is the client, maybe the aws-load-balancer-controller should just not log these errors?
I'm experiencing the same issue using v2.8.1 and see that it's still occurring in recent versions.
The IPs shown in the logs are not assigned to any node or process. In our cluster, nodes are frequently replaced, so, as mentioned earlier in the issue, we think the ALB Ingress Controller might be attempting to send HTTP requests to nodes that have been removed and whose IPs have since changed.
I am experiencing the same issue using v2.11.0.
Hi folks. This error occurs when EKS rotates your API server nodes due to patching or other host-related changes. MartinEmrich is correct: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2914#issuecomment-2523349251.
There should not be any impact from these error messages, although I understand it's frustrating to see them in the logs. If someone can point me to a concrete impact from these log messages, we can prioritize a fix; in the meantime, I don't see us fixing this.
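Until then, one low-effort way to keep these messages out of day-to-day log reviews is to filter them on the client side. A sketch only; the namespace and label selector are assumptions based on a default Helm install:

# Tail the controller logs while dropping the benign handshake noise from rotated control plane nodes
kubectl -n kube-system logs -f -l app.kubernetes.io/name=aws-load-balancer-controller \
  | grep -v "http: TLS handshake error"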