aws-load-balancer-controller

TLS handshake error

Open pmichna opened this issue 2 years ago • 25 comments

Describe the bug: I get the following errors in the aws-load-balancer-controller pods:

http: TLS handshake error from 172.16.128.63:37094: EOF
http: TLS handshake error from 172.16.116.101:58040: EOF

This happens even when there are no ALB Ingresses (and no NLB Services) handled by the controller. The affected AWS EKS cluster is fairly large (~100 nodes, ~13k pods). I created two clusters with the same configuration but with no load deployed, and I don't see these errors there.

Steps to reproduce: Install the AWS Load Balancer Controller via Helm chart version 1.4.6 with the following values:

resources:
  requests:
    cpu: 100m
    memory: 1Gi
  limits:
    cpu: 200m
    memory: 1Gi

nodeSelector:
  type: privileged

tolerations:
  - key: privileged
    operator: Exists
    effect: NoExecute

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

replicaCount: 2

podDisruptionBudget:
  minAvailable: 1

region: us-east-1

vpcId: <REDACTED>

image:
  repository:  602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-load-balancer-controller

clusterName: <REDACTED>

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: <REDACTED>
    name: aws-load-balancer-controller
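For completeness, the install command looked roughly like this (the "eks" repo alias and the values.yaml file name are assumptions):

helm repo add eks https://aws.github.io/eks-charts
helm repo update
# chart version 1.4.6 ships controller v2.4.5; the values above are saved as values.yaml
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  --namespace kube-system \
  --version 1.4.6 \
  -f values.yaml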

Expected outcome: No TLS handshake errors.

Environment

  • AWS Load Balancer controller version - 2.4.5 (Helm chart version 1.4.6)
  • Kubernetes version - EKS 1.23

pmichna avatar Dec 08 '22 13:12 pmichna

@pmichna, are these IPs 172.16.128.63 and 172.16.116.101 assigned to your API servers? The webhook ports require TLS 1.3 and are meant to be accessed only by the API servers.
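A quick way to check (a rough sketch; exact behavior can vary by cluster setup, but on EKS the default kubernetes Endpoints object usually lists the control-plane addresses as seen from your VPC):

# the ENDPOINTS column shows the apiserver addresses behind the in-cluster kubernetes service
kubectl get endpoints kubernetes -n default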

kishorj avatar Dec 14 '22 19:12 kishorj

@kishorj These IPs are used by ENIs that were not provisioned by us; we suspect they are provisioned by EKS. We can see them in the AWS Console and their status is "in-use", yet none of the pods and none of the nodes have these IPs. I checked that with:

kubectl get pods --all-namespaces -o wide | grep "172\.16\.128\.63"
kubectl get nodes -o wide | grep "172\.16\.128\.63"

When searching for these IPs in the EKS control-plane logs, we see entries like these:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       @timestamp        |                                                                                                                                                               @message                                                                                                                                                                |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 2022-12-21 13:37:37.000 | E1221 13:37:37.291530      11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.52.193:33774: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc2e47ee990), encoder:(*versioning.codec)(0xc2554b4280), buf:(*bytes.Buffer)(0xc02372a240)}) |
| 2022-12-21 13:35:34.000 | E1221 13:35:34.407518      11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.38.52:42914: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc300ab6f18), encoder:(*versioning.codec)(0xc192e8ea00), buf:(*bytes.Buffer)(0xc19c9be300)})  |
| 2022-12-21 12:47:32.000 | E1221 12:47:32.871496      11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.20.145:53350: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc3434812f0), encoder:(*versioning.codec)(0xc1f6dd2820), buf:(*bytes.Buffer)(0xc2c74a06c0)}) |
| 2022-12-21 12:16:04.000 | E1221 12:16:04.619496      11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.33.240:40910: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc340ce3068), encoder:(*versioning.codec)(0xc226284460), buf:(*bytes.Buffer)(0xc01d39e1b0)}) |
| 2022-12-21 12:05:35.000 | E1221 12:05:35.879502      11 watch.go:248] unable to encode watch object *v1.WatchEvent: write tcp 172.16.128.63:443->172.16.65.188:54982: write: connection timed out (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc2be9850e0), encoder:(*versioning.codec)(0xc0890d8b40), buf:(*bytes.Buffer)(0xc190f4b680)}) |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
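For anyone else hitting this: the console lookup above can also be done from the CLI (region and credentials assumed); for EKS-managed cross-account ENIs the Description field typically names the cluster:

aws ec2 describe-network-interfaces \
  --filters "Name=addresses.private-ip-address,Values=172.16.128.63" \
  --query 'NetworkInterfaces[].{Description:Description,Status:Status,RequesterId:RequesterId}' \
  --output table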

pmichna avatar Dec 21 '22 14:12 pmichna

I have 7 ingresses (4 pointing to an external ALB and 3 to the internal ALB) set up and I'm seeing similar behavior. I've gotten thousands of these errors over the last 15 days (the longest I can check back in my logs).

Besides the EOF errors, there's a huge number of "read tcp ... connection reset by peer" messages mixed in around the same time.

I'm also using Helm chart version 1.4.6. I'm pretty sure this was happening before that, though, because I updated from 1.4.5 within the last 15 days and the error messages appeared then too.

nickjj avatar Jan 11 '23 16:01 nickjj

we are on v1.4.8 and started seeing this recently too. multiple clusters affected, each running two replicas of the LBC with one internet-facing ALB and one internal ALB. not much load or many pods in these clusters.

FernandoMiguel avatar Mar 08 '23 16:03 FernandoMiguel

We are running EKS v1.25 with aws-load-balancer-controller v1.5.1 (deployed by Helm) and the errors are still there; however, they are written at info level, not error.

For the most part, they don't seem to affect the controller's ability to register targets, but it still looks "scary" to see TLS handshake errors in the logs:

{"level":"info","ts":"2023-04-25T10:34:09Z","msg":"registered targets","arn":"arn:aws:elasticloadbalancing:eu-west-1:$AWS_ACCOUNT_ID:targetgroup/$TG_NAME/$TG_ID"}

2023/04/25 10:34:00 http: TLS handshake error from $KUBE_API_SERVER_IP:49208: EOF

{"level":"info","ts":"2023-04-25T10:34:32Z","msg":"registered targets","arn":"arn:aws:elasticloadbalancing:eu-west-1:$AWS_ACCOUNT_ID:targetgroup/$TG_NAME/$TG_ID"}

2023/04/25 10:35:00 http: TLS handshake error from $KUBE_API_SERVER_IP:42670: EOF

Strangest of all, on the Kube API side of things the logs do not seem to show or indicate any communication issue.

I'd expect either the Kube API server or the kube-controller-manager logs to produce error messages when a call to the AWS Load Balancer Controller's webhook service fails.
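A quick way to gauge how often this happens relative to normal activity (deployment name and namespace assumed to match the default Helm install):

# count handshake errors in the last 24 hours
kubectl -n kube-system logs deploy/aws-load-balancer-controller --since=24h \
  | grep -c "TLS handshake error"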

pisces-period avatar Apr 25 '23 11:04 pisces-period

Same here. EKS 1.24, AWS LBC 2.5.3

2023/06/27 07:54:00 http: TLS handshake error from 10.112.7.197:49090: EOF

10.112.7.197 is the IP of the EKS API server.

alt-dima avatar Jun 27 '23 08:06 alt-dima

Also seeing the same intermittently

jukie avatar Jul 07 '23 21:07 jukie

The aws-load-balancer-controller pod was deployed in a private subnet with a NAT gateway, the security groups allow all ports from the worker nodes to the control plane, and the NACLs have no restrictions.

Restarting the load balancer controller pod addressed the issue for us; the previous pod was 23 days old.
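For anyone who wants to try the same workaround, the restart (deployment name and namespace assumed to match the default Helm install) is simply:

kubectl -n kube-system rollout restart deployment/aws-load-balancer-controller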

Hope this helps someone.

saiteja313 avatar Nov 01 '23 16:11 saiteja313

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 31 '24 14:01 k8s-triage-robot

/remove-lifecycle stale

Also seeing this on two distinct EKS 1.27 clusters, both with many ingresses. The IP addresses belong to the AWS EKS-managed control plane.

MartinEmrich avatar Feb 19 '24 15:02 MartinEmrich

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 19 '24 16:05 k8s-triage-robot

/remove-lifecycle stale

same issue on our end with 1.27

schlags avatar Jun 05 '24 00:06 schlags

We're seeing this issue too, although ingresses continue to be created.

Looking at the IPs specified, I suspect they belong to nodes that no longer exist.

dgard1981 avatar Jul 18 '24 11:07 dgard1981

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 16 '24 11:10 k8s-triage-robot

/remove-lifecycle stale

Still present.

MartinEmrich avatar Oct 16 '24 12:10 MartinEmrich

We have that too. Still an issue. Version: aws-load-balancer-controller:v2.8.1

awesomeinsight avatar Nov 05 '24 07:11 awesomeinsight

/remove-lifecycle stale

Still present.

Hi @MartinEmrich!

We'd be interested to know which version of the aws-load-balancer-controller you are currently running that still shows this issue on your side.

Thanks in advance!

Daniel

awesomeinsight avatar Nov 05 '24 07:11 awesomeinsight

@awesomeinsight Currently running 2.8.3.

(Update to 2.9.3 underway with the next EKS update)

MartinEmrich avatar Nov 05 '24 10:11 MartinEmrich

Seeing the same issue here. It was happening in version 2.4.5 and is still happening in 2.10.1.

2024/12/06 11:52:17 http: TLS handshake error from 172.31.214.156:54968: EOF
2024/12/06 12:12:40 http: TLS handshake error from 172.31.214.156:37102: EOF
2024/12/06 12:12:40 http: TLS handshake error from 172.31.214.156:37116: read tcp 172.31.230.150:9443->172.31.214.156:37116: read: connection reset by peer

Not sure what is causing these log lines, but no visible impact has been noticed so far.

andremissaglia avatar Dec 06 '24 14:12 andremissaglia

The IP address belongs to the EKS control plane. I guess it regularly does autoscaling and replaces the processes making the calls to the aws-load-balancer-controller. As the control plane is the client here, maybe the aws-load-balancer-controller should just not log these errors?

MartinEmrich avatar Dec 06 '24 14:12 MartinEmrich

I'm experiencing the same issue using v2.8.1 and see that it's still occurring in recent versions.

The IPs shown in the logs are not assigned to any node or process. In our cluster, nodes are frequently replaced, so as mentioned earlier in the issue, we think the ALB Ingress Controller might be attempting to send HTTP requests to nodes that have been removed and whose IPs have since changed.

augustobor avatar Feb 14 '25 13:02 augustobor

I am experiencing the same issue using v2.11.0.

yesidevelop avatar Feb 18 '25 20:02 yesidevelop

Hi folks. This error occurs when EKS rotates your API server nodes due to patching or other host-related changes. MartinEmrich is correct: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2914#issuecomment-2523349251.

There should not be any impact from these error messages, although I understand it's frustrating to see them in the logs. If someone is able to point me to an impact from these log messages, we can prioritize a fix; in the meantime, I don't see us fixing this.
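If you want to convince yourself these lines are harmless, any plain TCP connection that closes before completing the TLS handshake against the webhook port produces the same message. A rough local reproduction (namespace and deployment name assumed; 9443 is the default webhook port, matching the logs above):

# forward the webhook port, then open and immediately close a raw TCP connection
kubectl -n kube-system port-forward deploy/aws-load-balancer-controller 9443:9443 &
sleep 2
nc -z 127.0.0.1 9443
kill %1
# the controller log should then show an "http: TLS handshake error ... EOF" line for that connection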

zac-nixon avatar Feb 24 '25 23:02 zac-nixon