
TLS handshake error from API server

Open sknmi opened this issue 1 year ago • 17 comments

Description

Observed Behavior:

karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:06:16.304Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:40666: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-hzfgs controller {"level":"ERROR","time":"2024-08-30T08:07:18.550Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:58290: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:07:18.571Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:55794: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:07:18.572Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:55792: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-hzfgs controller {"level":"ERROR","time":"2024-08-30T08:08:10.419Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:43424: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:08:10.427Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:52314: EOF\n","commit":"62a726c"}
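
The lines above follow a `<pod> <container> <JSON>` layout. As a rough sketch (the helper names are made up for illustration and are not part of Karpenter), they can be tallied per pod to see whether every replica logs the error:

```python
import json
from collections import Counter

# Substring that identifies the noisy webhook message in this thread.
HANDSHAKE_MARKER = "TLS handshake error"


def parse_line(line: str):
    """Split '<pod> <container> <json>' into (pod, record); return None if the line does not fit."""
    start = line.find("{")
    if start == -1:
        return None
    prefix = line[:start].split()
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    pod = prefix[0] if prefix else "unknown"
    return pod, record


def count_handshake_errors(lines):
    """Count webhook TLS handshake errors per pod."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        pod, record = parsed
        if record.get("logger") == "webhook" and HANDSHAKE_MARKER in record.get("message", ""):
            counts[pod] += 1
    return counts
```

Passing `sys.stdin` to `count_handshake_errors` while piping in the controller logs would give the per-pod totals.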

Expected Behavior: No errors :)

Reproduction Steps (Please include YAML): Karpenter on Fargate in the karpenter namespace. These messages started to appear after upgrading to 1.0.1.

Versions:

  • Chart Version: 1.0.1
  • Kubernetes Version (kubectl version): 1.30

sknmi avatar Aug 30 '24 08:08 sknmi

fixed with

webhook:
  enabled: false

sknmi avatar Aug 30 '24 09:08 sknmi

I don't think this issue should be closed. I am seeing a similar error in my log messages and require the webhook to remain enabled to facilitate the conversion to the latest api version for my resources.

levinedaniel avatar Sep 10 '24 18:09 levinedaniel

I agree with @levinedaniel. What is the reason for marking this as solved and closing it with

webhook:
  enabled: false

The webhook is broken.

ezh avatar Sep 12 '24 06:09 ezh

Same, v1.0.2. Please re-open.

Is disabling the webhook an acceptable solution, or will some functionality stop working?

Hronom avatar Sep 20 '24 12:09 Hronom

cc @sknmi message above

Hronom avatar Sep 20 '24 12:09 Hronom

@Hronom reopened :)

sknmi avatar Sep 20 '24 12:09 sknmi

Also seeing this issue after upgrading to v0.37.3.

m0untains avatar Sep 20 '24 23:09 m0untains

Saw this issue on 0.37.3 and 1.0.1

adawalli avatar Sep 23 '24 22:09 adawalli

Seeing same in 1.0.2

AnkitBhalla22 avatar Sep 25 '24 07:09 AnkitBhalla22

Below findings are incorrect

Here is my observation. Please let me know if this is incorrect:

Karpenter does not provide a ca-client bundle as we can see from here.

When I look at the CRD in my cluster, I can see that it has been injected with a caBundle:

    webhook:
      clientConfig:
        caBundle: Redacted...
        service:
          name: karpenter
          namespace: karpenter
          path: /conversion/karpenter.sh
          port: 8443
      conversionReviewVersions:
      - v1beta1
      - v1
  group: karpenter.sh

I believe this is happening through a CA injector. This means that the client config for this webhook has a caBundle specified, but Karpenter uses knative to inject certificate data into the karpenter-cert secret, which comes from here.

So the CA for the CRD and the CA for the webhooks do not match, hence the error. If this is correct, then maybe we can look at possible solutions.


I am still not sure how CA bundle is injected in CRD and I did see at one point that the CA bundle in secret vs CRD was different.
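
One way to test the mismatch theory directly is to decode both values and compare them. A minimal sketch, assuming the CRD and the webhook's Secret have already been fetched as JSON (for example with kubectl get ... -o json) and that the Secret stores its CA under the key ca-cert.pem; that key name is an assumption, not something confirmed in this thread:

```python
import base64

# Assumed key under which the Secret stores the CA certificate (not confirmed in this thread).
DEFAULT_CA_KEY = "ca-cert.pem"


def crd_ca_bundle(crd: dict) -> bytes:
    """Decode .spec.conversion.webhook.clientConfig.caBundle from a CRD object."""
    encoded = crd["spec"]["conversion"]["webhook"]["clientConfig"]["caBundle"]
    return base64.b64decode(encoded)


def secret_ca(secret: dict, key: str = DEFAULT_CA_KEY) -> bytes:
    """Decode the CA certificate stored in a Secret's .data map."""
    return base64.b64decode(secret["data"][key])


def ca_matches(crd: dict, secret: dict, key: str = DEFAULT_CA_KEY) -> bool:
    """True if the CRD's caBundle and the Secret's CA are byte-identical (ignoring outer whitespace)."""
    return crd_ca_bundle(crd).strip() == secret_ca(secret, key).strip()
```

If `ca_matches` returns False, the two values differed at the moment they were fetched; a single check does not show how or when the bundle was injected.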

liafizan avatar Sep 25 '24 22:09 liafizan

This appears to be the same issue we saw with our defaulting / validating webhooks previously. The original issue was closed out when those webhooks were disabled by default: https://github.com/kubernetes-sigs/karpenter/issues/718. I've been able to reproduce it, and as with that issue, there does not appear to be any actual impact on Karpenter's operation, so the errors can be safely ignored.

From the original issue:

These TLS errors appear to be related to https://github.com/kubernetes/kubernetes/issues/109022 which states that these handshake errors may be generated by some caching mechanism that is happening in the standard library that causes TLS errors on a cert rotation.

@liafizan are you still running into this? The cert is injected by knative, and I've been unable to reproduce. If you're still encountering this, I'd recommend opening a separate issue. I don't think it's related to the TLS errors we're seeing here.

I am still not sure how CA bundle is injected in CRD and I did see at one point that the CA bundle in secret vs CRD was different.

I'm going to mark this issue as solved for now, but let us know if any of you believe this issue is impacting Karpenter's ability to operate.

jmdeal avatar Oct 04 '24 17:10 jmdeal

Hello @jmdeal,

After upgrading to minor version 0.37.5 to enable deletion of the webhooks when deploying with ArgoCD, I see two things:

  • First, the validating and mutating webhooks are now properly deleted when using ArgoCD.
  • Second, my CRDs are not in version v1 and are still in v1beta1 so IMO the TLS handshake error is causing the conversion webhook to fail, which is a problem for the Karpenter migration to v1.0.x. Running kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.versions[*].name}' returns v1 v1beta1, so both versions exist in the cluster. Therefore the TLS handshake error in my case seems to prevent the validating webhook from performing the v1 migration. I checked the logs inside the controller and that is all I got from the webhook ...

laserpedro avatar Oct 07 '24 05:10 laserpedro

the second one is that my CRDs are not in version v1 and are still in v1beta1 so IMO the TLS handshake error is causing the conversion webhook to fail

This doesn't indicate any issue with the conversion webhook. If you're on any pre-1.0 version with the conversion webhooks, the storage version is still v1beta1. The conversion webhooks only exist on those versions to enable rollback from v1.0. Also, once you upgrade to v1, both versions will still be present on the CRD, one isn't automatically removed once all stored resources are converted. Instead, you want to look at .status.storedVersions on the CRDs. On Karpenter v1.0.5+ Karpenter will remove v1beta1 from the stored versions once all CRs have been successfully migrated.
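
The check described above could be scripted roughly as follows. This is only a sketch: the function names are illustrative, and it assumes kubectl is on the PATH and pointed at the right cluster.

```python
import json
import subprocess


def fetch_crd(name: str) -> dict:
    """Fetch a CRD as parsed JSON via kubectl (assumes kubectl is installed and configured)."""
    result = subprocess.run(
        ["kubectl", "get", "crd", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def served_versions(crd: dict) -> list:
    """Versions listed under .spec.versions; both v1 and v1beta1 can appear here after the upgrade."""
    return [v["name"] for v in crd.get("spec", {}).get("versions", [])]


def stored_versions(crd: dict) -> list:
    """Versions listed under .status.storedVersions, which is the field that reflects migration state."""
    return crd.get("status", {}).get("storedVersions", [])


def migration_done(crd: dict, old_version: str = "v1beta1") -> bool:
    """True once the old version no longer appears in .status.storedVersions."""
    versions = stored_versions(crd)
    return bool(versions) and old_version not in versions
```

For example, `migration_done(fetch_crd("nodeclaims.karpenter.sh"))` would report whether v1beta1 has been dropped from the stored versions of that CRD.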

jmdeal avatar Oct 07 '24 15:10 jmdeal

@jmdeal thank you for your answer. I misunderstood the conversion webhook and thought it was the other way around. Thanks for the clarification!

laserpedro avatar Oct 07 '24 15:10 laserpedro

We are seeing this same behavior. Upgrade from 0.37.0 to 1.0.3 (with a minor upgrade to 0.37.3 during the upgrade process). The error seems to be innocuous, but I wanted to see if there was any impact to the core functionality of Karpenter.

elihuj117 avatar Oct 07 '24 21:10 elihuj117

I have done the upgrade from 0.37.5 to 1.0.6 and still see this issue. I had enabled the webhook in 0.37.5, and this error is from Karpenter 1.0.6:

{"level":"ERROR","time":"2024-10-09T14:27:06.147Z","logger":"webhook","message":"http: TLS handshake error from 10.214.2.206:34084: EOF\n","commit":"6174c75"}
{"level":"ERROR","time":"2024-10-09T14:27:06.319Z","logger":"webhook","message":"http: TLS handshake error from 10.214.60.56:40108: EOF\n","commit":"6174c75"}

apurvabhandari avatar Oct 09 '24 14:10 apurvabhandari

+1

itayvolo avatar Oct 13 '24 16:10 itayvolo

I think this issue is caused by the conversion webhook configured on the CRDs (I have had a hard time with these already with #6818). I use pulumi transforms to remove them, the error is gone:

    transforms: [
      ({ props, opts, type }) => {
        if (type === "kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition") {
          // Disable the Karpenter conversion webhooks, which were only useful when upgrading to v1 and now cause errors
          props.spec.conversion = undefined;
          return { props, opts };
        }
        return undefined;
      }
    ]

awoimbee avatar Oct 23 '24 10:10 awoimbee

Hi, I upgraded Karpenter from v0.33.10 to v1.0.3 following the upgrade guide, https://karpenter.sh/docs/upgrading/v1-migration/#upgrade-procedure. As others mentioned above, I ran into the TLS error, but without any impact on Karpenter's functionality.

{"level":"ERROR","time":"2024-11-01T05:16:43.587Z","logger":"webhook","message":"http: TLS handshake error from 100.x.x.x:32858: read tcp 100.x.x.x:8443->100.x.x.x:32858: read: connection reset by peer\n","commit":"688ea21"}
{"level":"ERROR","time":"2024-11-01T05:16:43.590Z","logger":"webhook","message":"http: TLS handshake error from 100.x.x.x:32876: read tcp 100.x.x.x:8443->100.x.x.x:32876: read: connection reset by peer\n","commit":"688ea21"}

I was able to get rid of the errors by disabling the webhook with DISABLE_WEBHOOK=true, but as mentioned in the thread below, I am also not sure about the repercussions of this: https://github.com/kubernetes-sigs/karpenter/issues/718#issuecomment-2447546036

Following the discussions in these threads, I believe these webhooks are necessary to migrate the API from v1beta1 to v1 in a future release. Can someone comment on this?

ajith-thomas-fw avatar Nov 01 '24 12:11 ajith-thomas-fw

This issue has been inactive for 7 days and is marked as "triage/solved". StaleBot will close this stale issue after 7 more days of inactivity.

github-actions[bot] avatar Dec 10 '24 22:12 github-actions[bot]

I would like to hear clarification about this from the developers. Specifically, what is the recommended approach if you use the latest version of Karpenter?

I still don't understand what the webhooks are used for and whether I need to keep them enabled in the latest version of Karpenter.

Hronom avatar Dec 15 '24 16:12 Hronom

This issue has been inactive for 7 days and is marked as "triage/solved". StaleBot will close this stale issue after 7 more days of inactivity.

github-actions[bot] avatar Dec 24 '24 12:12 github-actions[bot]

One of the things I notice is that if we run a single replica of Karpenter, this error goes away. Not a recommendation, but reporting an observation if it helps the investigation.

prad9192 avatar Dec 26 '24 04:12 prad9192

This issue has been inactive for 7 days and is marked as "triage/solved". StaleBot will close this stale issue after 7 more days of inactivity.

github-actions[bot] avatar Jan 02 '25 12:01 github-actions[bot]

I hope this error will be gone with the update to 1.1, which should support only the v1 API.

nantiferov avatar Jan 02 '25 12:01 nantiferov

This issue has been inactive for 7 days and is marked as "triage/solved". StaleBot will close this stale issue after 7 more days of inactivity.

github-actions[bot] avatar Jan 10 '25 12:01 github-actions[bot]

Are the devs gone, or why are they not responding? What are these webhooks, and is disabling them safe?

korncola avatar Jan 13 '25 15:01 korncola

The issue still exists after upgrading to Karpenter v1.1.1. It is quite misleading and pollutes our logs. Do you recommend turning off the webhook?

"message":"http: TLS handshake error from [2a05:d014:3b8:5c05::221b]:41978: EOF\n","commit":"a2875e3"}

ajaykumarmandapati avatar Jan 20 '25 19:01 ajaykumarmandapati

This issue has been inactive for 7 days and is marked as "triage/solved". StaleBot will close this stale issue after 7 more days of inactivity.

github-actions[bot] avatar Jan 28 '25 12:01 github-actions[bot]

How will maintainers know about an issue if it is auto-closed?

Hronom avatar Jan 28 '25 13:01 Hronom