TLS handshake errors in tap and tap-injector logs

Open bwmetcalf opened this issue 9 months ago • 10 comments

What is the issue?

We are seeing TLS handshake error from ... errors in the tap and tap-injector logs:

~  % k get pod -n linkerd-viz -owide tap-845bb754c4-mn75s
NAME                   READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
tap-845bb754c4-mn75s   2/2     Running   0          14d   10.1.128.97   ip-10-1-137-171.us-west-2.compute.internal   <none>           <none>
~  % k logs -n linkerd-viz -c tap tap-845bb754c4-mn75s|grep handshake|head -10
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51598: EOF
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51576: EOF
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51566: EOF
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51582: EOF
2025/02/23 00:12:23 http: TLS handshake error from 10.1.128.97:43442: EOF
2025/02/23 00:12:23 http: TLS handshake error from 10.1.128.97:43438: EOF
2025/02/23 00:12:42 http: TLS handshake error from 10.1.128.97:60728: EOF
2025/02/23 00:12:42 http: TLS handshake error from 10.1.128.97:60718: EOF
2025/02/23 00:12:43 http: TLS handshake error from 10.1.128.97:60710: EOF
2025/02/23 00:17:08 http: TLS handshake error from 10.1.128.97:51188: EOF
~  % k get pod -n linkerd-viz -owide tap-injector-84899f676-gvwh8
NAME                           READY   STATUS    RESTARTS      AGE   IP             NODE                                        NOMINATED NODE   READINESS GATES
tap-injector-84899f676-gvwh8   2/2     Running   4 (23d ago)   33d   10.1.152.224   ip-10-1-154-46.us-west-2.compute.internal   <none>           <none>
~  % k logs -c tap-injector -n linkerd-viz tap-injector-84899f676-gvwh8|grep handshake|head -10
2025/02/04 18:30:00 http: TLS handshake error from 10.1.152.224:50512: EOF
2025/02/04 18:30:00 http: TLS handshake error from 10.1.152.224:50584: EOF
2025/02/04 19:30:00 http: TLS handshake error from 10.1.152.224:39828: EOF
2025/02/04 19:30:00 http: TLS handshake error from 10.1.152.224:39856: EOF
2025/02/04 19:45:00 http: TLS handshake error from 10.1.152.224:43334: EOF
2025/02/04 20:00:00 http: TLS handshake error from 10.1.152.224:60726: EOF
2025/02/04 20:00:00 http: TLS handshake error from 10.1.152.224:60740: EOF
2025/02/04 20:30:00 http: TLS handshake error from 10.1.152.224:54250: EOF
2025/02/04 20:45:00 http: TLS handshake error from 10.1.152.224:35714: EOF
2025/02/04 21:00:00 http: TLS handshake error from 10.1.152.224:50228: EOF

In each case, the IP address reported in the logs is the pod's own address: the tap pod's IP in the tap logs and the tap-injector pod's IP in the tap-injector logs.
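
To confirm that every handshake error comes from the pod's own IP, the log lines can be tallied per source address. The following is a minimal shell sketch: the function name count_handshake_errors is hypothetical, and it assumes the log line format shown above.

```shell
# Tally "TLS handshake error" log lines per source IP.
# Reads log text on stdin and prints "<ip> <count>" for each source address.
count_handshake_errors() {
  sed -n -E 's/.*TLS handshake error from ([0-9.]+):[0-9]+: .*/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Example usage against the tap pod from this report:
# kubectl logs -n linkerd-viz -c tap tap-845bb754c4-mn75s | count_handshake_errors
```

If the only address printed matches the IP column of `kubectl get pod -owide`, the failed connections are being made from within the pod itself.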

How can it be reproduced?

Presumably, any vanilla linkerd viz deployment will result in these errors.

Logs, error output, etc

See original description.

output of linkerd check -o short

~  % linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.11.8 but the latest edge version is 25.2.2
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.11.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-84f8887748-b424m (edge-24.11.8)
	* linkerd-destination-84f8887748-dgbbw (edge-24.11.8)
	* linkerd-destination-84f8887748-wcs6l (edge-24.11.8)
	* linkerd-identity-66ff997c9-jlzf6 (edge-24.11.8)
	* linkerd-identity-66ff997c9-vt6wv (edge-24.11.8)
	* linkerd-identity-66ff997c9-z8q59 (edge-24.11.8)
	* linkerd-proxy-injector-5959bfcb57-8rddw (edge-24.11.8)
	* linkerd-proxy-injector-5959bfcb57-qt4hq (edge-24.11.8)
	* linkerd-proxy-injector-5959bfcb57-vr9q9 (edge-24.11.8)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-84f8887748-b424m running edge-24.11.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-76d9495fb5-5m896 (edge-24.11.8)
	* prometheus-554f465879-5p2v7 (edge-24.11.8)
	* tap-845bb754c4-mn75s (edge-24.11.8)
	* tap-injector-84899f676-gvwh8 (edge-24.11.8)
	* web-66f97d9494-7kb57 (edge-24.11.8)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-76d9495fb5-5m896 running edge-24.11.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

% k version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.9-eks-8cce635

linkerd version is edge-24.11.8.

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

bwmetcalf avatar Feb 27 '25 18:02 bwmetcalf

Can you include details on how you installed Linkerd and Viz (Helm? CLI?). Certificates are typically generated during installation (or configured to be managed by something like Cert Manager). How long has this installation been running? Is it possible your certificates are expired?

olix0r avatar Mar 06 '25 15:03 olix0r

We install using Helm (via the Terraform helm_release resource) and generate our certs with the following Terraform:

resource "tls_private_key" "ca" {
  algorithm   = "ECDSA"
  ecdsa_curve = "P256"
}

resource "tls_self_signed_cert" "ca" {
  is_ca_certificate     = true
  private_key_pem       = tls_private_key.ca.private_key_pem
  set_subject_key_id    = true
  validity_period_hours = 87600 # Set to 10 years, as recommended by the linkerd documentation: https://linkerd.io/2-edge/features/automatic-mtls/#operational-concerns

  allowed_uses = [
    "cert_signing",
    "crl_signing",
  ]

  subject {
    common_name = "root.linkerd.cluster.local"
  }
}

resource "tls_private_key" "issuer" {
  algorithm   = "ECDSA"
  ecdsa_curve = "P256"
}

resource "tls_cert_request" "issuer" {
  private_key_pem = tls_private_key.issuer.private_key_pem

  subject {
    common_name = "identity.linkerd.cluster.local"
  }
}

resource "tls_locally_signed_cert" "issuer" {
  ca_cert_pem           = tls_self_signed_cert.ca.cert_pem
  ca_private_key_pem    = tls_private_key.ca.private_key_pem
  cert_request_pem      = tls_cert_request.issuer.cert_request_pem
  is_ca_certificate     = true
  set_subject_key_id    = true
  validity_period_hours = 8760

  allowed_uses = [
    "cert_signing",
    "crl_signing",
  ]
}

This installation has been in place for about 8 months and is working fine, notwithstanding the handshake errors. These errors have appeared in the logs from the start. Our certs are currently valid.

Let me know if you need more information.

bwmetcalf avatar Mar 07 '25 17:03 bwmetcalf

Here's the full check output in case it helps:

% linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all pods
√ cluster networks contains all services

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    is running version 24.11.8 but the latest edge version is 25.3.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.11.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-7465f94559-bdfvj (edge-24.11.8)
	* linkerd-destination-7465f94559-bkzcg (edge-24.11.8)
	* linkerd-destination-7465f94559-rk5qf (edge-24.11.8)
	* linkerd-identity-75ffd59db4-m4b4r (edge-24.11.8)
	* linkerd-identity-75ffd59db4-wtbsl (edge-24.11.8)
	* linkerd-identity-75ffd59db4-zkk22 (edge-24.11.8)
	* linkerd-proxy-injector-7478b6f95c-hfb24 (edge-24.11.8)
	* linkerd-proxy-injector-7478b6f95c-qtffv (edge-24.11.8)
	* linkerd-proxy-injector-7478b6f95c-r5bdz (edge-24.11.8)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-7465f94559-bdfvj running edge-24.11.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints
√ multiple replicas of control plane pods

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-596dd558dd-296q5 (edge-24.11.8)
	* prometheus-784784b9ff-dv486 (edge-24.11.8)
	* tap-67c99d96ff-m22cm (edge-24.11.8)
	* tap-injector-9c9f49f9d-8wnxz (edge-24.11.8)
	* web-5d5ccdd666-ws9rb (edge-24.11.8)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-596dd558dd-296q5 running edge-24.11.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints
√ prometheus is installed and configured correctly
√ viz extension self-check

Status check results are √

bwmetcalf avatar Mar 07 '25 19:03 bwmetcalf

Can you set the log level for the tap-injector deployment to debug and share back the logs?

alpeb avatar Mar 13 '25 14:03 alpeb

Unfortunately, there isn't much to go on.

The ServeHTTP entries look like

ServeHTTP(): &{Method:GET URL:/apis/tap.linkerd.io/v1alpha1 Proto:HTTP/2.0 ProtoMajor:2 ProtoMinor:0 Header:map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/2.0] X-Remote-Group:[system:masters] X-Remote-User:[system:kube-aggregator]] Body:0xc0015c8840 GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:10.3.171.97:8089 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr:10.3.171.97:56532 RequestURI:/apis/tap.linkerd.io/v1alpha1 TLS:0xc000548630 Cancel:<nil> Response:<nil> ctx:0xc003c802a0 pat:<nil> matches:[] otherValues:map[]}

bwmetcalf avatar Mar 20 '25 15:03 bwmetcalf

I think I can reproduce the same errors using certs from cert-manager, as described in #14059. @bwmetcalf how did you set up the certificates? When I use the self-signed ones generated directly by the Helm chart, it works. Since I'm generating equivalent certs with cert-manager, I suspect the error lies in how tap reads the injected ca-bundle.

Kakadus avatar May 28 '25 21:05 Kakadus

Thanks @Kakadus. This appears to be the same issue. We generate external certs with the tls_* suite of Terraform resources and then provide them in our values.yaml under identityTrustAnchorsPEM and identity. Below is the exact Terraform we are using:

resource "tls_private_key" "ca" {
  algorithm   = "ECDSA"
  ecdsa_curve = "P256"
}

resource "tls_self_signed_cert" "ca" {
  is_ca_certificate     = true
  private_key_pem       = tls_private_key.ca.private_key_pem
  set_subject_key_id    = true
  validity_period_hours = 87600 # Set to 10 years, as recommended by the linkerd documentation: https://linkerd.io/2-edge/features/automatic-mtls/#operational-concerns

  allowed_uses = [
    "cert_signing",
    "crl_signing",
  ]

  subject {
    common_name = "root.linkerd.cluster.local"
  }
}

resource "tls_private_key" "issuer" {
  algorithm   = "ECDSA"
  ecdsa_curve = "P256"
}

resource "tls_cert_request" "issuer" {
  private_key_pem = tls_private_key.issuer.private_key_pem

  subject {
    common_name = "identity.linkerd.cluster.local"
  }
}

resource "tls_locally_signed_cert" "issuer" {
  ca_cert_pem           = tls_self_signed_cert.ca.cert_pem
  ca_private_key_pem    = tls_private_key.ca.private_key_pem
  cert_request_pem      = tls_cert_request.issuer.cert_request_pem
  is_ca_certificate     = true
  set_subject_key_id    = true
  validity_period_hours = 8760

  allowed_uses = [
    "cert_signing",
    "crl_signing",
  ]
}

bwmetcalf avatar May 29 '25 00:05 bwmetcalf

@bwmetcalf how are you passing those to the linkerd-viz chart?

Kakadus avatar Jun 06 '25 08:06 Kakadus

Currently, we are not. Should we be for viz? And should we define these for the proxy injector? Everything is functioning correctly it seems, but perhaps we've overlooked something.

To be clear, we are only specifying the following in the control plane chart values

identityTrustAnchorsPEM: |
  ${indent(2, identity_trust_anchors_pem)}

identity:
  issuer:
    tls:
      crtPEM: |
        ${indent(8, identity_issuer_crt_pem)}
      keyPEM: |
        ${indent(8, identity_issuer_key_pem)}

The values are generated via terraform with the code I shared above and then passed in to our template with

      identity_issuer_crt_pem    = tls_locally_signed_cert.issuer.cert_pem
      identity_issuer_key_pem    = tls_private_key.issuer.private_key_pem
      identity_trust_anchors_pem = tls_locally_signed_cert.issuer.ca_cert_pem

bwmetcalf avatar Jun 06 '25 17:06 bwmetcalf

Okay, but how is linkerd-viz deployed? So far you only talked about the control plane (if I did not miss something obvious). tap and tap-injector are part of linkerd-viz, not the control plane.

If you are certain that you don't pass any values like

tap:
  crtPEM:
  keyPEM:
tapInjector:
  crtPEM:
  keyPEM:

to the linkerd-viz chart, Helm should generate self-signed certificates for you (those certs do work in my cluster™). The PEMs are then stored in secrets named tap-k8s-tls and tap-injector-k8s-tls in the linkerd-viz namespace. If those exist and are valid, then I'm out of ideas as well.
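
One way to check whether those secrets hold valid certificates is to decode them and print the subject and validity window. This is a sketch only: cert_summary is a hypothetical helper name, and the tls.crt key is an assumption based on the usual layout of kubernetes.io/tls secrets, so adjust it if the secrets in your cluster use different keys.

```shell
# Print the subject and validity window of a PEM certificate read from stdin.
cert_summary() {
  openssl x509 -noout -subject -startdate -enddate
}

# Example usage (assumes the certificate is stored under the tls.crt key):
# kubectl get secret tap-k8s-tls -n linkerd-viz -o jsonpath='{.data.tls\.crt}' | base64 -d | cert_summary
# kubectl get secret tap-injector-k8s-tls -n linkerd-viz -o jsonpath='{.data.tls\.crt}' | base64 -d | cert_summary
```

A notAfter date in the past, or a subject that does not match the tap or tap-injector service, would point at the certificates rather than at the servers.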

Kakadus avatar Jun 06 '25 23:06 Kakadus

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 05 '25 04:09 stale[bot]