TLS handshake errors in tap and tap-injector logs
What is the issue?
We are seeing "TLS handshake error from ...: EOF" errors in the tap and tap-injector logs:
~ % k get pod -n linkerd-viz -owide tap-845bb754c4-mn75s
NAME                   READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
tap-845bb754c4-mn75s   2/2     Running   0          14d   10.1.128.97   ip-10-1-137-171.us-west-2.compute.internal   <none>           <none>
~ % k logs -n linkerd-viz -c tap tap-845bb754c4-mn75s|grep handshake|head -10
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51598: EOF
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51576: EOF
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51566: EOF
2025/02/13 15:39:17 http: TLS handshake error from 10.1.128.97:51582: EOF
2025/02/23 00:12:23 http: TLS handshake error from 10.1.128.97:43442: EOF
2025/02/23 00:12:23 http: TLS handshake error from 10.1.128.97:43438: EOF
2025/02/23 00:12:42 http: TLS handshake error from 10.1.128.97:60728: EOF
2025/02/23 00:12:42 http: TLS handshake error from 10.1.128.97:60718: EOF
2025/02/23 00:12:43 http: TLS handshake error from 10.1.128.97:60710: EOF
2025/02/23 00:17:08 http: TLS handshake error from 10.1.128.97:51188: EOF
~ % k get pod -n linkerd-viz -owide tap-injector-84899f676-gvwh8
NAME                           READY   STATUS    RESTARTS      AGE   IP             NODE                                        NOMINATED NODE   READINESS GATES
tap-injector-84899f676-gvwh8   2/2     Running   4 (23d ago)   33d   10.1.152.224   ip-10-1-154-46.us-west-2.compute.internal   <none>           <none>
~ % k logs -c tap-injector -n linkerd-viz tap-injector-84899f676-gvwh8|grep handshake|head -10
2025/02/04 18:30:00 http: TLS handshake error from 10.1.152.224:50512: EOF
2025/02/04 18:30:00 http: TLS handshake error from 10.1.152.224:50584: EOF
2025/02/04 19:30:00 http: TLS handshake error from 10.1.152.224:39828: EOF
2025/02/04 19:30:00 http: TLS handshake error from 10.1.152.224:39856: EOF
2025/02/04 19:45:00 http: TLS handshake error from 10.1.152.224:43334: EOF
2025/02/04 20:00:00 http: TLS handshake error from 10.1.152.224:60726: EOF
2025/02/04 20:00:00 http: TLS handshake error from 10.1.152.224:60740: EOF
2025/02/04 20:30:00 http: TLS handshake error from 10.1.152.224:54250: EOF
2025/02/04 20:45:00 http: TLS handshake error from 10.1.152.224:35714: EOF
2025/02/04 21:00:00 http: TLS handshake error from 10.1.152.224:50228: EOF
The IP address reported in the logs for each is the address assigned to the tap and tap-injector pods, respectively.
How can it be reproduced?
Presumably, any vanilla linkerd viz deployment will result in these errors.
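A minimal reproduction sketch, assuming a test cluster that already has the Linkerd control plane installed (the errors may take some time to accumulate):
~ % linkerd viz install | kubectl apply -f -
~ % kubectl -n linkerd-viz logs deploy/tap -c tap | grep 'TLS handshake error'
~ % kubectl -n linkerd-viz logs deploy/tap-injector -c tap-injector | grep 'TLS handshake error'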
Logs, error output, etc
See original description.
output of linkerd check -o short
~ % linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
unsupported version channel: stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 24.11.8 but the latest edge version is 25.2.2
see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-24.11.8 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-84f8887748-b424m (edge-24.11.8)
* linkerd-destination-84f8887748-dgbbw (edge-24.11.8)
* linkerd-destination-84f8887748-wcs6l (edge-24.11.8)
* linkerd-identity-66ff997c9-jlzf6 (edge-24.11.8)
* linkerd-identity-66ff997c9-vt6wv (edge-24.11.8)
* linkerd-identity-66ff997c9-z8q59 (edge-24.11.8)
* linkerd-proxy-injector-5959bfcb57-8rddw (edge-24.11.8)
* linkerd-proxy-injector-5959bfcb57-qt4hq (edge-24.11.8)
* linkerd-proxy-injector-5959bfcb57-vr9q9 (edge-24.11.8)
see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-84f8887748-b424m running edge-24.11.8 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints
linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-76d9495fb5-5m896 (edge-24.11.8)
* prometheus-554f465879-5p2v7 (edge-24.11.8)
* tap-845bb754c4-mn75s (edge-24.11.8)
* tap-injector-84899f676-gvwh8 (edge-24.11.8)
* web-66f97d9494-7kb57 (edge-24.11.8)
see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-76d9495fb5-5m896 running edge-24.11.8 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
% k version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.9-eks-8cce635
linkerd version is edge-24.11.8.
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
Can you include details on how you installed Linkerd and Viz (Helm? CLI?). Certificates are typically generated during installation (or configured to be managed by something like Cert Manager). How long has this installation been running? Is it possible your certificates are expired?
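For example, one quick way to check the identity issuer cert's expiry (assuming the default linkerd-identity-issuer secret with a crt.pem key; names can differ depending on how the certs were supplied):
~ % kubectl -n linkerd get secret linkerd-identity-issuer -o jsonpath='{.data.crt\.pem}' | base64 -d | openssl x509 -noout -subject -enddate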
We install using Helm (via the Terraform helm_release resource) and generate our certs with:
resource "tls_private_key" "ca" {
algorithm = "ECDSA"
ecdsa_curve = "P256"
}
resource "tls_self_signed_cert" "ca" {
is_ca_certificate = true
private_key_pem = tls_private_key.ca.private_key_pem
set_subject_key_id = true
validity_period_hours = 87600 # Set to 10 years, as recommended by the linkerd documentation: https://linkerd.io/2-edge/features/automatic-mtls/#operational-concerns
allowed_uses = [
"cert_signing",
"crl_signing",
]
subject {
common_name = "root.linkerd.cluster.local"
}
}
resource "tls_private_key" "issuer" {
algorithm = "ECDSA"
ecdsa_curve = "P256"
}
resource "tls_cert_request" "issuer" {
private_key_pem = tls_private_key.issuer.private_key_pem
subject {
common_name = "identity.linkerd.cluster.local"
}
}
resource "tls_locally_signed_cert" "issuer" {
ca_cert_pem = tls_self_signed_cert.ca.cert_pem
ca_private_key_pem = tls_private_key.ca.private_key_pem
cert_request_pem = tls_cert_request.issuer.cert_request_pem
is_ca_certificate = true
set_subject_key_id = true
validity_period_hours = 8760
allowed_uses = [
"cert_signing",
"crl_signing",
]
}
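As a sanity check of the chain these resources produce (assuming the PEMs are exported to local files ca.crt and issuer.crt), openssl can confirm the issuer is signed by the root and show its validity window:
~ % openssl verify -CAfile ca.crt issuer.crt
~ % openssl x509 -in issuer.crt -noout -subject -issuer -dates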
This installation has been in place for about 8 months and is working fine, notwithstanding the handshake errors. These errors have always appeared in the logs. Our certs are currently valid.
Let me know if you need more information.
Here's the full check output in case it helps:
% linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all pods
√ cluster networks contains all services
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
unsupported version channel: stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-version-cli for hints
control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
is running version 24.11.8 but the latest edge version is 25.3.1
see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-24.11.8 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-7465f94559-bdfvj (edge-24.11.8)
* linkerd-destination-7465f94559-bkzcg (edge-24.11.8)
* linkerd-destination-7465f94559-rk5qf (edge-24.11.8)
* linkerd-identity-75ffd59db4-m4b4r (edge-24.11.8)
* linkerd-identity-75ffd59db4-wtbsl (edge-24.11.8)
* linkerd-identity-75ffd59db4-zkk22 (edge-24.11.8)
* linkerd-proxy-injector-7478b6f95c-hfb24 (edge-24.11.8)
* linkerd-proxy-injector-7478b6f95c-qtffv (edge-24.11.8)
* linkerd-proxy-injector-7478b6f95c-r5bdz (edge-24.11.8)
see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-7465f94559-bdfvj running edge-24.11.8 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints
linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints
√ multiple replicas of control plane pods
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-596dd558dd-296q5 (edge-24.11.8)
* prometheus-784784b9ff-dv486 (edge-24.11.8)
* tap-67c99d96ff-m22cm (edge-24.11.8)
* tap-injector-9c9f49f9d-8wnxz (edge-24.11.8)
* web-5d5ccdd666-ws9rb (edge-24.11.8)
see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-596dd558dd-296q5 running edge-24.11.8 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints
√ prometheus is installed and configured correctly
√ viz extension self-check
Status check results are √
Can you set the log level for the tap-injector deployment to debug and share back the logs?
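For reference, assuming viz was installed via Helm and the chart exposes a tapInjector.logLevel value (the release and chart names below are assumptions), something like this should do it:
~ % helm upgrade linkerd-viz linkerd/linkerd-viz -n linkerd-viz --reuse-values --set tapInjector.logLevel=debug
~ % kubectl -n linkerd-viz rollout status deploy/tap-injector
~ % kubectl -n linkerd-viz logs deploy/tap-injector -c tap-injector --tail=100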
Unfortunately, there isn't much to go on.
The ServeHTTP entries look like this:
ServeHTTP(): &{Method:GET URL:/apis/tap.linkerd.io/v1alpha1 Proto:HTTP/2.0 ProtoMajor:2 ProtoMinor:0 Header:map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/2.0] X-Remote-Group:[system:masters] X-Remote-User:[system:kube-aggregator]] Body:0xc0015c8840 GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:10.3.171.97:8089 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr:10.3.171.97:56532 RequestURI:/apis/tap.linkerd.io/v1alpha1 TLS:0xc000548630 Cancel:<nil> Response:<nil> ctx:0xc003c802a0 pat:<nil> matches:[] otherValues:map[]}
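The X-Remote-User: system:kube-aggregator header indicates these requests come from the Kubernetes API aggregation layer, which fronts tap through an APIService. Its registration, including the caBundle used to verify tap's serving cert, can be inspected with (APIService name inferred from the /apis/tap.linkerd.io/v1alpha1 path above):
~ % kubectl get apiservice v1alpha1.tap.linkerd.io -o yaml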
I think I can reproduce the same errors using certs from cert-manager, as described in #14059. @bwmetcalf how did you set up the certificates? When I use the self-signed ones generated directly by the helm chart, it works. Since I'm generating equivalent certs with cert-manager, I suspect the error lies in tap reading the injected CA bundle.
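For comparison, a cert-manager setup along these lines, modeled on the pattern in the Linkerd docs (the resource, issuer, and secret names here are assumptions), would look roughly like:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h
  renewBefore: 25h
  issuerRef:
    name: linkerd-trust-anchor # CA Issuer backed by the root cert
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth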
Thanks @Kakadus. This appears to be the same issue. We are generating external certs with the tls_* suite of Terraform resources and then providing them to our values.yaml under identityTrustAnchorsPEM and identity. Below is the exact Terraform we are using:
resource "tls_private_key" "ca" {
algorithm = "ECDSA"
ecdsa_curve = "P256"
}
resource "tls_self_signed_cert" "ca" {
is_ca_certificate = true
private_key_pem = tls_private_key.ca.private_key_pem
set_subject_key_id = true
validity_period_hours = 87600 # Set to 10 years, as recommended by the linkerd documentation: https://linkerd.io/2-edge/features/automatic-mtls/#operational-concerns
allowed_uses = [
"cert_signing",
"crl_signing",
]
subject {
common_name = "root.linkerd.cluster.local"
}
}
resource "tls_private_key" "issuer" {
algorithm = "ECDSA"
ecdsa_curve = "P256"
}
resource "tls_cert_request" "issuer" {
private_key_pem = tls_private_key.issuer.private_key_pem
subject {
common_name = "identity.linkerd.cluster.local"
}
}
resource "tls_locally_signed_cert" "issuer" {
ca_cert_pem = tls_self_signed_cert.ca.cert_pem
ca_private_key_pem = tls_private_key.ca.private_key_pem
cert_request_pem = tls_cert_request.issuer.cert_request_pem
is_ca_certificate = true
set_subject_key_id = true
validity_period_hours = 8760
allowed_uses = [
"cert_signing",
"crl_signing",
]
}
@bwmetcalf how are you passing those to the linkerd-viz chart?
Currently, we are not. Should we be for viz? And should we define these for the proxy injector? Everything seems to be functioning correctly, but perhaps we've overlooked something.
To be clear, we are only specifying the following in the control plane chart values:
identityTrustAnchorsPEM: |
  ${indent(2, identity_trust_anchors_pem)}
identity:
  issuer:
    tls:
      crtPEM: |
        ${indent(8, identity_issuer_crt_pem)}
      keyPEM: |
        ${indent(8, identity_issuer_key_pem)}
The values are generated via Terraform with the code I shared above and then passed in to our template with:
identity_issuer_crt_pem    = tls_locally_signed_cert.issuer.cert_pem
identity_issuer_key_pem    = tls_private_key.issuer.private_key_pem
identity_trust_anchors_pem = tls_locally_signed_cert.issuer.ca_cert_pem
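For context, these presumably feed the helm_release through something like templatefile() (the template file name below is hypothetical):
values = [
  templatefile("${path.module}/linkerd-values.yaml.tpl", {
    identity_issuer_crt_pem    = tls_locally_signed_cert.issuer.cert_pem
    identity_issuer_key_pem    = tls_private_key.issuer.private_key_pem
    identity_trust_anchors_pem = tls_locally_signed_cert.issuer.ca_cert_pem
  })
]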
Okay, but how is linkerd-viz deployed? So far you have only talked about the control plane (unless I missed something obvious). tap and tap-injector are part of linkerd-viz, not the control plane.
If you are certain that you don't pass any values like
tap:
  crtPEM:
  keyPEM:
tapInjector:
  crtPEM:
  keyPEM:
to the linkerd-viz chart, Helm should generate self-signed certificates for you (those certs do work in my cluster™). The PEMs are then stored in secrets named tap-k8s-tls and tap-injector-k8s-tls in the linkerd-viz namespace. If those exist and are valid, then I'm out of ideas as well.
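For what it's worth, one way to confirm those secrets exist and are still valid (assuming the certs are stored under a tls.crt key):
~ % kubectl -n linkerd-viz get secret tap-k8s-tls tap-injector-k8s-tls
~ % kubectl -n linkerd-viz get secret tap-k8s-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates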