
TLS handshake timeout in identity pod

tensor5 opened this issue 2 years ago

Is there an existing issue for this?

  • [X] I have searched the existing issues

What is the issue?

I'm trying to install Linkerd using the Helm chart, but the linkerd-identity pod crashes, and its log shows

time="2021-12-11T16:04:23Z" level=info msg="running version edge-21.12.2"
time="2021-12-11T16:04:33Z" level=fatal msg="Failed to initialize identity service: Post \"https://10.32.0.1:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": net/http: TLS handshake timeout"

The other two pods, linkerd-destination and linkerd-proxy-injector, are stuck in ContainerCreating.

I also tried the stable branch, v2.11.1, with and without the Helm chart, but the result is the same.
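
A quick way to see why those pods are stuck (assuming the default linkerd namespace) is to look at their events:

kubectl -n linkerd get pods
# the Events section at the bottom usually explains ContainerCreating (e.g. a CNI sandbox error)
kubectl -n linkerd describe pod -l linkerd.io/control-plane-component=destination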

How can it be reproduced?

helm install \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key \
  linkerd/linkerd2

Logs, error output, etc

time="2021-12-11T16:04:23Z" level=info msg="running version edge-21.12.2"
time="2021-12-11T16:04:33Z" level=fatal msg="Failed to initialize identity service: Post \"https://10.32.0.1:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": net/http: TLS handshake timeout"

output of linkerd check -o short

Linkerd core checks
===================

linkerd-existence
-----------------
× control plane pods are ready
    No running pods for "linkerd-destination"
    see https://linkerd.io/2.11/checks/#l5d-api-control-ready for hints

Status check results are ×

Environment

  • Kubernetes version: 1.23.0 (and 1.22.3)
  • Cluster environment: Scaleway
  • Host OS: Linux
  • Linkerd version: 21.12.2 (and 2.11.1)

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

tensor5 avatar Dec 11 '21 16:12 tensor5

@tensor5 How did you create the CA and issuer certs that you are installing with? It sounds like there may be an issue with their creation, which is why the identity controller is failing to start up.
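
For reference, the Linkerd docs generate a compatible trust anchor and issuer with the step CLI, roughly as follows (file names chosen to match the --set-file flags above):

# trust anchor (ca.crt / ca.key)
step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca --no-password --insecure
# issuer certificate and key, signed by the trust anchor
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca ca.crt --ca-key ca.key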

kleimkuhler avatar Dec 13 '21 19:12 kleimkuhler

It's also possible that there is an issue with connecting to the k8s API. The log line Failed to initialize identity service: Post \"https://10.32.0.1:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": net/http: TLS handshake timeout indicates a timeout, which could be the issue here.

Do you have a way to confirm Pods on your cluster can communicate with the k8s API successfully? If you uninstall the Linkerd resources, are there any warnings/errors with linkerd check --pre?
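
One quick check (using the API server address from the error message) is to run a throwaway curl pod and see whether the TLS handshake completes at all:

# a fast response, even a 401/403, means the handshake works; a hang reproduces the timeout
kubectl run api-check --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -vk https://10.32.0.1:443/version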

kleimkuhler avatar Dec 14 '21 17:12 kleimkuhler

linkerd check --pre is all green.

This is the log of the linkerd-proxy container in the linkerd-identity pod:

...
[ 259.035640s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 259.537692s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 260.004001s] WARN ThreadId(01) policy:watch{port=8080}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN
[ 260.039702s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 260.541779s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 261.043786s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 261.545786s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 262.047780s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 262.549805s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
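
(For context: the Connection refused warnings are most likely just fallout from the identity container itself having exited, since the proxy dials the identity gRPC server on 127.0.0.1:8080, and the missing SRV record reflects linkerd-destination still being stuck in ContainerCreating. Assuming the default linkerd namespace, the two containers' logs can be pulled separately:)

# proxy sidecar log (the warnings above)
kubectl logs -n linkerd deploy/linkerd-identity -c linkerd-proxy
# identity container log (the fatal TLS handshake timeout quoted in the issue)
kubectl logs -n linkerd deploy/linkerd-identity -c identity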

Switching from Cilium to Calico in the Scaleway cluster configuration solves the problem.

tensor5 avatar Dec 18 '21 01:12 tensor5

@tensor5 This sounds like it's likely a Cilium configuration issue. It appears the controllers were unable to contact the Kubernetes API server. I'm not sure what we can change in Linkerd to address this.

olix0r avatar Dec 21 '21 17:12 olix0r

Linkerd was working fine for me before in another cluster using Calico. I'm experiencing the exact same issue on a new cluster with Cilium. The identity pod is unable to connect to the API, yet all the other pods that need the kube-api can make calls and receive responses fine; Linkerd is the only one that can't.

I had a look at https://github.com/linkerd/linkerd2/issues/6246, but I don't know what I should try.

I've also checked out https://github.com/linkerd/linkerd2/issues/6238, and there has been progress on the Cilium side, as the awaited PR is now merged. However, this setting doesn't seem to work. The Cilium docs only mention that it's designed for Istio, so I'm not sure it covers everything Linkerd needs.
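
(The setting isn't named here; if it is the socket-level load-balancer bypass described in Cilium's service-mesh docs, it is enabled via Helm roughly like this on recent Cilium versions. The value name has moved between releases, so treat this as a sketch:)

# keep Cilium's socket LB out of pod namespaces so the Linkerd proxy sees the original service address
helm upgrade cilium cilium/cilium -n kube-system \
  --reuse-values \
  --set socketLB.hostNamespaceOnly=true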

cortopy avatar Jan 14 '22 18:01 cortopy

Environment:

  • Kubernetes 1.20.6
  • Cilium 1.9.6 (hostServices.enabled=false)
  • Linkerd (skip port 6443)

I have the same issue. How can I work around it and make it work? Do I need to wait for 2.12.0?

sockyone avatar Feb 08 '22 18:02 sockyone

Cross-referencing #7786.

ghouscht avatar Feb 24 '22 15:02 ghouscht

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 25 '22 18:05 stale[bot]

We recently closed #9817, which is a potential fix for this issue. If you are able to confirm that fix, that would be helpful. We'll keep this issue open for a little bit longer.

kleimkuhler avatar Jan 05 '23 22:01 kleimkuhler