
TLS handshake timeout in identity pod

tensor5 opened this issue 2 years ago

Is there an existing issue for this?

  • [X] I have searched the existing issues

What is the issue?

I'm trying to install Linkerd using the Helm chart, but the linkerd-identity pod crashes, and its log shows

time="2021-12-11T16:04:23Z" level=info msg="running version edge-21.12.2"
time="2021-12-11T16:04:33Z" level=fatal msg="Failed to initialize identity service: Post \"https://10.32.0.1:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": net/http: TLS handshake timeout"

The other two pods, linkerd-destination and linkerd-proxy-injector, are stuck in ContainerCreating.

I also tried the stable branch, v2.11.1, with and without the Helm chart, but the result is the same.
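
A quick way to see why those pods are stuck (assuming the default linkerd namespace) is to look at their events:

kubectl -n linkerd get pods
# the Events section at the bottom usually explains ContainerCreating (e.g. a CNI sandbox error)
kubectl -n linkerd describe pod -l linkerd.io/control-plane-component=destination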

How can it be reproduced?

helm install \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key \
  linkerd/linkerd2

Logs, error output, etc

time="2021-12-11T16:04:23Z" level=info msg="running version edge-21.12.2"
time="2021-12-11T16:04:33Z" level=fatal msg="Failed to initialize identity service: Post \"https://10.32.0.1:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": net/http: TLS handshake timeout"

output of linkerd check -o short

Linkerd core checks
===================

linkerd-existence
-----------------
× control plane pods are ready
    No running pods for "linkerd-destination"
    see https://linkerd.io/2.11/checks/#l5d-api-control-ready for hints

Status check results are ×

Environment

  • Kubernetes version: 1.23.0 (and 1.22.3)
  • Cluster environment: Scaleway
  • Host OS: Linux
  • Linkerd version: 21.12.2 (and 2.11.1)

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

tensor5 avatar Dec 11 '21 16:12 tensor5

@tensor5 How did you create the CA and issuer certs that you are installing with? It sounds like there may be an issue with their creation, which is why the identity controller is failing to start up.
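
For reference, the Linkerd docs generate a compatible trust anchor and issuer with the step CLI, roughly as follows (file names chosen to match the --set-file flags above):

# trust anchor (ca.crt / ca.key)
step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca --no-password --insecure
# issuer certificate and key, signed by the trust anchor
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca ca.crt --ca-key ca.key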

kleimkuhler avatar Dec 13 '21 19:12 kleimkuhler

It's also possible that there is an issue with connecting to the k8s API. The log line Failed to initialize identity service: Post \"https://10.32.0.1:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews\": net/http: TLS handshake timeout indicates a timeout, which could be the issue here.

Do you have a way to confirm Pods on your cluster can communicate with the k8s API successfully? If you uninstall the Linkerd resources, are there any warnings/errors with linkerd check --pre?
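
One quick check (using the API server address from the error message) is to run a throwaway curl pod and see whether the TLS handshake completes at all:

# a fast response, even a 401/403, means the handshake works; a hang reproduces the timeout
kubectl run api-check --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -vk https://10.32.0.1:443/version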

kleimkuhler avatar Dec 14 '21 17:12 kleimkuhler

linkerd check --pre is all green.

This is the log of the linkerd-proxy container in the linkerd-identity pod:

...
[ 259.035640s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 259.537692s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 260.004001s] WARN ThreadId(01) policy:watch{port=8080}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN
[ 260.039702s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 260.541779s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 261.043786s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 261.545786s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 262.047780s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 262.549805s] WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
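
(For context: the Connection refused warnings are most likely just fallout from the identity container itself having exited, since the proxy dials the identity gRPC server on 127.0.0.1:8080, and the missing SRV record reflects linkerd-destination still being stuck in ContainerCreating. Assuming the default linkerd namespace, the two containers' logs can be pulled separately:)

# proxy sidecar log (the warnings above)
kubectl logs -n linkerd deploy/linkerd-identity -c linkerd-proxy
# identity container log (the fatal TLS handshake timeout quoted in the issue)
kubectl logs -n linkerd deploy/linkerd-identity -c identity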

Switching from Cilium to Calico in the Scaleway cluster configuration solves the problem.

tensor5 avatar Dec 18 '21 01:12 tensor5

@tensor5 This sounds like it's likely a Cilium configuration issue. It appears the controllers were unable to contact the Kubernetes API server. I'm not sure what we can change in Linkerd to address this.

olix0r avatar Dec 21 '21 17:12 olix0r

Linkerd was working fine for me before in another cluster using Calico. I'm experiencing the exact same issue on a new cluster with Cilium. The identity pod is unable to connect to the API, yet all the other pods that need the kube-api can make calls and receive responses fine; Linkerd is the only one that can't.

I had a look at https://github.com/linkerd/linkerd2/issues/6246, but I don't know what I should try.

I've also checked out https://github.com/linkerd/linkerd2/issues/6238, and there has been progress on the Cilium side, as the awaited PR is now merged. However, this setting doesn't seem to work. The Cilium docs only mention that it's designed for Istio, so I'm not sure it covers everything Linkerd needs.
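
(The setting isn't named here; if it is the socket-level load-balancer bypass described in Cilium's service-mesh docs, it is enabled via Helm roughly like this on recent Cilium versions. The value name has moved between releases, so treat this as a sketch:)

# keep Cilium's socket LB out of pod namespaces so the Linkerd proxy sees the original service address
helm upgrade cilium cilium/cilium -n kube-system \
  --reuse-values \
  --set socketLB.hostNamespaceOnly=true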

cortopy avatar Jan 14 '22 18:01 cortopy

Environment:

  • Kubernetes 1.20.6
  • Cilium 1.9.6 (hostServices.enabled=false)
  • Linkerd (skip port 6443)

I have the same issue. How can I work around it and make it work? Do I need to wait for 2.12.0?

sockyone avatar Feb 08 '22 18:02 sockyone

Cross-referencing #7786.

ghouscht avatar Feb 24 '22 15:02 ghouscht

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 25 '22 18:05 stale[bot]

We recently closed #9817, which is a potential fix for this issue. If you are able to confirm that fix, that would be helpful. We'll keep this issue open for a little bit longer.

kleimkuhler avatar Jan 05 '23 22:01 kleimkuhler