
linkerd2 (2.11) control plane pod failure on k8s 1.21

Open bothra90 opened this issue 3 years ago • 15 comments

What is the issue?

When installing linkerd2 (version 2.11) on k8s 1.21 (EKS running on AWS) the control plane services fail to come up.

How can it be reproduced?

I'm installing linkerd2 via helm here, passing the manually generated certs/keys as flags to helm.

The same setup has worked for us when running linkerd2 version 2.9 on k8s 1.18 and 1.19.

Logs, error output, etc

; k logs pods/linkerd-destination-6b4bfb9f87-hpvg4 -n linkerd linkerd-proxy
time="2022-01-28T18:13:19Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2022-01-28T18:13:19Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[     0.001141s] ERROR ThreadId(01) linkerd_app::env: Could not read LINKERD2_PROXY_IDENTITY_TOKEN_FILE: Permission denied (os error 13)
[     0.001176s] ERROR ThreadId(01) linkerd_app::env: LINKERD2_PROXY_IDENTITY_TOKEN_FILE="/var/run/secrets/kubernetes.io/serviceaccount/token" is not valid: InvalidTokenSource
Invalid configuration: invalid environment variable

output of linkerd check -o short

Linkerd core checks
===================

linkerd-existence
-----------------
\ pod/linkerd-destination-6b4bfb9f87-hpvg4 container sp-validator is not ready

Environment

Kubernetes: 1.21 Host Env: EKS/AWS Linkerd version: 2.11 HostOs: Amazon Linux2

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

bothra90 avatar Jan 28 '22 18:01 bothra90

@bothra90 It looks like that pod doesn't have a service account token so it can't authenticate to obtain its identity certificate.

Can you confirm that these resources exist:

:; k get sa -n linkerd
NAME                     SECRETS   AGE
default                  1         2d12h
linkerd-identity         1         2d12h
linkerd-destination      1         2d12h
linkerd-heartbeat        1         2d12h
linkerd-proxy-injector   1         2d12h
:; k get secret -n linkerd --field-selector 'type==kubernetes.io/service-account-token'
NAME                                 TYPE                                  DATA   AGE
default-token-sbq77                  kubernetes.io/service-account-token   3      2d12h
linkerd-identity-token-2kmvz         kubernetes.io/service-account-token   3      2d12h
linkerd-destination-token-dzjcl      kubernetes.io/service-account-token   3      2d12h
linkerd-heartbeat-token-kbdhf        kubernetes.io/service-account-token   3      2d12h
linkerd-proxy-injector-token-v7j8l   kubernetes.io/service-account-token   3      2d12h

olix0r avatar Jan 28 '22 18:01 olix0r

@olix0r : yes, confirmed that all secrets and service accounts exist

bothra90 avatar Jan 28 '22 19:01 bothra90

I also see that the token is mountable by the linkerd-destination service account.

; k describe -n linkerd sa/linkerd-destination
Name:                linkerd-destination
Namespace:           linkerd
Labels:              app.kubernetes.io/managed-by=pulumi
                     linkerd.io/control-plane-component=destination
                     linkerd.io/control-plane-ns=linkerd
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   linkerd-destination-token-fm866
Tokens:              linkerd-destination-token-fm866
Events:              <none>

bothra90 avatar Jan 28 '22 19:01 bothra90

A related issue that I found on kubernetes: https://github.com/kubernetes/kubernetes/issues/82573. Let me know if you think that could explain what I'm seeing as well.

bothra90 avatar Jan 28 '22 19:01 bothra90

@bothra90 Yeah, that sounds plausible. The Linkerd project doesn't currently have any EKS credits, so I can't confirm this for myself; but it sounds likely, since the proxy runs under a non-root UID (2102). I'm not sure why this problem wouldn't manifest in prior linkerd versions, though.

olix0r avatar Jan 28 '22 19:01 olix0r

Indeed, applying the same fix as https://github.com/metallb/metallb/commit/d36e8dd4caa4f0c768c898fcf6eefd353ba55547 to linkerd2 pod configs resolves the issue for me.
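(For anyone else hitting this: the metallb fix referenced above boils down to setting a pod-level `fsGroup` so that projected volume files are created group-readable for the proxy. A minimal sketch, assuming the non-root proxy UID 2102 mentioned above; the pod name and image here are hypothetical, and your manifests may use different values:)

```yaml
# Hedged sketch: pod-level securityContext so projected token files are
# created group-readable by the proxy's group (2102, per the comment above).
apiVersion: v1
kind: Pod
metadata:
  name: example-injected-pod   # hypothetical name
spec:
  securityContext:
    fsGroup: 2102              # match the linkerd-proxy's non-root UID/GID
  containers:
    - name: app
      image: example/app:latest  # hypothetical image
```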

bothra90 avatar Jan 28 '22 19:01 bothra90

We've recently added support for projected bounded service account tokens in https://github.com/linkerd/linkerd2/pull/7117, though I'm not sure if that will actually resolve this issue. Are you able to test the latest edge release? https://deploy-preview-1244--linkerdio.netlify.app/2.12/tasks/install-helm/#adding-linkerd-s-helm-repository
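(For context, a bounded service account token is mounted through a `projected` volume with an explicit `expirationSeconds`, which the kubelet rotates before expiry. A rough sketch of what such a volume looks like; this is illustrative only, not Linkerd's exact generated manifest, and the `path`/`audience` values are assumptions:)

```yaml
# Illustrative projected, bounded service account token volume.
volumes:
  - name: linkerd-identity-token
    projected:
      sources:
        - serviceAccountToken:
            path: linkerd-identity-token
            expirationSeconds: 86400     # kubelet refreshes before expiry
            audience: identity.l5d.io    # assumed audience value
```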

It may be worth applying the same workaround in the Linkerd config, but it would be good to confirm whether the new functionality still needs it.

olix0r avatar Jan 28 '22 20:01 olix0r

@olix0r: Sorry, I wasn't able to test the edge release. Will leave it up to you to decide what to do with this issue.

bothra90 avatar Jan 29 '22 04:01 bothra90

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 03 '22 04:05 stale[bot]

I received an email from Amazon about service accounts attached to pods in one or more of your EKS clusters using stale (older than 1 hour) tokens. The related service account was linkerd-destination. Could this be related to #7117?

I'm running 2.11.1 in my EKS clusters. It looks like 2.11.2 is available but doesn't yet include the changes from #7117.

More info from the email:

Kubernetes version 1.21 graduated BoundServiceAccountTokenVolume feature [1] to beta and enabled it by default. This feature improves the security of service account tokens by requiring a one-hour expiry time, over the previous default of no expiration. This means that applications that do not refetch service account tokens periodically will receive an HTTP 401 unauthorized error response on requests to Kubernetes API server with expired tokens.
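(Since the email is about stale tokens, one way to check whether an application is holding an expired token is to decode the `exp` claim from the token's JWT payload. A small self-contained sketch; the token below is fabricated for illustration only:)

```python
import base64
import json

def token_expiry(jwt: str) -> int:
    """Return the `exp` claim (seconds since epoch) from a JWT's payload."""
    payload_b64 = jwt.split(".")[1]
    # JWT payloads use unpadded base64url; re-add padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))["exp"]

# Fabricated example token (header.payload.signature) for demonstration.
payload = base64.urlsafe_b64encode(
    json.dumps({"exp": 1653055200}).encode()).rstrip(b"=").decode()
demo = f"eyJhbGciOiJSUzI1NiJ9.{payload}.sig"
print(token_expiry(demo))  # prints 1653055200
```

Comparing that value against the current time tells you whether the token in hand has already expired and needs to be re-read from disk.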

epinzur avatar May 20 '22 14:05 epinzur

@rltvty As far as I understand, the report you've received isn't actually about #7117. I believe it's reporting that the policy controller in 2.11.1 doesn't reload its service account tokens as they are rotated. This was fixed in 2.11.2 (via https://github.com/kube-rs/kube-rs/commit/cb2a3d901b1eefee75d755600994a77e679f6aa9).

olix0r avatar May 20 '22 14:05 olix0r

@olix0r thanks for the quick reply. I'll try 2.11.2 to see if this fixes our issue. If it doesn't, I'll create a new issue.

epinzur avatar May 20 '22 15:05 epinzur

@olix0r looks like the upgrade fixes the issue. thanks again!

epinzur avatar May 20 '22 17:05 epinzur

Hi, even with #7117 (using 30.1.4-edge) I'm seeing the same issue in the injected linkerd-proxy container:

Message:     time="2022-06-14T23:28:47Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2022-06-14T23:28:47Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[     0.012785s] ERROR ThreadId(01) linkerd_app::env: Could not read LINKERD2_PROXY_IDENTITY_TOKEN_FILE: Permission denied (os error 13)
[     0.012834s] ERROR ThreadId(01) linkerd_app::env: LINKERD2_PROXY_IDENTITY_TOKEN_FILE="/var/run/secrets/tokens/linkerd-identity-token" is not valid: InvalidTokenSource
Invalid configuration: invalid environment variable

It seems the token file is still only root-readable:

ls -la /var/run/secrets/tokens/..data/linkerd-identity-token
-rw------- 1 root root 1021 Jun 14 05:29 /var/run/secrets/tokens/..data/linkerd-identity-token

Is setting fsGroup on every single pod we expect linkerd to inject the only solution?

jonathanasdf avatar Jun 14 '22 23:06 jonathanasdf

@jonathanasdf We should probably look into setting the fsGroup from the injector.


Some questions we'll need to answer:

  • Can we replicate this configuration in k3d? Or is this only reproducible in EKS? If the latter, we may need https://github.com/cncf/credits/issues/8 to verify the change.
  • Are there security implications to setting the fsGroup? Will this conflict with PSPs etc? Basically: can we always do this when using projected tokens or does this need to be a separate configuration?

olix0r avatar Jun 15 '22 20:06 olix0r

The serviceAccountToken file mode being hard-coded to 0600 (kubernetes/kubernetes#82573) was fixed in k8s 1.19. Since then, the file mode is 0644 when no fsGroup is set (see the fix here). So it appears the issue here stemmed from using linkerd's new token volume projection on a pre-1.19 k8s?

Closing this one out, please reopen if you still experience this issue under later linkerd/k8s versions.

alpeb avatar Mar 10 '23 16:03 alpeb