linkerd2 (2.11) control plane pod failure on k8s 1.21
What is the issue?
When installing linkerd2 (version 2.11) on k8s 1.21 (EKS running on AWS) the control plane services fail to come up.
How can it be reproduced?
I'm installing linkerd2 via helm, passing the manually generated certs/keys to helm as flags (roughly as sketched below).
The same setup has worked for us when running linkerd2 version 2.9 on k8s 1.18 and 1.19.
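Roughly, the install looks like this (a sketch following the Linkerd 2.11 Helm install docs; the cert/key file names are placeholders rather than our exact values):

# Sketch of a Helm install of Linkerd 2.11 with manually generated mTLS
# credentials, per the 2.11 Helm docs; file names are placeholders.
helm repo add linkerd https://helm.linkerd.io/stable
helm install linkerd2 linkerd/linkerd2 \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key \
  --set identity.issuer.crtExpiry=$(date -d '+8760 hour' +"%Y-%m-%dT%H:%M:%SZ")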
Logs, error output, etc
; k logs pods/linkerd-destination-6b4bfb9f87-hpvg4 -n linkerd linkerd-proxy
time="2022-01-28T18:13:19Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2022-01-28T18:13:19Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.001141s] ERROR ThreadId(01) linkerd_app::env: Could not read LINKERD2_PROXY_IDENTITY_TOKEN_FILE: Permission denied (os error 13)
[ 0.001176s] ERROR ThreadId(01) linkerd_app::env: LINKERD2_PROXY_IDENTITY_TOKEN_FILE="/var/run/secrets/kubernetes.io/serviceaccount/token" is not valid: InvalidTokenSource
Invalid configuration: invalid environment variable
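To confirm this is a file-permission problem rather than a missing mount, the token can be inspected from inside the pod (a sketch; it assumes the pod name from the logs above and a container that stays up long enough to exec into):

# Check the mounted token's mode and ownership as seen from the pod;
# the proxy runs as a non-root user, so -rw------- root root would
# explain the 'Permission denied (os error 13)' above.
kubectl exec -n linkerd linkerd-destination-6b4bfb9f87-hpvg4 -c linkerd-proxy -- \
  ls -la /var/run/secrets/kubernetes.io/serviceaccount/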
output of linkerd check -o short
Linkerd core checks
===================
linkerd-existence
-----------------
pod/linkerd-destination-6b4bfb9f87-hpvg4 container sp-validator is not ready
Environment
Kubernetes: 1.21
Host Env: EKS/AWS
Linkerd version: 2.11
Host OS: Amazon Linux 2
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
@bothra90 It looks like that pod doesn't have a service account token so it can't authenticate to obtain its identity certificate.
Can you confirm that these resources exist:
:; k get sa -n linkerd
NAME SECRETS AGE
default 1 2d12h
linkerd-identity 1 2d12h
linkerd-destination 1 2d12h
linkerd-heartbeat 1 2d12h
linkerd-proxy-injector 1 2d12h
:; k get secret -n linkerd --field-selector 'type==kubernetes.io/service-account-token'
NAME TYPE DATA AGE
default-token-sbq77 kubernetes.io/service-account-token 3 2d12h
linkerd-identity-token-2kmvz kubernetes.io/service-account-token 3 2d12h
linkerd-destination-token-dzjcl kubernetes.io/service-account-token 3 2d12h
linkerd-heartbeat-token-kbdhf kubernetes.io/service-account-token 3 2d12h
linkerd-proxy-injector-token-v7j8l kubernetes.io/service-account-token 3 2d12h
@olix0r : yes, confirmed that all secrets and service accounts exist
I also see that the token is mountable by the linkerd-destination service account.
; k describe -n linkerd sa/linkerd-destination
Name: linkerd-destination
Namespace: linkerd
Labels: app.kubernetes.io/managed-by=pulumi
linkerd.io/control-plane-component=destination
linkerd.io/control-plane-ns=linkerd
Annotations: <none>
Image pull secrets: <none>
Mountable secrets: linkerd-destination-token-fm866
Tokens: linkerd-destination-token-fm866
Events: <none>
A related issue that I found on kubernetes: https://github.com/kubernetes/kubernetes/issues/82573. Let me know if you think that could explain what I'm seeing as well.
@bothra90 Yeah, that sounds plausible. The Linkerd project doesn't currently have any EKS credits, so I can't confirm this for myself; but it sounds likely, since the proxy runs under a non-root UID (2102). I'm not sure why this problem wouldn't manifest in prior linkerd versions, though.
Indeed, applying the same fix as https://github.com/metallb/metallb/commit/d36e8dd4caa4f0c768c898fcf6eefd353ba55547 to linkerd2 pod configs resolves the issue for me.
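For anyone else hitting this, the substance of that fix is setting the pod-level fsGroup to the proxy's UID so the projected token becomes readable by its group. A sketch against a single control-plane deployment (the patch mechanism here is illustrative; the metallb commit bakes the setting into the manifests instead):

# Set fsGroup to the proxy's UID (2102) on the pod security context.
kubectl patch deploy linkerd-destination -n linkerd --type merge \
  -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup":2102}}}}}'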
We've recently added support for projected bounded service account tokens in https://github.com/linkerd/linkerd2/pull/7117, though I'm not sure if that will actually resolve this issue. Are you able to test the latest edge release? https://deploy-preview-1244--linkerdio.netlify.app/2.12/tasks/install-helm/#adding-linkerd-s-helm-repository
It may be worth applying the same workaround in the Linkerd config, but it would be good to first confirm whether the new functionality actually needs it.
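If you do test it, the edge channel has its own Helm repo; a sketch per the linked instructions (repo and chart names from the standard Linkerd Helm docs, with the same value flags as your existing install):

helm repo add linkerd-edge https://helm.linkerd.io/edge
helm repo update
helm install linkerd2 linkerd-edge/linkerd2 \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key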
@olix0r: Sorry, I wasn't able to test the edge release. I'll leave it up to you to decide what to do with this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
I received an email from Amazon about "service accounts attached to pods in one or more of your EKS clusters using stale (older than 1 hour) tokens". The service account in question was linkerd-destination. Could this be related to #7117?
I'm running 2.11.1 in my EKS clusters. It looks like 2.11.2 is available but doesn't yet include the changes from #7117.
More info from the email:
Kubernetes version 1.21 graduated BoundServiceAccountTokenVolume feature [1] to beta and enabled it by default. This feature improves the security of service account tokens by requiring a one-hour expiry time, over the previous default of no expiration. This means that applications that do not refetch service account tokens periodically will receive an HTTP 401 unauthorized error response on requests to Kubernetes API server with expired tokens.
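Concretely, under this feature the kubelet mounts a projected, expiring token instead of the old non-expiring secret. A quick way to check on an affected pod (sketch; the pod name is a placeholder):

# Look for a serviceAccountToken projection with expirationSeconds in
# the pod spec; 3607 seconds is the one-hour bound plus jitter.
kubectl get pod <linkerd-destination-pod> -n linkerd -o yaml \
  | grep -B1 -A2 serviceAccountToken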
@rltvty As far as I understand, the report you've received isn't actually about #7117. I believe it's reporting that the policy controller in 2.11.1 doesn't reload its service account tokens as they are rotated. This was fixed in 2.11.2 (via https://github.com/kube-rs/kube-rs/commit/cb2a3d901b1eefee75d755600994a77e679f6aa9).
@olix0r thanks for the quick reply. I'll try 2.11.2 to see if this fixes our issue. If it doesn't, I'll create a new issue.
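A sketch of the upgrade, assuming the release and chart names match the original Helm install and that the chart version tracks the 2.11.2 release:

# Upgrade in place, keeping previously supplied values (certs etc.).
helm repo update
helm upgrade linkerd2 linkerd/linkerd2 --version 2.11.2 --reuse-values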
@olix0r looks like the upgrade fixes the issue. thanks again!
Hi, even with #7117 (using 30.1.4-edge) I'm seeing the same issue in the injected linkerd-proxy container:
Message: time="2022-06-14T23:28:47Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2022-06-14T23:28:47Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.012785s] ERROR ThreadId(01) linkerd_app::env: Could not read LINKERD2_PROXY_IDENTITY_TOKEN_FILE: Permission denied (os error 13)
[ 0.012834s] ERROR ThreadId(01) linkerd_app::env: LINKERD2_PROXY_IDENTITY_TOKEN_FILE="/var/run/secrets/tokens/linkerd-identity-token" is not valid: InvalidTokenSource
Invalid configuration: invalid environment variable
It seems the token file is still only root-readable:
ls -la /var/run/secrets/tokens/..data/linkerd-identity-token
-rw------- 1 root root 1021 Jun 14 05:29 /var/run/secrets/tokens/..data/linkerd-identity-token
Is setting fsGroup on every single pod we expect linkerd to inject the only solution?
@jonathanasdf We should probably look into setting the fsGroup from the injector.
Some questions we'll need to answer:
- Can we replicate this configuration in k3d? (See the repro sketch below.) Or is this only reproducible in EKS? If the latter, we may need https://github.com/cncf/credits/issues/8 to verify the change.
- Are there security implications to setting the fsGroup? Will this conflict with PSPs etc.? Basically: can we always do this when using projected tokens, or does it need to be a separate configuration? (See the policy sketch below.)
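On the first question, a minimal repro attempt outside EKS might look like this (the k3s image tag is assumed; k3s images track upstream k8s versions):

# Spin up a local 1.21 cluster and install linkerd to see whether the
# token permissions behave the same way off EKS.
k3d cluster create linkerd-repro --image rancher/k3s:v1.21.14-k3s1
linkerd install | kubectl apply -f -

On the second, a policy can pin fsGroup to a range, in which case an injector that unconditionally sets fsGroup: 2102 would be rejected at admission. A hypothetical PSP illustrating the conflict:

# Hypothetical restrictive policy; an injected fsGroup of 2102 falls
# outside the allowed range, so the pod would be rejected.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: fsgroup-range-example
spec:
  privileged: false
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: MustRunAs
    ranges:
    - min: 1000
      max: 2000
  volumes:
  - '*'
EOF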
The serviceAccountToken file mode being hard-coded to 0600 (kubernetes/kubernetes#82573) was fixed in k8s 1.19. After that fix, the file mode became 0644 when no fsGroup was set (see the upstream fix). So it appears the issue here stemmed from using linkerd's new token volume projection on a pre-1.19 k8s cluster? (A quick check is sketched below.)
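For reference, a quick way to see which behavior a given cluster exhibits (sketch; the pod name is a placeholder and stat must be available in the image):

# With no fsGroup set: 600 on kubelets predating the 1.19 fix, 644
# afterwards; ownership is 0:0 either way.
kubectl exec -n linkerd <injected-pod> -c linkerd-proxy -- \
  stat -c '%a %u:%g' /var/run/secrets/tokens/..data/linkerd-identity-token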
Closing this one out; please reopen if you still experience this issue under later linkerd/k8s versions.