metrics-server
AWS EKS 1.21: stale service account token
What happened: We are running metrics-server on AWS EKS 1.21 and have received the mail below from AWS (irrelevant parts removed).
Hello,
We have identified applications running in one or more of your Amazon EKS clusters that are not refreshing
service account tokens. Applications making requests to Kubernetes API server with expired tokens will fail.
You can resolve the issue by updating your application and its dependencies to use newer versions of
Kubernetes client SDK that automatically refreshes the tokens.
What is the problem?
Kubernetes version 1.21 graduated BoundServiceAccountTokenVolume feature [1] to beta and enabled it by default.
This feature improves security of service account tokens by requiring a one hour expiry time, over the previous
default of no expiration. This means that applications that do not refetch service account tokens periodically
will receive an HTTP 401 unauthorized error response on requests to Kubernetes API server with expired tokens.
You can learn more about the BoundServiceAccountToken feature in EKS Kubernetes 1.21 release notes [2].
To enable a smooth migration of applications to the newer time-bound service account tokens, EKS v1.21+ extends
the lifetime of service account tokens to 90 days. Applications on EKS v1.21+ clusters that make API server
requests with tokens that are older than 90 days will receive an HTTP 401 unauthorised error response.
How can you resolve the issue?
To make the transition to time bound service account tokens easier, Kubernetes has updated the below official
versions of client SDKs to automatically refetch tokens before the one hour expiration:
* Go v0.15.7 and later
* Python v12.0.0 and later
* Java v9.0.0 and later
* Javascript v0.10.3 and later
* Ruby master branch
* Haskell v0.3.0.0
We recommend that you update your application and its dependencies to use one of the above client SDK versions
if you are on an older version.
As of April 20th 2022, we have identified the below service accounts attached to pods in one or more of your
EKS clusters using stale (older than 1 hour) tokens. Service accounts are listed in the format:
<eks-cluster-arn>|<namespace:serviceaccount>
...
arn:aws:eks:eu-west-1:xxx:cluster/ncc-1031|kube-system:metrics-server
...
We recommend that you update your applications and their dependencies that are using stale service account tokens to use one of the newer Kubernetes client SDKs that refetch tokens.
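For context (this is not part of the AWS mail): on Kubernetes 1.21+ the kubelet mounts the service account token through a projected volume and rotates the token file before it expires, so a client SDK only has to re-read that file (as newer client-go versions do) to avoid presenting a stale token. The auto-generated volume looks roughly like the sketch below; the volume name and expirationSeconds value shown here are illustrative.

# Sketch of the projected volume Kubernetes 1.21+ generates for a pod's
# service account token (illustrative values; check your own pod spec).
volumes:
  - name: kube-api-access
    projected:
      sources:
        - serviceAccountToken:
            expirationSeconds: 3607   # kubelet rotates the file well before this
            path: token
        - configMap:
            name: kube-root-ca.crt
            items:
              - key: ca.crt
                path: ca.crt
        - downwardAPI:
            items:
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace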
Environment:
- Kubernetes distribution: EKS 1.21
- Kubernetes version: v1.21.9-eks-0d102a7
Helm release:
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  releaseName: metrics-server
  chart:
    repository: https://kubernetes-sigs.github.io/metrics-server/
    name: metrics-server
    version: 3.8.2
  values:
    podAnnotations:
      supplystack.io/env: cluster
cc @stevehipwell Can you help identify what version of Metrics Server is running with the following Helm release?
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  releaseName: metrics-server
  chart:
    repository: https://kubernetes-sigs.github.io/metrics-server/
    name: metrics-server
    version: 3.8.2
  values:
    podAnnotations:
      supplystack.io/env: cluster
@serathius it should be 0.6.1:
https://github.com/kubernetes-sigs/metrics-server/blob/abacf42babf4b4f623e992ff65761cd3902d0994/charts/metrics-server/values.yaml#L5-L9
and
https://github.com/kubernetes-sigs/metrics-server/blob/abacf42babf4b4f623e992ff65761cd3902d0994/charts/metrics-server/Chart.yaml#L6
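Roughly, those linked lines amount to the following (paraphrased from the chart defaults at that commit, so double-check against the links):

image:
  repository: k8s.gcr.io/metrics-server/metrics-server
  tag: ""   # an empty tag falls back to the chart's appVersion, 0.6.1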
Double-checked on the cluster in question:
k8s.gcr.io/metrics-server/metrics-server:v0.6.1
@serathius this is the latest version, v0.6.1, unless @BitProcessor has changed the image version manually?
AFAIK our EKS token reports haven't shown Metrics Server as an issue; we're only getting warned about Fluent Bit.
@serathius it's also worth pointing out that based on the client-go version Metrics Server should be working correctly.
@stevehipwell @serathius here's a similar issue: https://github.com/fluxcd/flux/issues/3610 (it was also listed in the above report, btw; I removed it because it's irrelevant for this repo).
There seems to be more to it than just the SDK version, see https://github.com/fluxcd/flux/issues/3610#issuecomment-1125217274
@hjflz8821: @BitProcessor I received the same warning. Have you solved this problem? Are there any updates, and is this related to Fluent Bit?
@hjflz8821 Fluent Bit wasn't using a Kubernetes SDK, so they needed to make a change to their custom code; if you're using the client-go SDK, no change should be needed.
What version of Metrics Server are you running and on which EKS version? Could you supply the part of the email related to Metrics Server?
@stevehipwell I'm using an older version (Metrics Server v0.3.6). I don't know if it's related to this, but my client judges that there is no need to update Metrics Server for this warning and that the issue is more likely related to fluent-bit. Here's the relevant part of the email:
We have identified applications running on one or more of your Amazon EKS clusters that are not refreshing service account tokens. Applications that use expired tokens to make requests to the Kubernetes API server will fail. This issue can be resolved by updating the application and its dependencies to use a newer version of the Kubernetes client SDK, which automatically refreshes tokens.
What is the problem?
Kubernetes version 1.21 graduated the BoundServiceAccountTokenVolume feature [1] to beta and enabled it by default. This feature improves the security of service account tokens by requiring a one-hour expiration, compared with the previous default of no expiration. This means that applications that do not periodically refetch service account tokens will make requests to the Kubernetes API server with an expired token and receive an HTTP 401 authentication failure error response. For more information on the BoundServiceAccountToken feature, see the EKS Kubernetes 1.21 release notes [2].
EKS v1.21 and later extends the service account token validity period to 90 days to facilitate a smooth migration of applications to the new time-limited service account tokens. Applications on EKS v1.21 and later clusters that make API server requests with tokens older than 90 days will receive an HTTP 401 authentication failure error response.
How can you resolve the issue?
To make the transition to time bound service account tokens easier, Kubernetes has updated the below official
versions of client SDKs to automatically refetch tokens before the one hour expiration:
* Go v0.15.7 and later
* Python v12.0.0 and later
* Java v9.0.0 and later
* Javascript v0.10.3 and later
* Ruby master branch
* Haskell v0.3.0.0
We recommend that you update your application and its dependencies to use one of the above client SDK versions
if you are on an older version.
Although not an exhaustive list, the following AWS components have been updated to use the new Kubernetes client SDK, which automatically refetches tokens:
* Amazon VPC CNI: v1.8.0 or later
* CoreDNS: v1.8.4 or later
* AWS Load Balancer Controller: v2.0.0 or later
* kube-proxy: v1.21.2-eksbuild.2 or later
How can I identify the service account that uses the old token?
<eks-cluster-arn>|<namespace:serviceaccount>
arn:aws:eks:xxxxx:xxxxxxxx:cluster/xxxxxxxxx|metrics-server:metrics-metrics-server
Use the CloudWatch Logs Insights query provided in the Troubleshooting section of the EKS documentation [7] to see the current list of pods using service accounts with old tokens and the time elapsed since the tokens were created.
We recommend that you update your applications that are using the old service account tokens, and their dependencies, to use one of the new Kubernetes client SDKs that refetch tokens.
If your service account token is about to expire (it has less than 90 days left) and you don't have enough time to update the client SDK version before it expires, you can delete the existing pods so that new ones are created. The new pods will fetch service account tokens that are valid for another 90 days, which gives you time to update the client SDK.
Environment:
- Kubernetes distribution: EKS 1.21
- aws-for-fluent-bit: 2.26.0 (current)
community.kubernetes.helm:
  name: metrics
  chart_ref: metrics-server/metrics-server
  release_namespace: metrics-server
  update_repo_cache: true
  chart_version: 2.11.2
  atomic: true
  create_namespace: true
  kubeconfig: '{{ k8s_kubeconfig }}'
  context: '{{ eks_cluster_context }}'
  values: "{{ lookup('template', 'k8-charts/metrics-server/values.yml') | from_yaml }}"
  state: "{{ state | default('present') }}"
tags:
  - metrics-server
This is part of the CloudWatch log; I'm not sure if it makes sense:
subject: system:serviceaccount:metrics-server:metrics-metrics-server, seconds after warning threshold: 5323328
RBAC: allowed by ClusterRoleBinding "metrics-metrics-server:system:auth-delegator" of ClusterRole "system:auth-delegator" to ServiceAccount "metrics-metrics-server/metrics-server"
@hjflz8821 the version of Metrics Server you're running is 3 years old and built against a version of client-go which predates the refresh token implementation.
@stevehipwell Thanks a lot for your answer
So if I upgrade to v0.6.1, it will most likely solve this problem, because the latest version ships with a newer version of client-go?
@hjflz8821 that would be my expectation.
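As a sketch of what that upgrade could look like with the same Ansible module (the repo name, values template and kubeconfig variables are carried over from your task above as assumptions; chart 3.8.2 is the version that ships v0.6.1 per the earlier comments, and whether a 2.x-to-3.x chart jump needs value changes is something to verify):

- name: Add the kubernetes-sigs metrics-server chart repo
  community.kubernetes.helm_repository:
    name: metrics-server
    repo_url: https://kubernetes-sigs.github.io/metrics-server/

- name: Upgrade metrics-server to chart 3.8.2 (metrics-server v0.6.1)
  community.kubernetes.helm:
    name: metrics
    chart_ref: metrics-server/metrics-server
    release_namespace: metrics-server
    chart_version: 3.8.2
    update_repo_cache: true
    atomic: true
    kubeconfig: '{{ k8s_kubeconfig }}'
    context: '{{ eks_cluster_context }}'
    values: "{{ lookup('template', 'k8-charts/metrics-server/values.yml') | from_yaml }}"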
We were also running a very old metrics-server in our case, and have been tracking this issue after upgrading. But I'm not sure how to verify that the new version actually handles refreshing. I mean we could wait 90 days and see if things start to die but I'd prefer not to 😄 I don't really expect AWS to send out another email either 🤔
@Flydiverny if you send the EKS control plane logs to CloudWatch it should show up there. Based on the AWS docs I'd expect to see it show up there after an hour.
Cheers :) I was a bit lazy, AWS provides some useful docs here: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens
Running the following CloudWatch Logs Insights query clearly shows if you have stale tokens in use!

fields @timestamp
| filter @logStream like /kube-apiserver-audit/
| filter @message like /seconds after warning threshold/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime
We are using metrics-server/metrics-server helm chart version 3.8.2 and it is working as intended! :)

Thanks @stevehipwell, learned something new 🎉
Edit: For completeness, we are also on EKS 1.21 (Server Version: v1.21.12-eks-a64ea69).
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.