
AWS EKS 1.21: stale service account token

Open BitProcessor opened this issue 3 years ago • 16 comments

What happened: We are running metrics-server on AWS EKS 1.21 and have received the email below from AWS (irrelevant parts removed).

Hello,

We have identified applications running in one or more of your Amazon EKS clusters that are not refreshing 
service account tokens. Applications making requests to Kubernetes API server with expired tokens will fail. 
You can resolve the issue by updating your application and its dependencies to use newer versions of 
Kubernetes client SDK that automatically refreshes the tokens. 

What is the problem?

Kubernetes version 1.21 graduated BoundServiceAccountTokenVolume feature [1] to beta and enabled it by default. 
This feature improves security of service account tokens by requiring a one hour expiry time, over the previous
default of no expiration. This means that applications that do not refetch service account tokens periodically 
will receive an HTTP 401 unauthorized error response on requests to Kubernetes API server with expired tokens. 
You can learn more about the BoundServiceAccountToken feature in EKS Kubernetes 1.21 release notes [2].

To enable a smooth migration of applications to the newer time-bound service account tokens, EKS v1.21+ extends
the lifetime of service account tokens to 90 days. Applications on EKS v1.21+ clusters that make API server
requests with tokens that are older than 90 days will receive an HTTP 401 unauthorized error response.

How can you resolve the issue?

To make the transition to time bound service account tokens easier, Kubernetes has updated the below official
versions of client SDKs to automatically refetch tokens before the one hour expiration:

* Go v0.15.7 and later
* Python v12.0.0 and later
* Java v9.0.0 and later
* Javascript v0.10.3 and later
* Ruby master branch
* Haskell v0.3.0.0

We recommend that you update your application and its dependencies to use one of the above client SDK versions
if you are on an older version. 

As of April 20th 2022, we have identified the below service accounts attached to pods in one or more of your
EKS clusters using stale (older than 1 hour) tokens. Service accounts are listed in the format:
<eks-cluster-arn>|<namespace:serviceaccount>

...
arn:aws:eks:eu-west-1:xxx:cluster/ncc-1031|kube-system:metrics-server
...


We recommend that you update your applications and their dependencies that are using stale service account
tokens to use one of the newer Kubernetes client SDKs that refetch tokens.
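
For context, with this feature the token is delivered to the pod via a projected volume roughly like the sketch below (illustrative only, using the upstream defaults; the volume name is generated and the exact expirationSeconds on EKS may differ). The kubelet rotates the token file in place, so a client that reads it only once at startup ends up presenting a stale token:

volumes:
  - name: kube-api-access-xxxxx        # generated name, illustrative
    projected:
      sources:
        - serviceAccountToken:
            expirationSeconds: 3607    # ~1 hour; clients must re-read the file after rotation
            path: token
        - configMap:
            name: kube-root-ca.crt
            items:
              - key: ca.crt
                path: ca.crt
        - downwardAPI:
            items:
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace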

Environment:

  • Kubernetes distribution: EKS 1.21
  • Kubernetes version: v1.21.9-eks-0d102a7

Helm release:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  releaseName: metrics-server
  chart:
    repository: https://kubernetes-sigs.github.io/metrics-server/
    name: metrics-server
    version: 3.8.2
  values:
    podAnnotations:
      supplystack.io/env: cluster

BitProcessor avatar May 18 '22 07:05 BitProcessor

cc @stevehipwell Can you help identify which version of Metrics Server is running with this HelmRelease?

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  releaseName: metrics-server
  chart:
    repository: https://kubernetes-sigs.github.io/metrics-server/
    name: metrics-server
    version: 3.8.2
  values:
    podAnnotations:
      supplystack.io/env: cluster

serathius avatar May 18 '22 08:05 serathius

@serathius it should be 0.6.1, see https://github.com/kubernetes-sigs/metrics-server/blob/abacf42babf4b4f623e992ff65761cd3902d0994/charts/metrics-server/values.yaml#L5-L9 and https://github.com/kubernetes-sigs/metrics-server/blob/abacf42babf4b4f623e992ff65761cd3902d0994/charts/metrics-server/Chart.yaml#L6

Double-checked on the cluster in question: k8s.gcr.io/metrics-server/metrics-server:v0.6.1

BitProcessor avatar May 18 '22 08:05 BitProcessor

@serathius this is the latest version, v0.6.1, unless @BitProcessor has changed the image version manually?

AFAIK our EKS token reports haven't shown Metrics Server as an issue; we're only getting warned about Fluent Bit.

stevehipwell avatar May 18 '22 08:05 stevehipwell

@serathius it's also worth pointing out that based on the client-go version Metrics Server should be working correctly.
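
As a rough illustration (a minimal sketch, not Metrics Server's actual code), any client built on a recent client-go that uses the standard in-cluster config gets the refresh for free: the transport re-reads the projected token file before it expires, so a long-running loop like this keeps working past the one-hour expiry.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config points at the projected token file at
	// /var/run/secrets/kubernetes.io/serviceaccount/token.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// Long-running loop: requests keep succeeding past the one-hour token
	// expiry because recent client-go re-reads the rotated token file.
	for {
		nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("list nodes: %v", err)
		} else {
			fmt.Printf("cluster has %d nodes\n", len(nodes.Items))
		}
		time.Sleep(10 * time.Minute)
	}
}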

stevehipwell avatar May 18 '22 08:05 stevehipwell

@stevehipwell @serathius here's a similar issue: https://github.com/fluxcd/flux/issues/3610 (was also listed in the above report btw, removed it because irrelevant for this repo).

There seems to be more to it than only the SDK version, see https://github.com/fluxcd/flux/issues/3610#issuecomment-1125217274

BitProcessor avatar May 18 '22 08:05 BitProcessor

@BitProcessor I received the same warning. Have you solved this problem? Are there any updates, and is this related to Fluent Bit?

hjflz8821 avatar Jun 14 '22 04:06 hjflz8821

@hjflz8821 Fluent Bit wasn't using a Kubernetes SDK, so they needed to change their custom code; if you're using the client-go SDK there shouldn't be any change needed.

What version of Metrics Server are you running and on which EKS version? Could you supply the part of the email related to Metrics Server?

stevehipwell avatar Jun 14 '22 08:06 stevehipwell

@stevehipwell I'm using an older version, Metrics Server v0.3.6. I don't know if it's related to this, but my client judges that there is no need to update Metrics Server for this warning and that the issue is more likely related to fluent-bit. Here is the email we received:

We have identified applications running in one or more of your Amazon EKS clusters that are not refreshing service account tokens. Applications that use expired tokens to make requests to the Kubernetes API server will fail. This issue can be resolved by updating the application and its dependencies to use a newer version of the Kubernetes client SDK, which automatically refreshes tokens.

What is the problem?

Kubernetes version 1.21 graduated the BoundServiceAccountTokenVolume feature [1] to beta and enabled it by default. This feature improves the security of service account tokens by requiring a one-hour expiry, compared to the previous default of no expiration. This means that applications that do not periodically refetch service account tokens will make requests to the Kubernetes API server with an expired token and receive an HTTP 401 unauthorized error response. For more information on the BoundServiceAccountToken feature, see the EKS Kubernetes 1.21 release notes [2].

EKS v1.21 and later extends the service account token validity period to 90 days to make the migration to the new time-bound service account tokens smoother. Applications on EKS v1.21 and later clusters that make API server requests with tokens older than 90 days will receive an HTTP 401 unauthorized error response.

How can you resolve the issue?

To make the transition to time bound service account tokens easier, Kubernetes has updated the below official
versions of client SDKs to automatically refetch tokens before the one hour expiration:

* Go v0.15.7 and later
* Python v12.0.0 and later
* Java v9.0.0 and later
* Javascript v0.10.3 and later
* Ruby master branch
* Haskell v0.3.0.0

We recommend that you update your application and its dependencies to use one of the above client SDK versions
if you are on an older version. 

Although not an exhaustive list, the following AWS components have been updated to use the new Kubernetes client SDK, which automatically refetches tokens:

* Amazon VPC CNI: v1.8.0 or later
* CoreDNS: v1.8.4 or later
* AWS Load Balancer Controller: v2.0.0 or later
* kube-proxy: v1.21.2-eksbuild.2 or later

How can I identify the service account that uses the old token?
<eks-cluster-arn>|<namespace:serviceaccount>
arn:aws:eks:xxxxx:xxxxxxxx:cluster/xxxxxxxxx|metrics-server:metrics-metrics-server

Use the CloudWatch Logs Insights query provided in the Troubleshooting section of the EKS documentation [7] to see the current list of pods using service accounts with stale tokens and the time elapsed since the tokens were created.

We recommend that you update your applications that are using stale service account tokens, and their dependencies, to use one of the newer Kubernetes client SDKs that refetch tokens.

If your service account token is close to expiring (less than 90 days remaining) and you don't have enough time to update the client SDK version before it expires, you can delete the existing pod so that a new one is created. The new pod fetches a fresh service account token, giving you an additional 90-day period in which to update the client SDK.

Environment:

  • Kubernetes distribution: EKS 1.21
  • aws-for-fluent-bit: 2.26.0 (current)

      community.kubernetes.helm:
        name: metrics
        chart_ref: metrics-server/metrics-server
        release_namespace: metrics-server
        update_repo_cache: true
        chart_version: 2.11.2
        atomic: true
        create_namespace: true
        kubeconfig: '{{ k8s_kubeconfig }}'
        context: '{{ eks_cluster_context }}'
        values: "{{ lookup('template', 'k8-charts/metrics-server/values.yml') | from_yaml }}"
        state: "{{ state | default('present') }}"
      tags:
        - metrics-server

This is a part of the CloudWatch log; I'm not sure if this makes sense:

subject: system:serviceaccount:metrics-server:metrics-metrics-server, seconds after warning threshold: 5323328

RBAC: allowed by ClusterRoleBinding "metrics-metrics-server:system:auth-delegator" of ClusterRole "system:auth-delegator" to ServiceAccount "metrics-metrics-server/metrics-server" 

hjflz8821 avatar Jun 15 '22 03:06 hjflz8821

@hjflz8821 the version of Metrics Server you're running is 3 years old and built against a version of client-go which predates the refresh token implementation.

stevehipwell avatar Jun 15 '22 06:06 stevehipwell

@stevehipwell Thanks a lot for your answer

So if I upgrade to v0.6.1, it will most likely solve this problem, because the latest version ships with a newer version of client-go.
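
Something like this is roughly what I'd change (a sketch based on my task above; it assumes the "metrics-server" Helm repo now points at https://kubernetes-sigs.github.io/metrics-server/ where the 3.x chart lives, and the values template will likely need adjusting for the 3.x chart's renamed values):

      community.kubernetes.helm:
        name: metrics
        chart_ref: metrics-server/metrics-server
        release_namespace: metrics-server
        update_repo_cache: true
        chart_version: 3.8.2        # 3.x charts ship metrics-server v0.6.x
        atomic: true
        create_namespace: true
        kubeconfig: '{{ k8s_kubeconfig }}'
        context: '{{ eks_cluster_context }}'
        values: "{{ lookup('template', 'k8-charts/metrics-server/values.yml') | from_yaml }}"
        state: "{{ state | default('present') }}"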

hjflz8821 avatar Jun 15 '22 08:06 hjflz8821

@hjflz8821 that would be my expectation.

stevehipwell avatar Jun 15 '22 08:06 stevehipwell

We were also running a very old metrics-server in our case, and have been tracking this issue after upgrading. But I'm not sure how to verify that the new version actually handles refreshing. I mean we could wait 90 days and see if things start to die but I'd prefer not to 😄 I don't really expect AWS to send out another email either 🤔

Flydiverny avatar Jun 15 '22 11:06 Flydiverny

@Flydiverny if you send the EKS control plane logs to CloudWatch it should show up there. Based on the AWS docs I'd expect to see it show up there after an hour.

stevehipwell avatar Jun 15 '22 12:06 stevehipwell

@Flydiverny if you send the EKS control plane logs to CloudWatch it should show up there. Based on the AWS docs I'd expect to see it show up there after an hour.

Cheers :) I was a bit lazy, AWS provides some useful docs here: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens

Running the CloudWatch Logs Insights query below clearly shows if you have stale tokens in use!

fields @timestamp
| filter @logStream like /kube-apiserver-audit/
| filter @message like /seconds after warning threshold/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime

We are using metrics-server/metrics-server helm chart version 3.8.2 and it is working as intended! :)

[screenshot: CloudWatch Logs Insights query results]

Thanks @stevehipwell, learned something new 🎉


Edit: For completeness, we are also on EKS 1.21 Server Version: v1.21.12-eks-a64ea69

Flydiverny avatar Jun 15 '22 15:06 Flydiverny

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 13 '22 16:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 13 '22 16:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 12 '22 17:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 12 '22 17:11 k8s-ci-robot