Bound Service Account Tokens compatibility in Kubernetes 1.21
/kind bug
What steps did you take and what happened:
Starting with Kubernetes 1.21, the way Service Account tokens are issued and mounted by default changed to the Bound Service Account Token Volume:
- https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md
With this feature, Service Account tokens have an expiration date, and the application is responsible for periodically reloading the token before it expires.
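Concretely, "reloading the token" means re-reading the projected token file before it expires instead of caching it once at process start. The following is a minimal sketch (not code from any Kubeflow component; it assumes the standard in-cluster paths and that the requests library is available in the image) of an API call that re-reads the token on every request:

```python
# Minimal sketch: call the Kubernetes API from inside a Pod while re-reading
# the projected Service Account token on every request, so a rotated (bound)
# token is always picked up before the old one expires.
import os
import requests  # assumption: requests is available in the image

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
API_SERVER = "https://{}:{}".format(
    os.environ["KUBERNETES_SERVICE_HOST"], os.environ["KUBERNETES_SERVICE_PORT"]
)

def list_pods(namespace: str) -> dict:
    # Re-read the token file here: the kubelet rewrites it before the bound
    # token expires, so a fresh read always yields a valid credential.
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    resp = requests.get(
        f"{API_SERVER}/api/v1/namespaces/{namespace}/pods",
        headers={"Authorization": f"Bearer {token}"},
        verify=CA_PATH,
    )
    resp.raise_for_status()
    return resp.json()
```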
AWS has provided a guide on how to identify running Pods that use old Service Account tokens: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens .
To identify Pods that use stale Service Account tokens (in EKS):
- Have a cluster running for over an hour
- Enable the Audit Logs to CloudWatch option
- Open Logs Insights
- Select the log group of your cluster
- Find events that contain the annotation annotations.authentication.k8s.io/stale-token, for example with the following Logs Insights query (a sketch for running it programmatically follows below):
filter @logStream like 'kube-apiserver-audit' | filter ispresent(`annotations.authentication.k8s.io/stale-token`) | parse `annotations.authentication.k8s.io/stale-token` "subject: *," as subject
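For reference, the same query can also be run programmatically against CloudWatch Logs Insights. A minimal boto3 sketch, assuming the default EKS audit log group name /aws/eks/<cluster-name>/cluster (the cluster name below is a placeholder):

```python
# Sketch: run the stale-token Logs Insights query from the AWS docs via boto3
# and print the subjects that still present old Service Account tokens.
import time
import boto3

logs = boto3.client("logs")
query = (
    "filter @logStream like 'kube-apiserver-audit' "
    "| filter ispresent(`annotations.authentication.k8s.io/stale-token`) "
    "| parse `annotations.authentication.k8s.io/stale-token` "
    '"subject: *," as subject'
)

now = int(time.time())
started = logs.start_query(
    logGroupName="/aws/eks/my-cluster/cluster",  # placeholder cluster name
    startTime=now - 3600,  # look at the last hour of audit events
    endTime=now,
    queryString=query,
)

# Poll until the query finishes, then print each matching event's fields.
while True:
    result = logs.get_query_results(queryId=started["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```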
Using this method, we found some components that are using stale tokens in our Kubeflow deployment.
What did you expect to happen:
Kubeflow components should be compatible with Bound Service Account Tokens, and no annotations.authentication.k8s.io/stale-token audit events should appear.
Anything else you would like to add:
The official Kubernetes SDKs have been updated to handle the token reload automatically, so the simplest solution would be to upgrade the Kubernetes SDK dependencies of all the affected components.
The Kubernetes SDK versions that handle this feature are:
- Go v0.15.7 and later
- Python v12.0.0 and later
- Java v9.0.0 and later
- Javascript v0.10.3 and later
- Ruby master branch
- Haskell v0.3.0.0
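For the Python-based components, upgrading means the application code does not have to change; the in-cluster config loader in kubernetes>=12.0.0 refreshes the bound token on its own. A minimal sketch of the expected usage (the namespace below is just an example):

```python
# Sketch: with kubernetes>=12.0.0, load_incluster_config() re-reads the
# projected Service Account token as it rotates, so a long-running component
# no longer starts receiving 401 Unauthorized once the original token expires.
from kubernetes import client, config

config.load_incluster_config()  # token refresh is handled by the SDK
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("kubeflow").items:
    print(pod.metadata.name)
```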
Environment:
- Kubeflow version (version number can be found at the bottom left corner of the Kubeflow dashboard): 1.4
- kfctl version (use kfctl version):
- Kubernetes platform: EKS
- Kubernetes version: 1.21.9
- OS (e.g. from /etc/os-release): Amazon Linux
We have detected some components that are not ready for Bound Service Account Tokens. A (non-exhaustive) list of them is:
- centraldashboard
- tensorboard-controller (should be fixed by https://github.com/kubeflow/kubeflow/pull/6406)
- dex
- spark-operator
ml-pipeline/metadata-writer also produces stale tokens: its Kubernetes SDK version is still 10.1.0, and it should be >= v12.0.0.
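A quick way to confirm which SDK version a Python component actually ships (for example by running this inside the metadata-writer container via kubectl exec) is to inspect the installed kubernetes package; a small sketch:

```python
# Sketch: check whether the installed kubernetes Python SDK is new enough to
# refresh bound Service Account tokens (>= 12.0.0, per the list above).
from importlib.metadata import version  # Python 3.8+

installed = version("kubernetes")
major = int(installed.split(".")[0])
print(f"kubernetes SDK version: {installed}")
if major < 12:
    print("Too old for Bound Service Account Tokens; upgrade to >= 12.0.0")
```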
I have installed Kubeflow on EKS and upgraded Kubernetes to v1.21. AWS extended the token expiry period to 90 days, and 90 days after I upgraded Kubernetes, the central dashboard pod on my cluster hit the following errors:
Unable to fetch ConfigMap: { kind: 'Status', apiVersion: 'v1', metadata: {}, status: 'Failure', message: 'Unauthorized', reason: 'Unauthorized', code: 401 }
Unable to fetch Events for sakuzuai-mlops: { kind: 'Status', apiVersion: 'v1', metadata: {}, status: 'Failure', message: 'Unauthorized', reason: 'Unauthorized', code: 401 }
After I restarted all pods in the kubeflow namespace, the error was gone. The central dashboard image tag I deployed is central-dashboard:v1.3.0. This should be related to Bound Service Account Tokens.
/close
There has been no activity for a long time. Please reopen if necessary.
@juliusvonkohout: Closing this issue.
In response to this:
/close
There has been no activity for a long time. Please reopen if necessary.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.