splunk-connect-for-kubernetes
Compatibility with EKS 1.21 and token service account expiry
What happened: After our EKS cluster was upgraded to 1.21, we saw annotations like the following appear in the API server audit logs in AWS for the service accounts that the Splunk Connect pods are using:
subject: system:serviceaccount:<namespace here>:<sa name here>, seconds after warning threshold: 3989
This is due to changes in token expiry in K8s 1.21 as described here: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens
It would appear that there is a 90d grace period, after which tokens will be rejected. It looks like the Splunk Connect agents need to use a later client SDK version, or is there a workaround?
What you expected to happen: A more recent k8s client SDK would be used in Splunk Connect 1.4.11 so that the tokens wouldn't get flagged. At whatever kube version AWS switches to the default 1h tokens, the pods will start getting errors from the API server after an hour (unless they are restarted earlier, which I think would refresh the token as well).
How to reproduce it (as minimally and precisely as possible): Install or upgrade EKS to 1.21 and check the EKS cluster API server audit logs with this query:
fields @timestamp
| filter @logStream like /kube-apiserver-audit/
| filter @message like /seconds after warning threshold/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime
based on: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.21
- Ruby version (use ruby --version):
- OS (e.g.: cat /etc/os-release):
- Splunk version:
- Splunk Connect for Kubernetes helm chart version: 1.4.11
- Others:
@harshit-splunk actually I would think this is a bug, as it can lead to issues after 90d for clusters where nodes don't cycle often. Can you please change the priority of this?
I can see this warning also on OpenShift 4.8 (K8s 1.21). Can't find documentation about the "hard expiry" time on OpenShift.
BoundServiceAccountTokenVolume: Migrate ServiceAccount volumes to use a projected volume consisting of a ServiceAccountTokenVolumeProjection. Cluster admins can use metric serviceaccount_stale_tokens_total to monitor workloads that are depending on the extended tokens. If there are no such workloads, turn off extended tokens by starting kube-apiserver with flag --service-account-extend-token-expiration=false. Check Bound Service Account Tokens for more details.
In "Bound Service Account Tokens: Safe Rollout of Time-bound Token" I read:
These extended tokens would not expire and continue to be accepted within one year.
@vinzent I think EKS has made them valid only for 90d, but yeah, I think vanilla k8s and other distros stick with the default 1y, and at some k8s version after that time they will switch to the default of 1h. Btw. this would not count from now, of course; 1.21 was released around mid last year, and we get 3 releases a year, so there might not really be 1y left to update this.
This is a major issue for us and will result in us having to look at alternative technologies. We've raised it with our splunk account manager, so hopefully it can be fixed quickly.
We have prioritised this request and will be working on this ASAP. Meanwhile, you can probably use service account token volume projection, where you can provide expirationSeconds for the token. I haven't tested it yet though.
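For reference, here is a minimal sketch of what such a projected token volume could look like on a logging pod. The pod name, service account name, mount path and the ~30-day expirationSeconds are illustrative assumptions, not values taken from the chart:

```yaml
# Hypothetical pod snippet with a projected service account token and a custom expiry.
# Names, paths and the image tag are assumptions for illustration only.
apiVersion: v1
kind: Pod
metadata:
  name: splunk-logging-example                     # hypothetical name
spec:
  serviceAccountName: splunk-kubernetes-logging    # assumed SA name
  containers:
    - name: fluentd
      image: splunk/fluentd-hec:1.2.13
      volumeMounts:
        - name: sa-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: sa-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 2592000           # ~30 days; kubelet rotates the file before it expires
```

Note that this only gives the pod a longer-lived, kubelet-rotated token file; the plugins still have to re-read that file instead of caching the token from start-up, which is exactly what the rest of this thread is about.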
@harshit-splunk in the doc you linked there is also this statement:
- The application is responsible for reloading the token when it rotates. Periodic reloading (e.g. once every 5 minutes) is sufficient for most use cases.
Before I jump into this, can you please confirm the token is periodically reloaded by the software and not only on app/pod start-up?
@tomsucho kubelet will update the token file periodically based on the expiration time. Until refreshing the token is supported, we can probably increase the token expiration time (e.g. to 1 month).
@harshit-splunk but will the application, i.e. fluentd, reload the token periodically? What's the point of having kubelet refresh it every hour if the fluentd process keeps using the token it read during startup?
The token is already rotated by k8s periodically. Only the user of the token needs to reload it from disk. IMHO "vanilla" fluentd doesn't use the k8s API; it's the plugins that use it, like kubernetes_metadata_filter.
There is already an issue open about that: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/issues/323
Maybe SCK uses more of the k8s API for the objects/metrics components.
This is also an issue for us, especially after AWS started sending e-mails reminding us about it...
I can confirm that running the 1.4.15 version of the chart still produces log messages about tokens that are too old.
Also an issue here. Any ETA regarding when a new chart version will be released to solve this problem?
Hi @tomsucho, @vinzent, I have updated each plugin so that it reads the bearer token file on each request. I have pushed images to Docker Hub:
hvaghani/fluentd-hec:1.2.13-refresh-token
hvaghani/kube-objects:1.1.12-refresh-token
hvaghani/k8s-metrics:1.1.12-refresh-token
hvaghani/k8s-metrics-aggr:1.1.12-refresh-token
Can you test these images and see if they work? I have tested them on EKS and haven't received any audit logs for the warning threshold so far.
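In case it helps others testing, here is a sketch of a Helm values override pointing the components at those test images. The exact value keys (per-subchart image.name / image.tag, and the aggregator key) are assumptions about how the umbrella chart is structured, so check your chart's values.yaml first:

```yaml
# Hypothetical values override for testing the refresh-token images.
# The key layout (per-subchart image.name/image.tag, imageAgg for the aggregator)
# is an assumption; verify against the chart's values.yaml before applying.
splunk-kubernetes-logging:
  image:
    name: hvaghani/fluentd-hec
    tag: 1.2.13-refresh-token
splunk-kubernetes-objects:
  image:
    name: hvaghani/kube-objects
    tag: 1.1.12-refresh-token
splunk-kubernetes-metrics:
  image:
    name: hvaghani/k8s-metrics
    tag: 1.1.12-refresh-token
  imageAgg:
    name: hvaghani/k8s-metrics-aggr
    tag: 1.1.12-refresh-token
```

Applied on top of your existing values with something like helm upgrade -f.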
@harshit-splunk I can probably test the hvaghani/fluentd-hec image tomorrow. We don't use the other components.
@harshit-splunk pods with hvaghani/fluentd-hec:1.2.13-refresh-token have been running for 3 hours now. I don't see any annotations.authentication.k8s.io/stale-token in the audit logs on OpenShift 4.8. According to the pod definition the token expires after 3607 seconds (expirationSeconds: 3607). I would expect that without a refreshed token the stale-token audit annotation would already be back.
This issue is stale because it has been open for 30 days with no activity.
Not stale
Sorry to be impatient, but is it possible to get an update on this fix being released in a new version of splunk/fluentd-hec?
@boatmisser I have created https://github.com/ManageIQ/kubeclient/pull/566 and am waiting for it to be merged. Meanwhile you can use the hvaghani/fluentd-hec:1.2.13-refresh-token image. It is built from https://github.com/splunk/fluent-plugin-splunk-hec/pull/248.
Support for refreshing the token was added in the SCK 1.5.0 release. Closing the issue.