
Compatibility with EKS 1.21 and service account token expiry

Open tomsucho opened this issue 2 years ago • 19 comments

What happened: After our EKS cluster was upgraded to 1.21, we started seeing annotations like the following in the API server audit logs in AWS, for the service accounts that the Splunk Connect pods use:

subject: system:serviceaccount:<namespace here>:<sa name here>, seconds after warning threshold: 3989

This is due to changes in token expiry in K8s 1.21 as described here: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens

It would appear that there is a 90-day grace period, after which tokens will be rejected. It looks like the Splunk Connect agents need to use a later client SDK version, or is there a workaround?

What you expected to happen: A more recent k8s client SDK used in Splunk Connect 1.4.11 so that the tokens wouldn't get flagged. At some kube version, when AWS switches to the default 1h tokens, the pods will start getting errors from the API server after an hour (unless they are restarted earlier, which I think would also refresh the token).

How to reproduce it (as minimally and precisely as possible): Install or upgrade EKS to 1.21 and check the EKS cluster API server audit logs with this CloudWatch Logs Insights query:

fields @timestamp
| filter @logStream like /kube-apiserver-audit/
| filter @message like /seconds after warning threshold/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime

based on: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.21
  • Ruby version (use ruby --version):
  • OS (e.g: cat /etc/os-release):
  • Splunk version:
  • Splunk Connect for Kubernetes helm chart version: 1.4.11
  • Others:

tomsucho avatar May 13 '22 07:05 tomsucho

@harshit-splunk actually I would think this is a bug, as it can lead to issues after 90 days for clusters where nodes don't cycle often. Can you please change the priority of this?

tomsucho avatar May 20 '22 07:05 tomsucho

I can see this warning on OpenShift 4.8 (K8s 1.21) as well. I can't find documentation about the "hard expiry" time on OpenShift.

Feature docs:

BoundServiceAccountTokenVolume: Migrate ServiceAccount volumes to use a projected volume consisting of a ServiceAccountTokenVolumeProjection. Cluster admins can use metric serviceaccount_stale_tokens_total to monitor workloads that are depending on the extended tokens. If there are no such workloads, turn off extended tokens by starting kube-apiserver with flag --service-account-extend-token-expiration=false. Check Bound Service Account Tokens for more details.

In Bound Service Account Tokens: Safe Rollout of Time-bound Token, I read:

These extended tokens would not expire and continue to be accepted within one year.

vinzent avatar May 20 '22 08:05 vinzent

@vinzent I think EKS has made them valid only for 90 days, but yeah, I think vanilla k8s and other distros stick with the default 1y, and at some k8s version after that time they will switch to the default of 1h. Btw, this would not count from now, of course; 1.21 was released around mid last year, and we get three releases a year, so there might not really be a full year left to update this.

tomsucho avatar May 20 '22 08:05 tomsucho

This is a major issue for us and will result in us having to look at alternative technologies. We've raised it with our Splunk account manager, so hopefully it can be fixed quickly.

eperdeme avatar May 31 '22 08:05 eperdeme

We have prioritised this request and will be working on it ASAP. Meanwhile, you can probably use service account token volume projection, where you can provide an expirationSeconds for the token. I haven't tested it yet, though.
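
For reference, a minimal sketch of what such a projected token volume could look like in a pod spec. This is illustrative only: the pod and volume names, image tag, and service account name are assumptions, and the API server may cap expirationSeconds below the requested value.

apiVersion: v1
kind: Pod
metadata:
  name: sck-logging-example            # hypothetical pod name
spec:
  serviceAccountName: splunk-kubernetes-logging   # assumed SA name, adjust to your release
  containers:
    - name: fluentd
      image: splunk/fluentd-hec:1.2.13             # illustrative tag
      volumeMounts:
        - name: projected-sa-token
          mountPath: /var/run/secrets/tokens
  volumes:
    - name: projected-sa-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 2592000   # ~1 month, as suggested above; the platform may enforce a lower maximum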

hvaghani221 avatar May 31 '22 12:05 hvaghani221

@harshit-splunk in the doc you linked there is also this statement:

  • The application is responsible for reloading the token when it rotates. Periodic reloading (e.g. once every 5 minutes) is sufficient for most use cases.

Before I jump into this, can you please confirm that the token is periodically reloaded by the software and not only read at app/pod start-up?

tomsucho avatar May 31 '22 13:05 tomsucho

@tomsucho kubelet will update the token file periodically based on the expiration time. Until refreshing the token is supported, we can probably increase the token expiration time (e.g. to 1 month).

hvaghani221 avatar May 31 '22 13:05 hvaghani221

@harshit-splunk but will the application, i.e. fluentd, reload the token periodically? What's the point of configuring kubelet to refresh it each hour if the fluentd process keeps using the token it read during startup?

tomsucho avatar May 31 '22 13:05 tomsucho

The token is already rotated by k8s periodically; only the consumer of the token needs to reload it from disk. IMHO "vanilla" fluentd doesn't use the k8s API, it's the plugins that use it, like kubernetes_metadata_filter.

There is already an issue open about that: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/issues/323

Maybe SCK uses the k8s API more heavily for the objects/metrics components.

vinzent avatar May 31 '22 14:05 vinzent

This is also an issue for us, especially now that AWS has started sending reminder e-mails about it...

I can confirm that running the 1.4.15 version of the chart still produces log messages about tokens that are too old.

herself avatar May 31 '22 16:05 herself

Also an issue here. Any ETA regarding when a new chart version will be released to solve this problem?

marc1161 avatar Jun 09 '22 09:06 marc1161

Also an issue here. Any ETA regarding when a new chart version will be released to solve this problem?

Anoojak avatar Jun 19 '22 13:06 Anoojak

Hi @tomsucho, @vinzent, I have updated each plugin so that it will read the bearer token file on each request. I have pushed images to Docker Hub.

hvaghani/fluentd-hec:1.2.13-refresh-token
hvaghani/kube-objects:1.1.12-refresh-token
hvaghani/k8s-metrics:1.1.12-refresh-token
hvaghani/k8s-metrics-aggr:1.1.12-refresh-token

Can you test these images and confirm whether they work? I have tested on EKS and haven't received any audit logs about the warning threshold so far.
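
For illustration, a minimal Ruby sketch of the approach described above (not the actual plugin code; the path constant and method names are made up for the example):

# Sketch only: re-read the projected service account token from disk before
# each API request instead of caching the value read at plugin startup.
TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'.freeze

def bearer_token
  # kubelet rotates this file periodically, so a fresh read never returns a stale token
  File.read(TOKEN_PATH).strip
end

def request_headers
  { 'Authorization' => "Bearer #{bearer_token}" }
end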

hvaghani221 avatar Jul 05 '22 11:07 hvaghani221

@harshit-splunk I can probably test the hvaghani/fluentd-hec image tomorrow. We don't use the other components.

vinzent avatar Jul 07 '22 14:07 vinzent

@harshit-splunk pods with hvaghani/fluentd-hec:1.2.13-refresh-token have been running for 3 hours now. I don't see any annotations.authentication.k8s.io/stale-token in the audit logs on OpenShift 4.8. According to the pod definition, the token expires after 3607 seconds (expirationSeconds: 3607). I would expect that, without a refreshed token, the stale-token audit annotation would already be back.

vinzent avatar Jul 08 '22 14:07 vinzent

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Aug 08 '22 02:08 github-actions[bot]

Not stale

hvaghani221 avatar Aug 08 '22 04:08 hvaghani221

Sorry to be impatient, but is it possible to get an update on when this fix will be released in a new version of splunk/fluentd-hec?

boatmisser avatar Aug 08 '22 23:08 boatmisser

@boatmisser I have created https://github.com/ManageIQ/kubeclient/pull/566 and am waiting for it to be merged. Meanwhile, you can use the hvaghani/fluentd-hec:1.2.13-refresh-token image. It is built from https://github.com/splunk/fluent-plugin-splunk-hec/pull/248

hvaghani221 avatar Aug 09 '22 07:08 hvaghani221

Support for refreshing the token was added in the SCK 1.5.0 release. Closing the issue.

hvaghani221 avatar Aug 17 '22 08:08 hvaghani221