cluster-api-provider-aws
"The workload cluster kubeconfig (for use by controllers, not end users) causes unauthorized errors after exactly 15 minutes"
What steps did you take and what happened:
With cluster-api-provider-aws, the kubeconfig secret contains a short-lived token retrieved via the AWS STS service (check here for reference), which needs to be refreshed every 15 minutes in the current configuration.
Instead of recreating the client for the observed cluster before (or after) the token expires, cluster-api fails with unauthorized errors and stops working for a few minutes. Interestingly, this also affects the cluster-autoscaler, which fails with unauthorized errors, restarts, and continues. It might also be a bug in the client-go dependency.
What did you expect to happen: The provided kubeconfig should allow the client to refresh its credentials once they expire.
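To illustrate the expectation, here is a minimal sketch (not the actual CAPI/CAPA code) of rebuilding the workload-cluster client from the kubeconfig secret before the token's 15-minute lifetime runs out. It assumes the usual `<cluster-name>-kubeconfig` secret with the `value` key; the function name is hypothetical:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// workloadClientset rebuilds a clientset for the workload cluster from the
// kubeconfig secret stored on the management cluster. Because the embedded STS
// token is only valid for ~15 minutes, a controller would have to re-run this
// (or requeue) before the token expires instead of reusing a stale client.
func workloadClientset(ctx context.Context, mgmt client.Client, clusterName, namespace string) (*kubernetes.Clientset, error) {
	var secret corev1.Secret
	key := types.NamespacedName{Namespace: namespace, Name: clusterName + "-kubeconfig"}
	if err := mgmt.Get(ctx, key, &secret); err != nil {
		return nil, err
	}
	restCfg, err := clientcmd.RESTConfigFromKubeConfig(secret.Data["value"])
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(restCfg)
}
```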
Anything else you would like to add:
Logs of all cap* components: https://gist.github.com/xvzf/78d47ce6c3d6fabd49bc902e5d22d467
Manifests to reproduce it: https://gist.github.com/xvzf/87c934945ad97fe3bb9ee4934c6478ce
Environment:
- Cluster-api version: v1.0.2
- Cluster-api-aws version: v1.2.0
- Minikube/KIND version: EKS 1.21.2 (should be irrelevant)
- Kubernetes version (use kubectl version): 1.21.2
- OS (e.g. from /etc/os-release): AmazonLinux
/kind bug
@xvzf: This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The root cause of the node deletion turned out to be a cluster-autoscaler instance with a different configuration that was assumed to be terminated but was still running in containerd.
The issue of the "dying autoscaler" and of cluster-api failing to communicate with the workload cluster is still relevant, though; I have adapted the issue accordingly.
cc @richardcase for triaging.
I need to look at the code, but this seems like an upstream CAPI issue... if the token/kubeconfig has been refreshed in CAPA.
The reason we have a max sync period of 10 minutes in CAPA when using EKS is so that we refresh the token/kubeconfig before the 15-minute token expiry.
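For context, a rough sketch of what that looks like in a controller-runtime manager, assuming the `SyncPeriod` option available in the controller-runtime releases contemporary with CAPI v1.0.x (CAPA's actual flag wiring may differ):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Resync every 10 minutes so cached clients and the EKS kubeconfig token
	// are refreshed well before the 15-minute STS token expiry.
	syncPeriod := 10 * time.Minute

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
	if err != nil {
		panic(err)
	}

	// Reconcilers would be registered with mgr here before starting it.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```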
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/assign /lifecycle active
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
FYI, https://github.com/kubernetes-sigs/cluster-api/pull/7356 should fix the multi-minute delay until the capi-controller is functioning properly again.
/remove-lifecycle stale
/retitle "The workload cluster kubeconfig (for use by controllers, not end users) causes unauthorized errors after exactly 15 minutes"
This only affects managed (EKS) clusters.
/area provider/eks
This is a known issue affecting cluster-autoscaler as well: https://github.com/kubernetes/autoscaler/issues/4784
/triage accepted
/priority important-soon
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This is known to be caused by 2 scenarios (but there may be more):
- There is an issue with cluster-autoscaler caching the kubeconfig when it first starts. There is an issue open in cluster-autoscaler for this: https://github.com/kubernetes/autoscaler/issues/4784
- There are not enough reconciler goroutines to process all the cluster declarations, which causes the token to be refreshed only after it expires (see the sketch below).
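For the second scenario, a hedged sketch of raising the worker count via controller-runtime's `MaxConcurrentReconciles` option; the watched type, the value of 10, and the no-op reconciler are placeholders rather than CAPA's actual settings:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupController registers a reconciler with more worker goroutines, so that a
// large number of cluster objects can all be requeued and reconciled again
// before their kubeconfig tokens expire, rather than queueing behind one worker.
func setupController(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Secret{}). // placeholder object type; CAPA watches its own CRDs
		WithOptions(controller.Options{
			MaxConcurrentReconciles: 10, // hypothetical value; tune to the number of workload clusters
		}).
		Complete(reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
			// No-op reconciler, purely for illustration.
			return reconcile.Result{}, nil
		}))
}
```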
@xvzf - would you be able to clarify what you expect from this? And also would it be possible to get some extra background on your scenario?
/triage needs-information
/cc @Skarlso
FYI, I'm working with Michael on a solution for the kubeconfig refresh issue (hence the cc).
I've been out with COVID in recent days, and some holidays are coming up. After that I'll resume the fix.
/triage accepted
Hey
I'm not actively working with AWS anymore, maybe @charlie-haley knows more!
Cheers
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten