"The workload cluster kubeconfig (for use by controllers, not end users) causes unauthorized errors after exactly 15 minutes"

Open xvzf opened this issue 3 years ago • 21 comments

What steps did you take and what happened:

With cluster-api-provider-aws, the kubeconfig secret contains a hinted token which is retrieved from the AWS STS service (check here for reference). In the current configuration, this token needs to be refreshed every 15 minutes.

Instead of recreating the client for the observed cluster before or after the token expires, cluster-api fails with unauthorized errors and stops working for a few minutes. Interestingly, this also affects the cluster-autoscaler, which fails with unauthorized errors, restarts, and then continues. It might also be a bug in the client-go dependency.

What did you expect to happen:

The provided kubeconfig allows the client to refresh its credentials after they expire.
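The expected behaviour can be pictured with a minimal sketch, assuming the 15-minute token lifetime reported here. `RefreshingTokenHolder`, `fake_fetch`, and the 60-second margin are hypothetical names and values chosen for illustration; this is not client-go, CAPI, or CAPA code.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

TOKEN_LIFETIME_S = 15 * 60  # token lifetime reported in this issue
REFRESH_MARGIN_S = 60       # hypothetical safety margin, not a CAPA setting


@dataclass
class Token:
    value: str
    expires_at: float  # seconds, on the same clock as `now`


class RefreshingTokenHolder:
    """Caches a token and re-fetches it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[float], Token],
        now: Callable[[], float] = time.monotonic,
        margin_s: float = REFRESH_MARGIN_S,
    ) -> None:
        self._fetch = fetch
        self._now = now
        self._margin_s = margin_s
        self._token: Optional[Token] = None

    def get(self) -> str:
        """Return a token that is valid for at least `margin_s` more seconds."""
        current = self._now()
        stale = (
            self._token is None
            or current >= self._token.expires_at - self._margin_s
        )
        if stale:
            # Re-read the credential instead of reusing the cached copy.
            self._token = self._fetch(current)
        return self._token.value


def fake_fetch(now: float) -> Token:
    """Stand-in for re-reading the kubeconfig secret; labels tokens by issue time."""
    return Token(
        value=f"token-issued-at-{int(now)}",
        expires_at=now + TOKEN_LIFETIME_S,
    )
```

A caller would construct one holder per workload cluster and call `get()` before each request, so a token cached at startup is never reused past its expiry.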

Anything else you would like to add:

  • Logs of all cap* components: https://gist.github.com/xvzf/78d47ce6c3d6fabd49bc902e5d22d467
  • Manifests to reproduce it: https://gist.github.com/xvzf/87c934945ad97fe3bb9ee4934c6478ce

Environment:

  • Cluster-api version: v1.0.2
  • Cluster-api-aws version: v1.2.0
  • Minikube/KIND version: EKS 1.21.2 (should be irrelevant)
  • Kubernetes version: (use kubectl version): 1.21.2
  • OS (e.g. from /etc/os-release): AmazonLinux

/kind bug

xvzf avatar Jan 07 '22 10:01 xvzf

@xvzf: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 07 '22 10:01 k8s-ci-robot

The root cause of the node deletion was a supposedly terminated instance of cluster-autoscaler with a different configuration (it was still running in containerd).

The "dying autoscaler" problem and cluster-api's trouble communicating with the workload cluster are still relevant though. I have adapted the issue accordingly.

xvzf avatar Jan 10 '22 14:01 xvzf

cc @richardcase for triaging.

sedefsavas avatar Jan 19 '22 18:01 sedefsavas

cc @richardcase for triaging.

I need to look at the code, but this seems like an upstream CAPI issue... if the token/kubeconfig has been refreshed in CAPA.

richardcase avatar Jan 20 '22 12:01 richardcase

The reason we have a max sync period of 10 minutes in CAPA when using EKS is so that we refresh the token/kubeconfig before the 15-minute token expiry.
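The arithmetic behind that limit, as a minimal sketch. The two constants come from this thread; the function names are hypothetical, and capping a larger requested value (rather than rejecting it) is an assumption made for illustration.

```python
TOKEN_EXPIRY_S = 15 * 60     # token expiry stated in this thread
MAX_SYNC_PERIOD_S = 10 * 60  # maximum sync period stated in this thread


def effective_sync_period_s(requested_s: int, max_s: int = MAX_SYNC_PERIOD_S) -> int:
    """Cap a requested sync period at the maximum (capping is an assumption)."""
    if requested_s <= 0:
        raise ValueError("sync period must be positive")
    return min(requested_s, max_s)


def refresh_headroom_s(sync_period_s: int, expiry_s: int = TOKEN_EXPIRY_S) -> int:
    """Seconds left before expiry if the refresh happens exactly one period later."""
    return expiry_s - sync_period_s
```

With a 10-minute period the refresh lands 5 minutes before expiry, which only holds if the reconcile actually runs on schedule.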

richardcase avatar Jan 20 '22 12:01 richardcase

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 24 '22 18:04 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar May 24 '22 19:05 k8s-triage-robot

/remove-lifecycle rotten

richardcase avatar Jun 08 '22 11:06 richardcase

/assign
/lifecycle active

richardcase avatar Jun 14 '22 08:06 richardcase

/lifecycle stale

k8s-triage-robot avatar Sep 12 '22 09:09 k8s-triage-robot

FYI, https://github.com/kubernetes-sigs/cluster-api/pull/7356 should fix the delay of multiple minutes until capi-controller is functioning properly again.

codablock avatar Oct 06 '22 09:10 codablock

/remove-lifecycle stale

richardcase avatar Oct 10 '22 10:10 richardcase

/retitle "The workload cluster kubeconfig (for use by controllers, not end users) causes unauthorized errors after exactly 15 minutes"

This only affects managed (EKS) clusters.

/area provider/eks

This is a known issue affecting cluster-autoscaler as well: https://github.com/kubernetes/autoscaler/issues/4784

/triage accepted

/priority important-soon

dlipovetsky avatar Dec 12 '22 17:12 dlipovetsky

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Mar 12 '23 17:03 k8s-triage-robot

This is known to happen in two scenarios (there may be more):

  • cluster-autoscaler caches the kubeconfig when it first starts. There is an open issue for this in cluster-autoscaler: https://github.com/kubernetes/autoscaler/issues/4784
  • There are not enough reconciler goroutines to process all the cluster declarations, so the token is refreshed only after it expires.
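The second scenario can be approximated with a deliberately simplified model: every cluster is queued at once, each reconcile takes a fixed time, and workers pick up clusters in rounds. The function names and this queueing model are assumptions for illustration, not how the controller actually schedules work; the 10-minute and 15-minute defaults are the values mentioned earlier in this thread.

```python
import math


def worst_case_refresh_gap_s(
    clusters: int, workers: int, reconcile_s: float, sync_period_s: float
) -> float:
    """Rough upper bound on the time between two reconciles of one cluster."""
    if clusters <= 0 or workers <= 0:
        raise ValueError("clusters and workers must be positive")
    # Each round handles up to `workers` clusters in parallel.
    rounds = math.ceil(clusters / workers)
    return max(sync_period_s, rounds * reconcile_s)


def token_can_expire(
    clusters: int,
    workers: int,
    reconcile_s: float,
    sync_period_s: float = 10 * 60,   # max sync period from this thread
    token_expiry_s: float = 15 * 60,  # token expiry from this thread
) -> bool:
    """True if the modelled gap between refreshes exceeds the token lifetime."""
    gap = worst_case_refresh_gap_s(clusters, workers, reconcile_s, sync_period_s)
    return gap > token_expiry_s
```

Under these assumptions, 100 clusters at 30 seconds each stay within the expiry window with 10 workers, but not with a single worker.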

@xvzf - would you be able to clarify what you expect from this? And also would it be possible to get some extra background on your scenario?

/triage needs-information

/cc @Skarlso

richardcase avatar Apr 03 '23 16:04 richardcase

FYI, I'm working with Michael on a solution for the kubeconfig refresh problem (hence the cc).

I've been out with COVID for the last few days and have some holidays coming up. After that I'll resume work on the fix.

Skarlso avatar Apr 03 '23 19:04 Skarlso

/triage accepted

dlipovetsky avatar May 01 '23 16:05 dlipovetsky

Hey

I'm not actively working with AWS anymore, maybe @charlie-haley knows more!

Cheers

xvzf avatar May 05 '23 08:05 xvzf

/remove-triage accepted

k8s-triage-robot avatar Jan 18 '24 23:01 k8s-triage-robot

This issue is currently awaiting triage.

k8s-ci-robot avatar Jan 18 '24 23:01 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 17 '24 23:04 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar May 17 '24 23:05 k8s-triage-robot