cluster-api-provider-aws
"The workload cluster kubeconfig (for use by controllers, not end users) causes unauthorized errors after exactly 15 minutes"
What steps did you take and what happened:
With cluster-api-provider-aws, the kubeconfig secret contains a short-lived token retrieved via the AWS STS service (check here for reference), which needs to be refreshed every 15 minutes in the current configuration.
Instead of recreating the client for the observed cluster before (or after) the token expires, cluster-api fails with unauthorized errors and stops working for a few minutes. Interestingly, this also affects the cluster-autoscaler, which fails with unauthorized errors, restarts, and continues. It might also be a bug in the client-go dependency.
What did you expect to happen: The provided kubeconfig should allow the client to refresh its credentials once they expire.
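To illustrate the expectation, here is a minimal sketch (not the actual CAPI/CAPA code) of rebuilding the workload-cluster client from the kubeconfig secret before the token's 15-minute lifetime runs out. It assumes the usual `<cluster-name>-kubeconfig` secret with the `value` key; the function name is hypothetical:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// workloadClientset rebuilds a clientset for the workload cluster from the
// kubeconfig secret stored on the management cluster. Because the embedded STS
// token is only valid for ~15 minutes, a controller would have to re-run this
// (or requeue) before the token expires instead of reusing a stale client.
func workloadClientset(ctx context.Context, mgmt client.Client, clusterName, namespace string) (*kubernetes.Clientset, error) {
	var secret corev1.Secret
	key := types.NamespacedName{Namespace: namespace, Name: clusterName + "-kubeconfig"}
	if err := mgmt.Get(ctx, key, &secret); err != nil {
		return nil, err
	}
	restCfg, err := clientcmd.RESTConfigFromKubeConfig(secret.Data["value"])
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(restCfg)
}
```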
Anything else you would like to add:
Logs of all cap* components: https://gist.github.com/xvzf/78d47ce6c3d6fabd49bc902e5d22d467
Manifests to reproduce it: https://gist.github.com/xvzf/87c934945ad97fe3bb9ee4934c6478ce
Environment:
- Cluster-api version: v1.0.2
- Cluster-api-aws version: v1.2.0
- Minikube/KIND version: EKS 1.21.2 (should be irrelevant)
- Kubernetes version (use kubectl version): 1.21.2
- OS (e.g. from /etc/os-release): AmazonLinux
/kind bug
@xvzf: This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The root cause of the node deletion turned out to be a cluster-autoscaler instance with a different configuration that was assumed to be terminated but was still running in containerd.
The issue of the "dying autoscaler" and of cluster-api failing to communicate with the workload cluster is still relevant, though; I have adapted the issue accordingly.
cc @richardcase for triaging.
I need to look at the code, but this seems like an upstream CAPI issue... if the token/kubeconfig has been refreshed in CAPA.
The reason we have a max sync period of 10 minutes in CAPA when using EKS is so that we refresh the token/kubeconfig before the 15-minute token expiry.
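For context, a rough sketch of what that looks like in a controller-runtime manager, assuming the `SyncPeriod` option available in the controller-runtime releases contemporary with CAPI v1.0.x (CAPA's actual flag wiring may differ):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Resync every 10 minutes so cached clients and the EKS kubeconfig token
	// are refreshed well before the 15-minute STS token expiry.
	syncPeriod := 10 * time.Minute

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
	if err != nil {
		panic(err)
	}

	// Reconcilers would be registered with mgr here before starting it.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```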
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/assign /lifecycle active
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
FYI, https://github.com/kubernetes-sigs/cluster-api/pull/7356 should fix the multi-minute delay until the capi-controller is functioning properly again.
/remove-lifecycle stale
/retitle "The workload cluster kubeconfig (for use by controllers, not end users) causes unauthorized errors after exactly 15 minutes"
This only affects managed (EKS) clusters.
/area provider/eks
This is a known issue affecting cluster-autoscaler as well: https://github.com/kubernetes/autoscaler/issues/4784
/triage accepted
/priority important-soon
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This is known to be caused by 2 scenarios (but there may be more):
- There is an issue with cluster-autoscaler caching the kubeconfig when it first starts. There is an issue open in cluster-autoscaler for this: https://github.com/kubernetes/autoscaler/issues/4784
- There are not enough reconciler goroutines to process all the cluster declarations, which causes the token to be refreshed only after it expires (see the sketch below).
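For the second scenario, a hedged sketch of raising the worker count via controller-runtime's `MaxConcurrentReconciles` option; the watched type, the value of 10, and the no-op reconciler are placeholders rather than CAPA's actual settings:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupController registers a reconciler with more worker goroutines, so that a
// large number of cluster objects can all be requeued and reconciled again
// before their kubeconfig tokens expire, rather than queueing behind one worker.
func setupController(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Secret{}). // placeholder object type; CAPA watches its own CRDs
		WithOptions(controller.Options{
			MaxConcurrentReconciles: 10, // hypothetical value; tune to the number of workload clusters
		}).
		Complete(reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
			// No-op reconciler, purely for illustration.
			return reconcile.Result{}, nil
		}))
}
```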
@xvzf - would you be able to clarify what you expect from this? And also would it be possible to get some extra background on your scenario?
/triage needs-information
/cc @Skarlso
FYI, I'm working with Michael on a solution for the kubeconfig refresh issue (hence the cc).
I've been out with COVID in recent days, and some holidays are coming up. After that I'll resume the fix.
/triage accepted
Hey
I'm not actively working with AWS anymore, maybe @charlie-haley knows more!
Cheers
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten