cluster-api-provider-aws
✨ Add separate eks kubeconfig secret keys for the cluster-autoscaler
What type of PR is this? /kind feature
What this PR does / why we need it: Cluster Autoscaler cannot mount and consume the Cluster API kubeconfig because the secret contents are refreshed every ten minutes, and no API machinery exists to safely reload a kubeconfig.
Initially, I attempted to solve this in the Cluster Autoscaler: https://github.com/kubernetes/autoscaler/issues/4784 - However, after meeting with SIG API Machinery on Nov 1, 2023, the SIG cautioned against that approach and advised splitting the token out from the kubeconfig, since existing machinery can reload token-file auth. With this approach, no change to the Cluster Autoscaler is needed; users only need to update their Cluster Autoscaler configuration to use the correct file from their secret volume mount.
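For context, here is a minimal sketch (not taken from this PR) of the client-go machinery being relied on: when a `rest.Config` sets `BearerTokenFile` instead of an inline token, client-go re-reads the token from disk, so a rotated token in a mounted secret is picked up without restarting the pod. The endpoint and file paths below are hypothetical examples.

```go
// Sketch only: a clientset whose bearer token is read from a file.
// client-go periodically re-reads BearerTokenFile, so rotating the token
// in the mounted secret does not require restarting the consumer.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg := &rest.Config{
		Host: "https://example-eks-endpoint.amazonaws.com", // hypothetical cluster endpoint
		TLSClientConfig: rest.TLSClientConfig{
			CAFile: "/etc/kubernetes/secrets/ca.crt", // hypothetical mounted CA bundle
		},
		// BearerTokenFile (rather than BearerToken) is what allows client-go
		// to reload the credential from disk.
		BearerTokenFile: "/etc/kubernetes/secrets/token", // hypothetical mounted token key
	}

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("created clientset: %T\n", clientset)
}
```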
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4607
Special notes for your reviewer:
Checklist:
- [X] squashed commits
- [X] includes documentation
- [X] adds unit tests
- [ ] adds or updates e2e tests
Release note:
Add separate EKS kubeconfig secret keys for the cluster-autoscaler to support refreshing the token automatically; see the EKS kubeconfig documentation for more info.
~This is ready to be reviewed, but I haven't had an opportunity to test the full setup with this change + cluster autoscaler end to end yet, so marked as a draft until I verify it.~
I have deployed this change in Indeed's clusters, and in my testing the official v1.27 release of the Cluster Autoscaler was able to refresh the bearer token file in EKS clusters.
/hold cancel
/lgtm
/test pull-cluster-api-provider-aws-e2e /test pull-cluster-api-provider-aws-e2e-eks
It would be good to get the cluster autoscaler added to our e2e tests. Let's create an issue to follow up on this.
From my side this looks good:
/approve
When the e2e passes we can unhold and merge:
/hold
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: richardcase
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [richardcase]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest-required
New changes are detected. LGTM label has been removed.
Only a rebase.
/test pull-cluster-api-provider-aws-e2e /test pull-cluster-api-provider-aws-e2e-eks /lgtm
You will have to fix the test first. /lgtm cancel
/test pull-cluster-api-provider-aws-e2e-eks
@mloiseleur seems like the test failure is a flake? I pushed a commit that simply exposes the error being suppressed in CloudFormation, and now it passed. My new commit should not have fixed any tests.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lgtm
/remove-lifecycle stale
/hold cancel
@cnmcavoy I updated to v2.6.1 this morning and I saw an error about ownership of the kubeconfig secret in the logs for every control plane.
```
E0807 08:02:32.550249 1 controller.go:329] "Reconciler error"
err="failed to reconcile control plane for AWSManagedControlPlane flux-system/xxxxxxxxx-cp:
failed reconciling kubeconfig: updating kubeconfig secret:
EKS kubeconfig flux-system/xxxxxxxxx-kubeconfig missing expected AWSManagedControlPlane ownership"
controller="awsmanagedcontrolplane"
controllerGroup="controlplane.cluster.x-k8s.io"
controllerKind="AWSManagedControlPlane"
AWSManagedControlPlane="flux-system/xxxxxxxxx-cp"
namespace="flux-system"
name="xxxxxxxxx-cp"
reconcileID="c634fd01-5c99-4947-9ff7-3297fcaff97c"
```
Did you encounter this issue when you tested on your side?
EDIT: On my side, I fixed it by deleting the secret. The new secret was created with the expected ownership and everything got back on its feet.
> @cnmcavoy I updated to v2.6.1 this morning and I saw an error about ownership of the kubeconfig secret in the logs for every control plane.
The secret was created with an owner reference in previous releases as well. Previously the shape of the secret was assumed; the new behavior you encountered is that the controller now checks whether it owns the secret before operating on it. Deleting and recreating it was also going to be my suggestion, but the owner reference should already exist unless the secret was not created by the controller. My suspicion is that Flux or some other system created the secret, so CAPA didn't set the owner reference.
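For illustration, here is a minimal sketch, not the actual CAPA implementation, of the kind of ownership check described above: before mutating the kubeconfig secret, the controller can inspect the secret's owner references for the expected kind. All names, UIDs, and API versions below are hypothetical.

```go
// Sketch only: refuse to update a kubeconfig Secret that does not carry an
// owner reference of the expected kind (e.g. AWSManagedControlPlane).
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasOwnerOfKind reports whether the secret lists an owner reference of the
// given kind.
func hasOwnerOfKind(secret *corev1.Secret, kind string) bool {
	for _, ref := range secret.GetOwnerReferences() {
		if ref.Kind == kind {
			return true
		}
	}
	return false
}

func main() {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "example-kubeconfig", // hypothetical secret name
			Namespace: "flux-system",
			OwnerReferences: []metav1.OwnerReference{
				{
					APIVersion: "controlplane.cluster.x-k8s.io/v1beta2", // hypothetical
					Kind:       "AWSManagedControlPlane",
					Name:       "example-cp",
					UID:        "1234", // hypothetical
				},
			},
		},
	}

	if !hasOwnerOfKind(secret, "AWSManagedControlPlane") {
		fmt.Println("refusing to update secret: missing expected ownership")
		return
	}
	fmt.Println("ownership verified; safe to update secret")
}
```

If the check fails (for example, because the secret was pre-created by another system), surfacing an error like the one quoted above and letting the user delete the secret so the controller can recreate it with the expected owner reference is the recovery path used in this thread.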