cluster-api-provider-aws ✨ Add separate eks kubeconfig secret keys for the cluster-autoscaler

trafficstars

What type of PR is this? /kind feature

What this PR does / why we need it: Cluster Autoscaler can not mount and consume the Cluster API Kubeconfig because the secret contents are refreshed every ten minutes, and no API Machinery exists to reload a kubeconfig safely.

Initially, I attempted solve this in the Cluster Autoscaler: https://github.com/kubernetes/autoscaler/issues/4784 - However meeting with SIG APIMachinery on Nov 1 2023, the SIG cautioned against this approach and advised splitting the token out from the kubeconfig, as there is existing machinery to reload a token file auth. By switching to this approach, no change in the Cluster Autoscaler is needed, users only need to update their Cluster Autoscaler configuration to use the correct secret file from their secret volume mount.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #4607

Special notes for your reviewer:

Checklist:

[X] squashed commits
[X] includes documentation
[X] adds unit tests
[ ] adds or updates e2e tests

Release note:

Add separate eks kubeconfig secret keys for the cluster-autoscaler to support refreshing the token automatically, see eks kubeconfig for more info.

Nov 21 '23 19:11 cnmcavoy

~This is ready to be reviewed, but I haven't had an opportunity to test the full setup with this change + cluster autoscaler end to end yet, so marked as a draft until I verify it.~

I have deployed this change in Indeed's clusters and in my testing, the official release of v1.27 cluster autoscaler was able to refresh the bearer token file in EKS clusters.

Nov 21 '23 19:11 cnmcavoy

/hold cancel

Jan 02 '24 16:01 vincepri

/lgtm

Apr 18 '24 11:04 mloiseleur

/test pull-cluster-api-provider-aws-e2e /test pull-cluster-api-provider-aws-e2e-eks

Apr 19 '24 05:04 richardcase

It would be good to get cluster autoscaler added to our e2e tests. Lets create an issue to follow up on this.

From my side this look good:

/approve

When the e2e passes we can unhold and merge:

/hold

Apr 19 '24 05:04 richardcase

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [richardcase]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Apr 19 '24 05:04 k8s-ci-robot

/retest-required

Apr 19 '24 18:04 cnmcavoy

New changes are detected. LGTM label has been removed.

Only a rebase.

Apr 19 '24 18:04 cnmcavoy

/test pull-cluster-api-provider-aws-e2e /test pull-cluster-api-provider-aws-e2e-eks /lgtm

Apr 19 '24 18:04 mloiseleur

You will have to fix the test first. /lgtm cancel

Apr 20 '24 07:04 mloiseleur

/test pull-cluster-api-provider-aws-e2e-eks

Apr 22 '24 17:04 cnmcavoy

@mloiseleur seems like the test failure is a flake? I pushed a commit to simply expose the error being suppressed in cloudformation and now it passed. My new commit should not have fixed any tests.

Apr 22 '24 20:04 cnmcavoy

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jul 21 '24 20:07 k8s-triage-robot

/lgtm

Jul 22 '24 06:07 mloiseleur

/remove-lifecycle stale

Jul 22 '24 06:07 mloiseleur

/hold cancel

Jul 22 '24 06:07 mloiseleur

@cnmcavoy I updated to v2.6.1 this morning and I saw in the logs an error about ownership of the kubeconfig secret in the log for all control plane.

E0807 08:02:32.550249       1 controller.go:329] "Reconciler error" 
err="failed to reconcile control plane for AWSManagedControlPlane flux-system/xxxxxxxxx-cp: 
 failed reconciling kubeconfig: updating kubeconfig secret: 
  EKS kubeconfig flux-system/xxxxxxxxx-kubeconfig missing expected AWSManagedControlPlane ownership" 
controller="awsmanagedcontrolplane"
controllerGroup="controlplane.cluster.x-k8s.io"
controllerKind="AWSManagedControlPlane" 
AWSManagedControlPlane="flux-system/xxxxxxxxx-cp"
namespace="flux-system"
name="xxxxxxxxx-cp"
reconcileID="c634fd01-5c99-4947-9ff7-3297fcaff97c"

Did you encounter this issue when you tested on your side ?

EDIT: On my side, I fixed it by deleting the secret. The new secret was created with expected ownership and it get back on its feet.

Aug 07 '24 08:08 mloiseleur

@cnmcavoy I updated to v2.6.1 this morning and I see in the logs an error about ownership of the kubeconfig secret in the log for all control plane.

The secret was created with an owner reference in the previous releases as well. Previously the shape of the secret was assumed, the new behavior that you encountered is that the controller checks if it owns the secret before operating on it. Deleting and recreating was also going to be my suggestion, but the owner reference should already exist unless the secret was not created by the controller. My suspicion is flux or some other system created the secret and so CAPA didn't set the owner reference.

Aug 07 '24 16:08 cnmcavoy

cluster-api-provider-aws cluster-api-provider-aws copied to clipboard

✨ Add separate eks kubeconfig secret keys for the cluster-autoscaler

cluster-api-provider-aws
cluster-api-provider-aws copied to clipboard