cluster-api icon indicating copy to clipboard operation
cluster-api copied to clipboard

📖 Update autoscaling from zero enhancement proposal with support for platform-aware autoscale from zero

Open aleskandro opened this issue 8 months ago • 15 comments

What this PR does / why we need it:

This PR updates the contract between the cluster-autoscaler Cluster API provider and the infrastructure provider's controllers that reconcile the Infrastructure Machine Template to support platform-aware autoscale from 0 in clusters consisting of nodes heterogeneous in CPU architecture and OS.

With this commit, the infrastructure providers implementing controllers to reconcile the status of their Infrastructure Machine Templates for supporting autoscale from 0 will be able to fill the status.nodeInfo stanza with additional information about the nodes.

The status.nodeInfo stanza has type corev1.NodeSystemInfo to reflect the same content, the rendered nodes' objects would store in their status field.

The cluster-autoscaler can use that information to build the node template labels kubernetes.io/arch and kubernetes.io/os if that information is present.

Suppose the pending pods that trigger the cluster autoscaler have a node selector or a requiredDuringSchedulingIgnoredDuringExecution node affinity concerning the architecture or operating system of the node where they can execute. In that case, the autoscaler will be able to filter the nodes groups options according to the architecture or operating system requested by the pod.

The users could already provide this information to the cluster autoscaler through the labels capacity annotation. However, there is no similar capability to support future labels/taints through information set by the reconcilers of the status of Infrastructure Machine Templates.

Which issue(s) this PR fixes

Related to https://github.com/kubernetes-sigs/cluster-api/issues/11961

/area provider/core

aleskandro avatar Mar 12 '25 12:03 aleskandro

Welcome @aleskandro!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Mar 12 '25 12:03 k8s-ci-robot

Hi @aleskandro. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 12 '25 12:03 k8s-ci-robot

Thanks @elmiko. To give more context about the choice of naming and structure for everyone to discuss alternatives.

I couldn't consider the architecture as part of the capacity field because the values of keys in that stanza are expected to be a Quantity. I tried to trace back the reasoning the community did about the capacity field and I noticed that the Node objects' status field has the same field - capacity - with the same content we set in the Infrastructure Machine Template objects.

The Node objects' status field also has a nodeInfo field with the same name and type that I'm now proposing for the Infrastructure Machine Template objects.

Among the alternatives I was evaluating, this seemed the most reasonable to me.

aleskandro avatar Mar 13 '25 19:03 aleskandro

makes sense to me @aleskandro , no objection from me.

elmiko avatar Mar 14 '25 18:03 elmiko

/hold

Especially for getting consensus on and resolve https://github.com/kubernetes-sigs/cluster-api/pull/11962#discussion_r1999413833

chrischdi avatar Mar 18 '25 12:03 chrischdi

@JoelSpeed If you find some time, PTAL :)

sbueringer avatar Mar 28 '25 15:03 sbueringer

/ok-to-test

sbueringer avatar Apr 07 '25 08:04 sbueringer

lgtm from my side. Pending resolution of https://github.com/kubernetes-sigs/cluster-api/pull/11962#discussion_r2018844552

(I'll be on PTO for the next few weeks, but fine for me if there is consensus with other maintainers)

sbueringer avatar Apr 09 '25 03:04 sbueringer

/label tide/merge-method-squash

aleskandro avatar Apr 10 '25 10:04 aleskandro

i am happy with the state this is in.

/lgtm

elmiko avatar Apr 30 '25 13:04 elmiko

LGTM label has been added.

Git tree hash: 1e88b9ce31402b9352a67b7a6637e488da7436ac

k8s-ci-robot avatar Apr 30 '25 13:04 k8s-ci-robot

/unhold

aleskandro avatar May 01 '25 13:05 aleskandro

Just a few nits from my side. Considering the type of changes we are introducing, I think we should consider addressing #12033 as integral part of the implementation.

Hi @fabriziopandini , thanks for your review.

I could take #12033 and resolve it in a separate PR, if this sounds good to you.

Last nit, could you kindly also add a new line in the Implementation History down below

Done

aleskandro avatar May 08 '25 15:05 aleskandro

LGTM label has been added.

Git tree hash: 5984735564fc232d398271df15a34a34157301e5

k8s-ci-robot avatar Jul 07 '25 19:07 k8s-ci-robot

Thank you!!

/lgtm

/assign @JoelSpeed @fabriziopandini For a final look

sbueringer avatar Jul 10 '25 13:07 sbueringer

LGTM label has been added.

Git tree hash: e7c57d112b796bceda6a5512d315016b99b08312

k8s-ci-robot avatar Jul 10 '25 13:07 k8s-ci-robot

/lgtm

JoelSpeed avatar Jul 11 '25 12:07 JoelSpeed

@aleskandro congrats for your first proposal merged in CAPI! Awesome work! /lgtm /approve

fabriziopandini avatar Jul 18 '25 17:07 fabriziopandini

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [fabriziopandini]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 18 '25 17:07 k8s-ci-robot