Graduate MachinePools from experimental to stable API
What would you like to be added (User Story)?
As a production user of CAPI, I'm not comfortable working with "experimental" parts of the product, but I would like to take advantage of the native VM-grouping construct at my cloud provider.
[Stefan] We should look into the following issues before graduating MachinePools:
Flaky tests with MachinePools:
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/11162
Fundamental issues / questions about the MachinePool API / contract:
- [ ] We need maintainers for the MachinePool feature (see some details below in this comment)
- [ ] We should audit the MachinePool CRD (towards v1beta2), e.g. "MachinePool has two failureDomain fields: .spec.failureDomains and .spec.template.spec.failureDomain" (see the sketch after this list)
- [ ] MachinePools currently reuse the MachineSpec struct, which is also used in MD / MS. This probably has to change.
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/7248
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10496
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10943
- Note: While it sounds like some of these issues are purely because of ClusterClass, the problem is rather that there is no clear standard for how to use MachinePools, and therefore no clear way to manage MachinePools via ClusterClass.
- The proper way to clarify the supported ways to use MachinePools is to improve the contract documentation.
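To make the failureDomain duplication concrete, here is a minimal sketch (abbreviated, not a complete manifest; field paths from the v1beta1 API):

```yaml
# Abbreviated sketch, illustrative only.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
spec:
  failureDomains: ["1", "2"]  # .spec.failureDomains: a list owned by the pool itself
  template:
    spec:                     # the reused MachineSpec struct (same as in MD / MS)
      failureDomain: "1"      # .spec.template.spec.failureDomain: a single value
```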
MachinePool Machines: (not clear to me, should be double checked by folks more familiar with MP Machines)
- [ ] Metadata propagation like for MD => MS => Machine
- [ ] from the proposal: "Allow both Cluster API-driven and externally driven interactions with individual Machines in a MachinePool"
- [ ] MachineHealthChecks for MP Machines (see the hedged sketch after this list)
- [ ] Can MachinePool Machines today be deleted via `delete machine`? e.g. "Cluster Autoscaler needs to delete individual Machines" (based on the proposal only some implementations support that, how would a user know which?)
- [ ] It's not entirely clear to me how much of the proposal is implemented
- [ ] Foreground deletion for Machines (see #11174) (not sure how this works with MachinePools today)
- [ ] In general, either feature parity between MD and MP Machines or clear documentation about the differences would be good
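For the MHC item above, a sketch of what a MachineHealthCheck targeting MP Machines might look like, assuming MP Machines carry a selectable label. The label key used here is hypothetical, and whether MHC remediation works for MP Machines at all is exactly the open question:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mp-mhc
spec:
  clusterName: example
  selector:
    matchLabels:
      # Hypothetical label key; which labels MP Machines actually carry
      # (and whether MHC remediation works for them) is the open question.
      cluster.x-k8s.io/pool-name: example-mp
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
```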
Work related to the status improvements for v1beta2:
- [ ] Clarify MachinePool contract (see Documentation updates in https://github.com/kubernetes-sigs/cluster-api/issues/11105)
- [ ] Implement v1beta2 conditions for MachinePools (sub-task "MachinePool conditions" in https://github.com/kubernetes-sigs/cluster-api/issues/11105, more details in https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md#changes-to-machinepool-resource)
(Potential) Bugs:
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10059
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10541
- This one might just be in CAPA
Misc / TBD:
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/8258
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/6387
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/12178
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/12732
- [ ] Check all open MachinePool issues: https://github.com/kubernetes-sigs/cluster-api/issues?q=sort%3Aupdated-desc%20state%3Aopen%20label%3A%22area%2Fmachinepool%22%20is%3Aissue
Minor improvements:
- [x] https://github.com/kubernetes-sigs/cluster-api/issues/8856
As far as I can tell regarding graduation - apart from the issues listed here - the following is missing:
- [x] Move code out of exp
- [ ] Remove the feature gate
We can/should discuss which parts of the "graduation" we want to block until which issues are resolved. I'm personally fine with moving the code out of exp, but we should then probably treat removing the feature gate as the "graduation".
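To make "removing the feature gate" concrete: today the feature is wired through a gate on the core controller manager, roughly like the following excerpt (abbreviated pod template of the core controller Deployment; illustrative only). Graduation in this sense would mean the gate defaults to true or disappears entirely:

```yaml
# Abbreviated excerpt, illustrative only.
spec:
  containers:
    - name: manager
      args:
        - --feature-gates=MachinePool=true  # clusterctl sets this via the EXP_MACHINE_POOL variable
```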
In general, it would be great to see some folks signing up to be maintainers of the MachinePool feature/code. To be honest, at the moment it's mostly up to folks who are neither very familiar with MachinePools nor using the feature. This is not a great state if we're planning to maintain this feature going forward, which should be the minimum bar for graduation.
/kind feature
/cc @dtzar @Jont828
/triage accepted
@mboersma I would like to take this one up. Can we discuss the subtasks/TODOs that would help me move forward?
@mboersma AFAIK we still have some TODOs left over from @Jont828's PR to complete the implementation of the proposal and reach feature parity with MD (at least that's what I remember from some threads on the PR, the discussions we had around it, and the divide-and-conquer approach from the original PR), e.g. label propagation, ClusterClass support, but also Machines that can't be operated on, and probably more.
It would be great to use this issue to track all this work.
It'd also be good to have a feature-set comparison in the book to guide users on when they should be using MachineDeployment or MachinePool.
Sounds good. #8842 is the last PR I currently have open related to MachinePool Machines. I think it makes sense to put a list of follow up tasks here once that gets wrapped up.
@vincepri We have a section in the CAPI book about differences in features but I suppose we can expand it.
https://github.com/kubernetes-sigs/cluster-api/issues/8858 should probably get resolved as well before we do this
Apart from some notes in https://github.com/kubernetes-sigs/cluster-api/issues/8858, here are some comments about the API.
- `spec.replicas` feels weird. Its meaning depends on what the infra provider (e.g. CAPA) and cloud provider (e.g. AWS) offer. For AWS ASGs, it's min/max/desired, where min/max are in `AWSMachinePool.spec.{minSize,maxSize}` and desired is in `MachinePool.spec.replicas`. What if a cloud provider doesn't have a "desired number of replicas" setting at all? I'd rather see these all in the `<Infra>MachinePool` type.
- `spec.minReadySeconds` is documented as "Minimum number of seconds for which a newly created machine instances should be ready." – As a user, I don't understand what the effect of this field is. Ready for that interval, and then? What does "ready" mean? Which component reports it as ready? If we don't improve the description, the value could be mistaken for the infra provider specific one (e.g. AWS ASG "instance warmup" in seconds). And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?
- https://cluster-api.sigs.k8s.io/tasks/experimental-features/machine-pools should document a "contract" – which fields should infra providers use and how? Which are required/optional to implement?
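To make the first point concrete, a minimal sketch of how these knobs are split across the core and infra objects today (field names based on CAPI v1beta1 and CAPA; abbreviated and illustrative only):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: example-mp
spec:
  replicas: 3            # "desired" lives on the core object
  minReadySeconds: 0     # unclear semantics for MachinePools, see above
---
# Field names as in CAPA; other infra providers differ.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachinePool
metadata:
  name: example-mp
spec:
  minSize: 1             # min/max live on the infra object
  maxSize: 5
```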
Other than that, I think stabilizing the API is a good idea – even a stable API can be evolved, for instance by adding new fields if we really run into major missing features.
And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?
This should be fixed via: https://github.com/kubernetes-sigs/cluster-api/pull/9837
Sounds like there's potential to improve the godoc for MD & MachinePool, but apart from that it seems to have a clear meaning
Per the CAPI community call today as I understand it, the next steps are:
- Immediately have someone open a PR to deploy MachinePool by default (without requiring the feature flag to be enabled), and move it from alpha to beta status at the same time.
- Gather additional feedback after this change and consider setting a timeline with updated note(s) in the spec
/assign
/priority important-long term
@mboersma is there some work left for this issue or can we close it?
@fabriziopandini: The label(s) priority/important-long, priority/term cannot be applied, because the repository doesn't have them.
In response to this:
/priority important-long term
@mboersma is there some work left for this issue or can we close it?
/priority important-longterm
@fabriziopandini MachinePools are "beta" now, but I think we need to move the code out of exp somehow. I haven't thought about how to do that yet, but it's on my plate to do.
Thanks @mboersma is there any report that can be shared of the unit and e2e test coverage for MachinePools?
I'll try to find some time next week to write a summary of the current state of MPs regarding open issues and features we were planning to implement at some point but didn't until today.
Then we can discuss how they relate to the remaining steps to get MPs out of experimental. If I understand correctly, the remaining steps are moving the implementation out of "exp" and eventually dropping the feature gate.
In any case, no objection from my side against moving the code out of "exp". I see no reason why it should stay there. In my opinion, it should not be relevant to our users in which package we implement features. (I mean, which end user even knows that MachinePools live under "exp" in our repo, and why should they have to care?)
Sorry for the delay. Here's my summary and my first try at a categorization.
[Moved up to issue description]
cc @fabriziopandini @vincepri @enxebre @killianmuldoon @chrischdi @mboersma
Great work @sbueringer! 🥳 Having a list of all the open issues + all those valuable notes helps me to understand better where we are, and I guess it helps others too.
Up to @mboersma if we want to move this to an Umbrella issue for graduation.
If I can give my two cents, I consider ownership + open issues the key points to be addressed before graduation; I also agree that moving the code out of exp is sort of orthogonal to graduation given that we are already shipping MachinePool APIs together with everything else.
Adding on top of the note above, if we cannot find an answer to the lack of maintainers for this feature, we should probably start thinking about a split of MP E2E tests from other tests, so we can have a cleaner release signal.
Status update. Unfortunately no one is volunteering for this work, a few of the issues above are getting closed by the bot, but the underlying issues are not getting fixed 😓
Adding a note here for a tracking issue with changes to be done #12178
Also version skew https://github.com/kubernetes-sigs/cluster-api/issues/12732
FYI PTAL also https://github.com/kubernetes-sigs/cluster-api/issues?q=state%3Aopen%20label%3A%22area%2Fmachinepool%22
as well as the same query for closed issues (in the last ~year, many issues were closed by the bot but the problems are still there)
Added Richard's issues and this to misc. I checked, currently there are no MachinePool issues closed with "not planned" by the bot. So I think we re-opened them all.
It would be good to move the list from the comment up into the description if possible.
@mboersma Fine if I ~ just replace the current issue description with my comment with the issue list?
Fine if I ~ just replace the current issue description with my comment
@sbueringer yes, please do--thanks!
@mboersma Moved up, feel free to modify further