
Graduate MachinePools from experimental to stable API

Open mboersma opened this issue 2 years ago • 28 comments

What would you like to be added (User Story)?

As a production user of CAPI, I'm not comfortable working with "experimental" parts of the product, but I would like to take advantage of the native VM-grouping construct at my cloud provider.

[Stefan] We should look into the following issues before graduating MachinePools

Flaky tests with MachinePools:

  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/11162

Fundamental issues / questions about the MachinePool API / contract:

  • [ ] We need maintainers for the MachinePool feature (see some details below in this comment)
  • [ ] We should audit the MachinePool CRD (towards v1beta2), e.g. "MachinePool has two failureDomain fields: .spec.failureDomains and .spec.template.spec.failureDomain"
    • [ ] MachinePools are currently reusing the MachineSpec struct which is also used in MD / MS. This probably has to change.
  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/7248
  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10496
  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10943
  • Note: While it sounds like some of these issues are purely because of ClusterClass, the problem is rather that there is no clear standard on how to use MachinePools, so then there is no clear way on how to manage MachinePools via ClusterClass.
    • The proper way to clarify what are the supported ways to use MachinePools is to improve the contract documentation
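
For illustration, the failureDomain duplication called out above looks like this in a single manifest (a sketch only: values are made up, and required fields such as bootstrap and infrastructureRef are elided):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: example-mp
spec:
  clusterName: example
  replicas: 3
  # Field one: a list of failure domains on the MachinePool itself.
  failureDomains:
    - "1"
    - "2"
  template:
    spec:
      clusterName: example
      # Field two: the singular failureDomain inherited from the shared
      # MachineSpec struct (also used by MachineDeployment/MachineSet).
      failureDomain: "1"
```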

MachinePool Machines: (not clear to me; should be double-checked by folks more familiar with MP Machines)

  • [ ] Metadata propagation like for MD => MS => Machine
  • [ ] from the proposal: "Allow both Cluster API-driven and externally driven interactions with individual Machines in a MachinePool"
    • [ ] MachineHealthChecks for MP Machines
    • [ ] Can MachinePool Machines today be deleted via delete machine? e.g. "Cluster Autoscaler needs to delete individual Machines" (based on the proposal only some implementations support that, how would a user know which?)
  • [ ] It's not entirely clear to me how much of the proposal is implemented
  • [ ] Foreground deletion for Machines (see #11174) (not sure how this works with MachinePools today)
  • [ ] In general either feature parity between MD and MP Machines or a clear documentation about the differences would be good

Work related to the status improvements for v1beta2:

  • [ ] Clarify MachinePool contract (see Documentation updates in https://github.com/kubernetes-sigs/cluster-api/issues/11105)
  • [ ] Implement v1beta2 conditions for MachinePools (sub-task "MachinePool conditions" in https://github.com/kubernetes-sigs/cluster-api/issues/11105, more details in https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md#changes-to-machinepool-resource)

(Potential) Bugs:

  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10059
  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10541
    • This one might just be in CAPA

Misc / TBD:

  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/8258
  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/6387
  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/12178
  • [ ] https://github.com/kubernetes-sigs/cluster-api/issues/12732
  • [ ] Check all open MachinePool issues: https://github.com/kubernetes-sigs/cluster-api/issues?q=sort%3Aupdated-desc%20state%3Aopen%20label%3A%22area%2Fmachinepool%22%20is%3Aissue

Minor improvements:

  • [x] https://github.com/kubernetes-sigs/cluster-api/issues/8856

As far as I can tell regarding graduation - apart from the issues listed here - the following is missing:

  • [x] Move code out of exp
  • [ ] Remove the feature gate

We can/should discuss which parts of the "graduation" we want to block until which issues are resolved. I'm personally fine with moving the code out of exp, but we should then probably treat removing the feature gate as the "graduation".

In general, it would be great to see some folks signing up to be maintainers of the MachinePool feature/code. To be honest, at the moment it's mostly up to folks who are neither very familiar with MachinePools nor using the feature themselves. This is not a great state if we're planning to maintain this feature going forward, which should be the minimum bar for graduation.

/kind feature

/cc @dtzar @Jont828

mboersma avatar Jul 17 '23 17:07 mboersma

/triage accepted

killianmuldoon avatar Jul 17 '23 17:07 killianmuldoon

@mboersma I would like to take this one up. Can we discuss the subtasks/TODOs so I can move forward?

jayesh-srivastava avatar Jul 18 '23 10:07 jayesh-srivastava

@mboersma AFAIK we still have some TODOs left from @Jont828's PR to complete the implementation of the proposal and get feature parity with MD (at least that's what I remember from some threads on the PR, the discussions around it, and the divide-and-conquer split from the original PR), e.g. label propagation, ClusterClass support, but also Machines that can't be operated on, and probably more.

It will be great to use this issue to track all this work

fabriziopandini avatar Jul 18 '23 15:07 fabriziopandini

It'd also be good to have a feature-set comparison in the book to guide users on when they should use MachineDeployment vs. MachinePool.

vincepri avatar Jul 18 '23 16:07 vincepri

Sounds good. #8842 is the last PR I currently have open related to MachinePool Machines. I think it makes sense to put a list of follow up tasks here once that gets wrapped up.

@vincepri We have a section in the CAPI book about differences in features but I suppose we can expand it.

Jont828 avatar Jul 24 '23 16:07 Jont828

https://github.com/kubernetes-sigs/cluster-api/issues/8858 should probably get resolved as well before we do this

CecileRobertMichon avatar Oct 23 '23 17:10 CecileRobertMichon

Apart from some notes in https://github.com/kubernetes-sigs/cluster-api/issues/8858, here are some comments about the API.

  • spec.replicas feels weird. Its meaning depends on what the infra provider (e.g. CAPA) and cloud provider (e.g. AWS) offer. For AWS ASGs, it's min/max/desired, where min/max are in AWSMachinePool.spec.{minSize,maxSize} and desired is in MachinePool.spec.replicas. What if a cloud provider doesn't have a "desired number of replicas" setting at all? I'd rather see all of these in the <Infra>MachinePool type.
  • spec.minReadySeconds is documented as "Minimum number of seconds for which a newly created machine instances should be ready." – As a user, I don't understand what the effect of this field is. Ready for that interval, and then? What does "ready" mean? Which component reports it as ready? If we don't improve the description, the value could be mistaken for the infra provider specific one (e.g. AWS ASG "instance warmup" in seconds). And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?
  • https://cluster-api.sigs.k8s.io/tasks/experimental-features/machine-pools should document a "contract" – which fields should infra providers use and how? Which are required/optional to implement?
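
As a sketch of the first point, the ASG sizing fields are split across two objects today (API versions and values are illustrative; the CAPA field names are the ones quoted above):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: example-mp
spec:
  clusterName: example
  replicas: 3          # "desired" capacity lives in core CAPI
  # ... template etc. elided
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachinePool
metadata:
  name: example-mp
spec:
  minSize: 1           # min/max live in the infra provider type
  maxSize: 5
```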

Other than that, I think stabilizing the API is a good idea – even a stable API can be evolved, for instance by adding new fields if we really run into major missing features.

AndiDog avatar Jan 23 '24 20:01 AndiDog

And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?

This should be fixed via: https://github.com/kubernetes-sigs/cluster-api/pull/9837

Sounds like there's potential to improve the godoc for MD & MachinePool, but apart from that it seems to have a clear meaning

sbueringer avatar Jan 24 '24 08:01 sbueringer

Per the CAPI community call today as I understand it, the next steps are:

  1. Immediately have someone open a PR so that MachinePool is deployed by default (without requiring the feature flag to be enabled), and promote it from alpha to beta status at the same time.
  2. Gather additional feedback after this change and consider setting a timeline, with updated note(s) in the spec.

dtzar avatar Feb 07 '24 18:02 dtzar

/assign

mboersma avatar Mar 11 '24 16:03 mboersma

/priority important-long term

@mboersma is there some work left for this issue or can we close it

fabriziopandini avatar Apr 11 '24 18:04 fabriziopandini

@fabriziopandini: The label(s) priority/important-long, priority/term cannot be applied, because the repository doesn't have them.

In response to this:

/priority important-long term

@mboersma is there some work left for this issue or can we close it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 11 '24 18:04 k8s-ci-robot

/priority important-longterm

fabriziopandini avatar Apr 11 '24 18:04 fabriziopandini

@fabriziopandini MachinePools are "beta" now, but I think we need to move the code out of exp somehow. I haven't thought about how to do that yet, but it's on my plate to do.

mboersma avatar Apr 17 '24 15:04 mboersma

Thanks @mboersma. Is there any report on the unit and e2e test coverage for MachinePools that can be shared?

enxebre avatar Jun 19 '24 12:06 enxebre

I'll try to find some time next week to write a summary of the current state of MPs regarding open issues and features we were planning to implement at some point but still haven't.

Then we can discuss how they relate to the remaining steps to get MPs out of experimental. If I understand correctly, the remaining steps are moving the implementation out of "exp" and eventually dropping the feature gate.

In any case, I have no objection to moving the code out of "exp". I see no reason why it should stay there. In my opinion, it should not be relevant to our users in which package we implement features. (I mean, which end user even knows that MachinePools live below "exp" in our repo, and why should they have to care?)

sbueringer avatar Jun 20 '24 08:06 sbueringer

Sorry for the delay. Here's my summary and my first try at a categorization.

[Moved up to issue description]

cc @fabriziopandini @vincepri @enxebre @killianmuldoon @chrischdi @mboersma

sbueringer avatar Aug 01 '24 14:08 sbueringer

Great work @sbueringer! 🥳 Having a list of all the open issues + all those valuable notes helps me to understand better where we are, and I guess it helps others too.

Up to @mboersma if we want to move this to an Umbrella issue for graduation.

If I can give my two cents, I consider ownership + open issues the key points to be addressed before graduation; I also agree that moving the code out of exp is sort of orthogonal to graduation given that we are already shipping MachinePool APIs together with everything else.

fabriziopandini avatar Aug 02 '24 12:08 fabriziopandini

Adding on top of the note above: if we cannot find an answer to the lack of maintainers for this feature, we should probably start thinking about splitting the MP e2e tests from the other tests, so we can have a cleaner release signal.

fabriziopandini avatar Nov 11 '24 09:11 fabriziopandini

Status update. Unfortunately no one is volunteering for this work; a few of the issues above are getting closed by the bot, but the underlying problems are not getting fixed 😓

fabriziopandini avatar Feb 05 '25 14:02 fabriziopandini

Adding a note here for a tracking issue with changes to be done #12178

richardcase avatar Aug 22 '25 12:08 richardcase

Also version skew https://github.com/kubernetes-sigs/cluster-api/issues/12732

richardcase avatar Sep 10 '25 13:09 richardcase

FYI PTAL also https://github.com/kubernetes-sigs/cluster-api/issues?q=state%3Aopen%20label%3A%22area%2Fmachinepool%22

as well as the same query for closed issues (in the last ~year, many issues were closed by the bot but the problems are still there)

fabriziopandini avatar Sep 17 '25 12:09 fabriziopandini

FYI PTAL also kubernetes-sigs/cluster-api/issues (state:open label:"area/machinepool")

as well as the same query for closed issues (in the last ~year, many issues were closed by the bot but the problems are still there)

Added Richard's issues and this one to Misc. I checked: currently there are no MachinePool issues closed as "not planned" by the bot, so I think we re-opened them all.

sbueringer avatar Sep 17 '25 16:09 sbueringer

It would be good to move the list from the comment up into the description if possible.

richardcase avatar Sep 18 '25 09:09 richardcase

@mboersma Fine if I ~ just replace the current issue description with my comment with the issue list?

sbueringer avatar Sep 18 '25 09:09 sbueringer

Fine if I ~ just replace the current issue description with my comment

@sbueringer yes, please do--thanks!

mboersma avatar Sep 19 '25 13:09 mboersma

@mboersma Moved up, feel free to modify further

sbueringer avatar Sep 22 '25 14:09 sbueringer