Graduate MachinePools from experimental to stable API
What would you like to be added (User Story)?
As a production user of CAPI, I'm not comfortable working with "experimental" parts of the product, but I would like to take advantage of the native VM-grouping construct at my cloud provider.
[Stefan] We should look into the following issues before graduating MachinePools:
Flaky tests with MachinePools:
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/11162
Fundamental issues / questions about the MachinePool API / contract:
- [ ] We need maintainers for the MachinePool feature (see some details below in this comment)
- [ ] We should audit the MachinePool CRD (towards v1beta2), e.g. "MachinePool has two failureDomain fields: .spec.failureDomains and .spec.template.spec.failureDomain" (see the sketch after this list)
- [ ] MachinePools currently reuse the MachineSpec struct, which is also used in MD / MS. This probably has to change.
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/7248
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10496
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10943
- Note: While it sounds like some of these issues are purely because of ClusterClass, the problem is rather that there is no clear standard for how to use MachinePools, and therefore no clear way to manage MachinePools via ClusterClass.
- The proper way to clarify the supported ways to use MachinePools is to improve the contract documentation.
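To make the failureDomain duplication concrete, here is a minimal sketch (abbreviated, not a complete manifest; field paths from the v1beta1 API):

```yaml
# Abbreviated sketch, illustrative only.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
spec:
  failureDomains: ["1", "2"]  # .spec.failureDomains: a list owned by the pool itself
  template:
    spec:                     # the reused MachineSpec struct (same as in MD / MS)
      failureDomain: "1"      # .spec.template.spec.failureDomain: a single value
```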
MachinePool Machines: (not clear to me, should be double checked by folks more familiar with MP Machines)
- [ ] Metadata propagation like for MD => MS => Machine
- [ ] from the proposal: "Allow both Cluster API-driven and externally driven interactions with individual Machines in a MachinePool"
- [ ] MachineHealthChecks for MP Machines (see the hedged sketch after this list)
- [ ] Can MachinePool Machines today be deleted via `delete machine`? e.g. "Cluster Autoscaler needs to delete individual Machines" (based on the proposal only some implementations support that, how would a user know which?)
- [ ] It's not entirely clear to me how much of the proposal is implemented
- [ ] Foreground deletion for Machines (see #11174) (not sure how this works with MachinePools today)
- [ ] In general, either feature parity between MD and MP Machines or clear documentation about the differences would be good
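For the MHC item above, a sketch of what a MachineHealthCheck targeting MP Machines might look like, assuming MP Machines carry a selectable label. The label key used here is hypothetical, and whether MHC remediation works for MP Machines at all is exactly the open question:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mp-mhc
spec:
  clusterName: example
  selector:
    matchLabels:
      # Hypothetical label key; which labels MP Machines actually carry
      # (and whether MHC remediation works for them) is the open question.
      cluster.x-k8s.io/pool-name: example-mp
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
```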
Work related to the status improvements for v1beta2:
- [ ] Clarify MachinePool contract (see Documentation updates in https://github.com/kubernetes-sigs/cluster-api/issues/11105)
- [ ] Implement v1beta2 conditions for MachinePools (sub-task "MachinePool conditions" in https://github.com/kubernetes-sigs/cluster-api/issues/11105, more details in https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md#changes-to-machinepool-resource)
(Potential) Bugs:
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10059
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/10541
- This one might just be in CAPA
Misc / TBD:
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/8258
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/6387
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/12178
- [ ] https://github.com/kubernetes-sigs/cluster-api/issues/12732
- [ ] Check all open MachinePool issues: https://github.com/kubernetes-sigs/cluster-api/issues?q=sort%3Aupdated-desc%20state%3Aopen%20label%3A%22area%2Fmachinepool%22%20is%3Aissue
Minor improvements:
- [x] https://github.com/kubernetes-sigs/cluster-api/issues/8856
As far as I can tell regarding graduation - apart from the issues listed here - the following is missing:
- [x] Move code out of exp
- [ ] Remove the feature gate
We can/should discuss which parts of the "graduation" we want to block until which issues are resolved. I'm personally fine with moving the code out of exp, but we should then probably treat removing the feature gate as the "graduation".
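To make "removing the feature gate" concrete: today the feature is wired through a gate on the core controller manager, roughly like the following excerpt (abbreviated pod template of the core controller Deployment; illustrative only). Graduation in this sense would mean the gate defaults to true or disappears entirely:

```yaml
# Abbreviated excerpt, illustrative only.
spec:
  containers:
    - name: manager
      args:
        - --feature-gates=MachinePool=true  # clusterctl sets this via the EXP_MACHINE_POOL variable
```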
In general, it would be great to see some folks signing up to be maintainers of the MachinePool feature/code. To be honest, at the moment it's mostly up to folks who are neither very familiar with MachinePools nor using the feature. This is not a great state if we're planning to maintain this feature going forward, which should be the minimum bar for graduation.
/kind feature
/cc @dtzar @Jont828
/triage accepted
@mboersma I would like to take this one up. Can we discuss the subtasks/TODOs that would help me move forward?
@mboersma AFAIK we still have some TODOs left over from @Jont828's PR to complete the implementation of the proposal and reach feature parity with MD (at least that's what I remember from some threads on the PR, the discussions we had around it, and the divide-and-conquer approach from the original PR), e.g. label propagation, ClusterClass support, but also Machines that can't be operated on, and probably more.
It would be great to use this issue to track all this work.
It'd also be good to have a feature-set comparison in the book to guide users on when they should be using MachineDeployment or MachinePool.
Sounds good. #8842 is the last PR I currently have open related to MachinePool Machines. I think it makes sense to put a list of follow up tasks here once that gets wrapped up.
@vincepri We have a section in the CAPI book about differences in features but I suppose we can expand it.
https://github.com/kubernetes-sigs/cluster-api/issues/8858 should probably get resolved as well before we do this
Apart from some notes in https://github.com/kubernetes-sigs/cluster-api/issues/8858, here are some comments about the API.
- `spec.replicas` feels weird. Its meaning depends on what the infra provider (e.g. CAPA) and cloud provider (e.g. AWS) offer. For AWS ASGs, it's min/max/desired, where min/max are in `AWSMachinePool.spec.{minSize,maxSize}` and desired is in `MachinePool.spec.replicas`. What if a cloud provider doesn't have a "desired number of replicas" setting at all? I'd rather see these all in the `<Infra>MachinePool` type.
- `spec.minReadySeconds` is documented as "Minimum number of seconds for which a newly created machine instances should be ready." – As a user, I don't understand what the effect of this field is. Ready for that interval, and then? What does "ready" mean? Which component reports it as ready? If we don't improve the description, the value could be mistaken for the infra provider specific one (e.g. AWS ASG "instance warmup" in seconds). And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?
- https://cluster-api.sigs.k8s.io/tasks/experimental-features/machine-pools should document a "contract" – which fields should infra providers use and how? Which are required/optional to implement?
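To make the first point concrete, a minimal sketch of how these knobs are split across the core and infra objects today (field names based on CAPI v1beta1 and CAPA; abbreviated and illustrative only):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: example-mp
spec:
  replicas: 3            # "desired" lives on the core object
  minReadySeconds: 0     # unclear semantics for MachinePools, see above
---
# Field names as in CAPA; other infra providers differ.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachinePool
metadata:
  name: example-mp
spec:
  minSize: 1             # min/max live on the infra object
  maxSize: 5
```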
Other than that, I think stabilizing the API is a good idea – even a stable API can be evolved, for instance by adding new fields if we really run into major missing features.
And since I wanted to understand this, I quickly searched in source code and didn't find this field used for machine pools – am I mistaken?
This should be fixed via: https://github.com/kubernetes-sigs/cluster-api/pull/9837
Sounds like there's potential to improve the godoc for MD & MachinePool, but apart from that it seems to have a clear meaning
Per the CAPI community call today as I understand it, the next steps are:
- Immediately have someone open a PR to deploy MachinePool by default (without requiring the feature flag to be enabled), and move it from alpha to beta status at the same time.
- Gather additional feedback after this change and consider setting a timeline with updated note(s) in the spec
/assign
/priority important-long term
@mboersma is there some work left for this issue or can we close it?
@fabriziopandini: The label(s) priority/important-long, priority/term cannot be applied, because the repository doesn't have them.
In response to this:
/priority important-long term
@mboersma is there some work left for this issue or can we close it?
/priority important-longterm
@fabriziopandini MachinePools are "beta" now, but I think we need to move the code out of exp somehow. I haven't thought about how to do that yet, but it's on my plate to do.
Thanks @mboersma is there any report that can be shared of the unit and e2e test coverage for MachinePools?
I'll try to find some time next week to write a summary of the current state of MPs regarding open issues and features we were planning to implement at some point but didn't until today.
Then we can discuss how they relate to the remaining steps to get MPs out of experimental. If I understand correctly, the remaining steps are moving the implementation out of "exp" and eventually dropping the feature gate.
In any case, no objection from my side against moving the code out of "exp". I see no reason why it should stay there. In my opinion, it should not be relevant to our users in which package we implement features. (I mean, which end user even knows that MachinePools live under "exp" in our repo, and why should they have to care?)
Sorry for the delay. Here's my summary and my first try at a categorization.
[Moved up to issue description]
cc @fabriziopandini @vincepri @enxebre @killianmuldoon @chrischdi @mboersma
Great work @sbueringer! 🥳 Having a list of all the open issues + all those valuable notes helps me to understand better where we are, and I guess it helps others too.
Up to @mboersma if we want to move this to an Umbrella issue for graduation.
If I can give my two cents, I consider ownership + open issues the key points to be addressed before graduation; I also agree that moving the code out of exp is sort of orthogonal to graduation given that we are already shipping MachinePool APIs together with everything else.
Adding on top of the note above, if we cannot find an answer to the lack of maintainers for this feature, we should probably start thinking about a split of MP E2E tests from other tests, so we can have a cleaner release signal.
Status update. Unfortunately no one is volunteering for this work, a few of the issues above are getting closed by the bot, but the underlying issues are not getting fixed 😓
Adding a note here for a tracking issue with changes to be done #12178
Also version skew https://github.com/kubernetes-sigs/cluster-api/issues/12732
FYI PTAL also https://github.com/kubernetes-sigs/cluster-api/issues?q=state%3Aopen%20label%3A%22area%2Fmachinepool%22
as well as the same query for closed issues (in the last ~year, many issues were closed by the bot but the problems are still there)
Added Richard's issues and this to misc. I checked, currently there are no MachinePool issues closed with "not planned" by the bot. So I think we re-opened them all.
It would be good to move the list from the comment up into the description if possible.
@mboersma Fine if I ~ just replace the current issue description with my comment with the issue list?
Fine if I ~ just replace the current issue description with my comment
@sbueringer yes, please do--thanks!
@mboersma Moved up, feel free to modify further