cluster-api-provider-aws
allow annotation on MachinePool to set externally managed
What type of PR is this? /kind feature
What this PR does / why we need it:
As a cluster operator, I can mark a MachinePool with the annotation cluster.x-k8s.io/replicas-managed-by-autoscaler to indicate that it should not enforce the MachinePool.replicas field but instead reconcile it to match the ASG's desiredCapacity field. This allows me to use an external autoscaler without it conflicting with the MachinePool reconciliation loop.
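For illustration only, here is a minimal sketch of how a reconciler could detect the opt-in annotation; the constant and helper names below are hypothetical and are not taken from this PR.

```go
package controllers

import (
	expclusterv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// ReplicasManagedByAutoscalerAnnotation is the opt-in annotation discussed in
// this PR (the constant name here is hypothetical).
const ReplicasManagedByAutoscalerAnnotation = "cluster.x-k8s.io/replicas-managed-by-autoscaler"

// replicasExternallyManaged reports whether the MachinePool has opted in to
// having its replica count driven by the ASG's desiredCapacity instead of by
// MachinePool.spec.replicas.
func replicasExternallyManaged(mp *expclusterv1.MachinePool) bool {
	if mp == nil {
		return false
	}
	_, ok := mp.GetAnnotations()[ReplicasManagedByAutoscalerAnnotation]
	return ok
}
```

When the annotation is present, the idea is that the reconciler copies the ASG's desiredCapacity back into the replica count instead of enforcing the value from the spec.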
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2022
Special notes for your reviewer: The annotation is the same as in kubernetes-sigs/cluster-api-provider-azure#2444.
Unfortunately, I have not yet managed to make the added test in awsmachinepool_controller_test.go pass, which is why the PR is marked as WIP/Draft. The test fails with:
I0822 10:22:31.013336 20405 awsmachinepool_controller.go:256] "msg"="Setting MachinePool replicas to ASG DesiredCapacity" "external"=1 "local"=0
awsmachinepool_controller_test.go:313:
Expected
    <errors.aggregate | len:1, cap:1>: [
        <*meta.NoKindMatchError | 0x14000263ec0>{
            GroupKind: {
                Group: "cluster.x-k8s.io",
                Kind: "MachinePool",
            },
            SearchedVersions: ["v1beta1"],
        },
    ]
to be nil
I believe the issue is that the API client does not know about the MachinePool CRD. I added a fake MachinePool CRD just like the one for Machines/Clusters, but it did not resolve the issue. It would be great if somebody could have a look at it.
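For context, a NoKindMatchError generally means the client's RESTMapper has no mapping for the requested GroupKind, e.g. because the corresponding CRD is not installed in the test environment or the type is not registered in the scheme. The following is only a hedged sketch of a test setup that avoids this, assuming controller-runtime's envtest; the CRD directory paths are hypothetical and this is not the fix that was later merged.

```go
package controllers_test

import (
	"path/filepath"

	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expclusterv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// newTestEnv sketches an envtest setup in which the MachinePool Kind can be
// resolved: the CAPI exp types are added to the scheme and the CRDs
// (including MachinePool) are installed from disk.
func newTestEnv() (*envtest.Environment, *runtime.Scheme) {
	scheme := runtime.NewScheme()
	_ = clusterv1.AddToScheme(scheme)
	_ = expclusterv1.AddToScheme(scheme)

	return &envtest.Environment{
		// Hypothetical paths: wherever the repo keeps its own CRDs plus
		// vendored CAPI CRDs for tests.
		CRDDirectoryPaths: []string{
			filepath.Join("..", "..", "config", "crd", "bases"),
			filepath.Join("..", "..", "test", "crds"),
		},
		ErrorIfCRDPathMissing: true,
	}, scheme
}
```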
Checklist:
- [x] squashed commits
- [ ] includes documentation
- [x] adds unit tests
- [ ] adds or updates e2e tests
@mweibel: This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @mweibel. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
/ok-to-test
/test ?
@sedefsavas: The following commands are available to trigger required jobs:
- /test pull-cluster-api-provider-aws-build
- /test pull-cluster-api-provider-aws-test
- /test pull-cluster-api-provider-aws-verify
The following commands are available to trigger optional jobs:
- /test pull-cluster-api-provider-aws-apidiff-main
- /test pull-cluster-api-provider-aws-e2e
- /test pull-cluster-api-provider-aws-e2e-blocking
- /test pull-cluster-api-provider-aws-e2e-clusterclass
- /test pull-cluster-api-provider-aws-e2e-conformance
- /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
- /test pull-cluster-api-provider-aws-e2e-eks
Use /test all to run the following jobs that were automatically triggered:
- pull-cluster-api-provider-aws-apidiff-main
- pull-cluster-api-provider-aws-build
- pull-cluster-api-provider-aws-test
- pull-cluster-api-provider-aws-verify
In response to this:
/test ?
There is a related PR to this in CAPI: https://github.com/kubernetes-sigs/cluster-api/pull/7107
@sedefsavas yes, I saw that one, thanks. Once that is merged and released, this code could reuse the const from CAPI.
I still haven't figured out the issue with the failing test. @richardcase had a look at it IIRC, but I'm not sure what the state is.
/test ?
/retest
Every now and then, when I find some time, I try to figure out why the MachinePool Kind is not found in the tests. I'd love some help on this topic :)
I had a quick look at this and I also couldn't work out why it was failing.
@mweibel - I've got to the bottom of the NoKindMatchError you are seeing. I am going to make a small PR later today. Once that merges you will need to rebase.
That change has merged, so if you rebase you should be able to get past the NoKindMatchError.
@richardcase oh - that change makes sense :+1: thanks for investigating and fixing it! Will rebase as soon as I'm able (currently on vacation)
Thanks @mweibel for putting this together! We're also interested in this extension because we want to use cluster-autoscaler to manage ASG nodes.
Question for my understanding: when the desired capacity on the ASG changes (from whatever external component is managing the ASG), what ensures that the AWSMachinePool will be updated to reflect the new desired capacity (as well as the spec.providerIDs list)?
Assuming that an ASG change will not result in any change to Kubernetes resource state, I'm wondering if we need some sort of polling mechanism where the AWSMachinePool reconciler periodically polls the AWS API to check the ASG state.
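One common answer in controller-runtime (shown here only as a hedged sketch with hypothetical names, not as what this PR implements) is to requeue after a fixed interval so the reconciler re-reads the ASG even when nothing changed on the Kubernetes side:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// pollInterval is a hypothetical interval at which the reconciler re-checks
// the ASG's desiredCapacity even when no Kubernetes event has fired.
const pollInterval = 1 * time.Minute

// AWSMachinePoolReconciler is stubbed here only to keep the sketch
// self-contained; the real reconciler carries clients and scopes.
type AWSMachinePoolReconciler struct{}

func (r *AWSMachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the (AWS)MachinePool, read the ASG from AWS, and mirror
	// desiredCapacity into the replica count when externally managed ...

	// Requeue so the ASG state is polled periodically.
	return ctrl.Result{RequeueAfter: pollInterval}, nil
}
```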
/lgtm
/lgtm
/test pull-cluster-api-provider-aws-e2e /test pull-cluster-api-provider-aws-e2e-eks
I do have one concern related to us patching the values in the MachinePool. Consider this scenario involving GitOps:
1. Create a set of cluster manifests where the MachinePool has replicas 1 and the cluster.x-k8s.io/replicas-managed-by-autoscaler annotation, and commit these to Git.
2. Your GitOps implementation of choice (flux, argo, kapp) applies this to the management cluster, and your cluster (and AWS infra) is created.
3. The ASG scales from 1 replica to 2.
4. The AWSMachinePool reconciler runs, sees the ASG has 2 replicas, and patches the MachinePool replicas to 2.
5. The MachinePool now differs from what is in Git, so it's reconciled back to replicas 1.
6. Go to 4.
So we end up in a loop of 4 & 5 that will keep patching the replicas on the MachinePool.
Does that make sense? Am I overthinking this?
@Skarlso - be good to get your thoughts on this.
e2e tests are passing :+1:
Ah, that's a good catch. We actually ran into a similar problem with one of the issues: something was patching a template, which ended up out of sync with what you had locally, so once you applied again the changes were lost.
Interesting thought. I don't see a way out of that other than perhaps having some kind of unmanaged field.
@richardcase that's a good catch indeed. We're using ArgoCD, where you can ignore differences. A user with a GitOps solution would need to do something like this:
ignoreDifferences:
  - group: cluster.x-k8s.io
    kind: MachinePool
    jsonPointers:
      - /spec/replicas
I agree that this is not beautiful. One way this could be different is if CAPI provided e.g. a status-level replicas field which overrules the spec.replicas field when set, so spec.replicas would only be considered for the first reconciliation. Not sure if that's a good approach, though.
@mweibel that is exactly what we are doing for autoscaled MachinePools as well. We also use argocd.
Ah, another thing that came to my mind: if I'm not mistaken, even when using the clusterapi cluster-autoscaler (i.e. not externally managed), the GitOps difference would need to be ignored.
An even better option is to completely omit the replicas field from your manifest.
> An even better option is to completely omit the replicas field from your manifest.

@dthorsen - I think that's probably the best solution.
What do you think of the following:
- Change the MachinePool update logic in the AWSMachinePool reconciler so that it only updates the replicas if it's not nil (see the sketch after this list)
- Add documentation for externally managed replicas that includes:
  - The option/recommendation of not setting replicas on MachinePool when using externally managed replicas
  - The sample above for ignoring the field with argo
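As a rough illustration of the first bullet (a hedged sketch with hypothetical names, not the code from this PR), the reconciler would only mirror the ASG's desiredCapacity into the MachinePool when the replicas field is actually set, and leave it alone when the manifest omitted it:

```go
package controllers

import (
	expclusterv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// setReplicasFromASG mirrors the ASG's desiredCapacity into the MachinePool,
// but only if spec.replicas is set; a nil value (field omitted from the
// manifest) is left untouched so GitOps tooling has nothing to fight over.
func setReplicasFromASG(mp *expclusterv1.MachinePool, desiredCapacity int32) {
	if mp == nil || mp.Spec.Replicas == nil {
		return
	}
	if *mp.Spec.Replicas != desiredCapacity {
		mp.Spec.Replicas = &desiredCapacity
	}
}
```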
> An even better option is to completely omit the replicas field from your manifest.
Wouldn't that still fail? Because it's nil locally, but set upstream?
> Wouldn't that still fail? Because it's nil locally, but set upstream?
It's omitted from the upstream MachinePool.
@richardcase looking at the webhook code, I believe this wouldn't work since the webhook would set it anyway: https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/api/v1beta1/machinepool_webhook.go#L54..L56
Or am I missing something there?
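The linked lines are in Cluster API's MachinePool defaulting webhook. As a hedged sketch of the behaviour being pointed at (paraphrased, not the verbatim upstream code): when spec.replicas is nil, the webhook defaults it, so simply omitting the field from the manifest does not keep it nil.

```go
package sketch

// machinePoolSketch is a stub standing in for CAPI's MachinePool type so the
// example is self-contained.
type machinePoolSketch struct {
	Spec struct {
		Replicas *int32
	}
}

// defaultReplicas mirrors the kind of defaulting the linked webhook performs:
// an unset replicas field receives a default value, so it never stays nil.
func defaultReplicas(m *machinePoolSketch) {
	if m.Spec.Replicas == nil {
		one := int32(1)
		m.Spec.Replicas = &one
	}
}
```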