
AWSMachinePool does not work in combination with cluster-autoscaler

Open dthorsen opened this issue 5 years ago • 42 comments

/kind bug

What steps did you take and what happened:

  • Create a workload cluster with the experimental EKS Control Plane
  • Create a MachinePool without specifying replicas and create associated AWSMachinePool resources.
  • Install cluster-autoscaler and configure it to autoscale the autoscaling group managed by the AWSMachinePool
  • Create a deployment that will trigger cluster-autoscaler to scale up the autoscaling group.

This caused the AWSMachinePool controller and the cluster-autoscaler to both attempt to manage DesiredInstances. As a result, the DesiredInstances value on the ASG alternated between the value cluster-autoscaler computed as needed and the replicas: 1 value defaulted on the MachinePool object.

What did you expect to happen: cluster-autoscaler should manage the DesiredInstances of the autoscaling group and the AWSMachinePool controller should ignore changes to the replicas and DesiredInstances fields.

Anything else you would like to add: Even though replicas is an optional field on the MachinePool resource, it defaults to 1, so leaving the value unspecified is insufficient to prevent the AWSMachinePool controller from attempting to manage the DesiredInstances count.
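
To illustrate why leaving replicas unset doesn't help, the defaulting amounts to something like the following (a minimal sketch only; the real defaulting happens in the CAPI MachinePool webhook and the types here are simplified stand-ins):

```go
package main

import "fmt"

// Simplified stand-in for the CAPI MachinePool spec; the real type lives
// in the cluster-api exp API package and is defaulted by a webhook.
type MachinePoolSpec struct {
	Replicas *int32
}

// defaultReplicas mirrors the behaviour described above: an omitted
// replicas value is filled in with 1, so the AWSMachinePool controller
// always sees a concrete number to reconcile DesiredInstances toward.
func defaultReplicas(spec *MachinePoolSpec) {
	if spec.Replicas == nil {
		one := int32(1)
		spec.Replicas = &one
	}
}

func main() {
	spec := MachinePoolSpec{} // user left replicas unset
	defaultReplicas(&spec)
	fmt.Println(*spec.Replicas) // prints 1, not "unmanaged"
}
```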

Environment:

  • Cluster-api-provider-aws version: Commit: 3338cd4
  • Kubernetes version: (use kubectl version): v1.17.9
  • OS (e.g. from /etc/os-release): Amazon Linux 2

dthorsen avatar Oct 13 '20 17:10 dthorsen

I'm not sure this is a bug, actually; there's going to be a clash of behaviour here.

I think you'll want to use the Cluster API native mode of Cluster Autoscaler? @benmoss is more knowledgeable in this area.

There are probably outstanding tasks we need in terms of reaching functional parity. In the meantime, at worst, and I'm loath to do it, we could possibly add an immediately and deliberately deprecated field such as "disableReplicaCountReconciliation".

randomvariable avatar Oct 14 '20 09:10 randomvariable

@randomvariable There is indeed a clash of behavior. I listed it as a bug because it failed to operate in the manner we originally designed, due to the fact that the MachinePool replicas can't really be nil. The original thought was: if you don't set replicas at all, then CAPA won't manage replicas/DesiredInstances; if you want CAPA to manage them, you set replicas. That seemed like a sensible interface.

At the moment there are no Machine or AWSMachine objects created from a MachinePool, only Node objects. For the time being, for our use cases, we planned to use CAPA to create the ASGs and let cluster-autoscaler manage scaling. It enables us to take advantage of cluster-autoscaler's full feature set today, including things like scale to and from 0. I think a field in the spec would be a good idea to make this behavior configurable for the user. I am not certain it should be a deprecated field; it is hard to imagine all the possible use cases around autoscaling groups and where the replica count may be best controlled. I have also been told that cluster-autoscaler is frequently forked and customized for special use cases. I think flexibility in the interface could be valuable for the foreseeable future.

dthorsen avatar Oct 14 '20 19:10 dthorsen

Interesting data point for @elmiko too here.

randomvariable avatar Oct 14 '20 19:10 randomvariable

thanks for the ping @randomvariable , an interesting issue for sure.

i agree with @randomvariable , there is a clash happening between the MachinePool and the way the cluster-autoscaler is attempting to scale. i don't know the MachinePool mechanics very well, but from the cluster-autoscaler side the capi provider attempts to manipulate the MachineSet and MachineDeployment replicas when it wants to scale. if these replica counts are disjointed from the MachinePool then i could envision some unexpected behavior.

edit:

if the cluster-autoscaler is being configured to use the aws provider, then there might be some issues between what the capi machinery is doing and how the cluster-autoscaler is attempting to scale. if this is the case, then i would expect the issue to be more on the capi/capa side of things.

elmiko avatar Oct 14 '20 19:10 elmiko

That is correct @elmiko, this behavior is observed when operating cluster-autoscaler with the AWS provider, not the CAPI provider. In our local installation we have worked around this for now by simply telling CAPA not to react to replicas as long as replicas is set to 1. This allows cluster-autoscaler to work, so if we had a field in the AWSMachinePool spec to more properly ignore replicas, that would allow cluster-autoscaler with AWS provider to just work as it would with any other ASG.
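
Roughly, that local workaround looks like this (a simplified sketch, not the actual CAPA code; the types and names here are illustrative):

```go
package main

import "fmt"

// Minimal stand-in for the observed AWS Auto Scaling group.
type ASG struct {
	DesiredCapacity int32
}

// reconcileDesiredCapacity sketches the workaround described above: treat
// the defaulted replica count of 1 as "externally managed" and leave
// whatever cluster-autoscaler set on the ASG alone.
func reconcileDesiredCapacity(replicas int32, asg *ASG) {
	if replicas == 1 {
		return
	}
	asg.DesiredCapacity = replicas
}

func main() {
	asg := &ASG{DesiredCapacity: 7} // cluster-autoscaler scaled the ASG to 7
	reconcileDesiredCapacity(1, asg)
	fmt.Println(asg.DesiredCapacity) // still 7: CAPA did not fight the autoscaler
}
```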

dthorsen avatar Oct 14 '20 19:10 dthorsen

that makes sense @dthorsen, seems like this is really a feature request to add something that will ignore the replicas in this type of situation.

elmiko avatar Oct 14 '20 19:10 elmiko

I don't really understand this, why does the AWSMachinePool controller keep changing DesiredInstances? It's not clear this is an autoscaler bug, seems more appropriate to file it against CAPA.

benmoss avatar Oct 14 '20 19:10 benmoss

@benmoss This bug is filed against CAPA, not cluster-autoscaler.

@elmiko I am OK with changing it to a CAPA feature request instead of a CAPA bug if that is everyone's preference based on the context provided. I only listed it as bug because it did not behave as we initially designed it.

dthorsen avatar Oct 14 '20 19:10 dthorsen

/kind feature

dthorsen avatar Oct 14 '20 20:10 dthorsen

/remove-kind bug

dthorsen avatar Oct 14 '20 20:10 dthorsen

Hah, my mistake 😸

benmoss avatar Oct 15 '20 15:10 benmoss

/assign

dthorsen avatar Nov 04 '20 15:11 dthorsen

I propose that we add an optional field to the AWSMachinePool spec called ignoreReplicas, which defaults to false. The reconciliation logic will use this field to ignore the MachinePool replicas field and to always omit the DesiredInstances field in any requests to the AWS AutoScalingGroup API.
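
Sketching what that could look like on the API type (illustrative only; the package name is made up here and the exact field placement and doc comments would be settled in the PR):

```go
// Illustrative sketch; the actual AWSMachinePoolSpec lives in the CAPA
// exp API package and has more fields than shown here.
package sketch

type AWSMachinePoolSpec struct {
	// MinSize and MaxSize continue to bound the ASG as they do today.
	MinSize int32 `json:"minSize"`
	MaxSize int32 `json:"maxSize"`

	// IgnoreReplicas, when true, tells the controller to ignore the owning
	// MachinePool's replicas field and to omit DesiredInstances from all
	// AutoScalingGroup API requests, leaving desired capacity to an
	// external autoscaler such as cluster-autoscaler.
	// +optional
	IgnoreReplicas bool `json:"ignoreReplicas,omitempty"`
}
```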

@MarcusNoble @randomvariable Does this sound like an acceptable solution?

dthorsen avatar Nov 04 '20 15:11 dthorsen

Sounds good to me. Just not 100% sure on the naming. replicas makes me think about kubernetes resources rather than ASG instances. How about something like ignoreDesiredSize?

MarcusNoble avatar Nov 05 '20 07:11 MarcusNoble

@MarcusNoble Hmm, technically we are ignoring both when this is set. From a cluster-api user's perspective, if they update replicas on the MachinePool Kubernetes resource, the controller will do nothing if this field is set.

dthorsen avatar Nov 05 '20 15:11 dthorsen

That is a good point. The name makes sense then.

The alternative is to handle an unset replicas value as meaning ignore, rather than setting a new property.

Edit: looking at the code that might actually be what's currently happening if I'm following correctly. If no replicas is set on the MachinePool then it looks like it'll default to MinSize (as that's what AWS does) and never get updated by CAPA.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/e526925b25ff006774885d129a16de6da504f196/pkg/cloud/services/autoscaling/autoscalinggroup.go#L163-L165

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/e526925b25ff006774885d129a16de6da504f196/pkg/cloud/services/autoscaling/autoscalinggroup.go#L208-L210

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/e526925b25ff006774885d129a16de6da504f196/pkg/cloud/services/autoscaling/autoscalinggroup.go#L280-L282

MarcusNoble avatar Nov 05 '20 16:11 MarcusNoble

That was the original intent, but the code doesn't work as intended because replicas will never be nil (also incidentally why I originally opened this issue as a bug); it will be defaulted to a value of 1 if omitted. It also occurred to me that using replicas to indicate "don't reconcile this value" is a pretty fragile signal. If cluster-autoscaler has scaled up the pool to, say, 100 instances, and someone comes back, misunderstands how it works, and sets replicas to 120, AWSMachinePool will suddenly begin reconciling replicas and scale up to 120 instances. Then cluster-autoscaler will try to scale down, and we are back in a thrashing scenario. Based on this feedback, I am going to start on a PR to implement a proper ignoreReplicas spec field.
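
On the reconcile side, the behaviour I have in mind is roughly the following (a sketch with illustrative names; the real change would go through the autoscaling service that builds the ASG requests):

```go
package main

import "fmt"

// Minimal stand-in for the AWS SDK update request; names are illustrative.
type UpdateAutoScalingGroupInput struct {
	MinSize         *int32
	MaxSize         *int32
	DesiredCapacity *int32
}

// buildUpdateInput sketches the proposed behaviour: when ignoreReplicas is
// set, DesiredCapacity is omitted entirely so the ASG keeps whatever value
// cluster-autoscaler (or any external controller) last set.
func buildUpdateInput(minSize, maxSize, replicas int32, ignoreReplicas bool) UpdateAutoScalingGroupInput {
	input := UpdateAutoScalingGroupInput{
		MinSize: &minSize,
		MaxSize: &maxSize,
	}
	if !ignoreReplicas {
		input.DesiredCapacity = &replicas
	}
	return input
}

func main() {
	in := buildUpdateInput(1, 100, 1, true)
	fmt.Println(in.DesiredCapacity == nil) // true: desired capacity left untouched
}
```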

dthorsen avatar Nov 05 '20 17:11 dthorsen

@MarcusNoble @randomvariable I am finally getting back to this work. Sorry for the delay. After I observed the following behavior in our clusters, I realized the poor UX of totally ignoring the replicas field:

$ kubectl get machinepool
NAMESPACE         NAME                                                     REPLICAS   PHASE         VERSION
test-cluster      test-cluster-nodes-a                                     10         ScalingDown   v1.17.12

The MachinePool controller thinks this pool is always "scaling down" because the replicas field in the spec is set to 1 on the MachinePool, while the replicas field on the status has a value of 10. I am leaning toward calling this field something more like externallyManagedReplicaCount, and enhancing the logic so the AWSMachinePool reconciliation loop also updates the MachinePool .spec.replicas field, in order to make the UX a bit more sane.
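
To sketch what I mean by having the reconciliation loop keep .spec.replicas in sync when the count is externally managed (names are illustrative; the real change would patch the owning MachinePool object):

```go
package main

import "fmt"

// Minimal stand-in for the CAPI MachinePool spec.
type MachinePoolSpec struct {
	Replicas *int32
}

// syncExternallyManagedReplicas sketches the externallyManagedReplicaCount
// idea: instead of reconciling the ASG toward .spec.replicas, copy the
// ASG's current desired capacity back into .spec.replicas so the
// MachinePool phase (e.g. "ScalingDown") reflects reality.
func syncExternallyManagedReplicas(spec *MachinePoolSpec, asgDesiredCapacity int32) {
	if spec.Replicas == nil || *spec.Replicas != asgDesiredCapacity {
		spec.Replicas = &asgDesiredCapacity
	}
}

func main() {
	one := int32(1)
	spec := MachinePoolSpec{Replicas: &one}
	syncExternallyManagedReplicas(&spec, 10) // cluster-autoscaler scaled the ASG to 10
	fmt.Println(*spec.Replicas)              // 10: no more permanent "ScalingDown"
}
```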

An alternative approach, I think, would be some change to the MachinePool spec that allows delegation of replica control to the infrastructure provider at the CAPI level instead of doing this at the infrastructure level. I'd like to hear your input as to which approach would make more sense.

dthorsen avatar Jan 20 '21 22:01 dthorsen

Where's the value of 10 coming from on the status? Is that the current number of ASG instances?

MarcusNoble avatar Jan 21 '21 07:01 MarcusNoble

@dthorsen you might want to chat to @devigned and @CecileRobertMichon , as there's similar discussion there, and we are likely to change MachinePools in v1alpha4. See https://github.com/kubernetes-sigs/cluster-api/issues/4063

randomvariable avatar Jan 21 '21 14:01 randomvariable

How would nil replicas work with K8s version upgrades? In https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/819 we are planning on enabling MaxUnavailable and MaxSurge for safe rolling upgrade scenarios. These scenarios both implicitly require an understanding of the desired number of replicas. Seems like externalizing management of the replica count would impair other behaviors, or at a minimum require some coordination with other behaviors. Would it not make more sense to inform CA that the infra is managed by CAPI and to manipulate replica counts via the CAPI MachinePool?

devigned avatar Jan 21 '21 15:01 devigned

Thank you @randomvariable.

@devigned replicas is not nil; it is not allowed to be nil, as it will default to 1. The conversation going on in kubernetes-sigs/cluster-api#4063 is interesting; I'll chime in a bit in that thread as well. My goal was to push more of the instance management logic into the underlying cloud provider: be able to utilize things like Instance Refresh and Managed Node Groups, and enable cluster-autoscaler to use the AWS provider.

dthorsen avatar Jan 21 '21 16:01 dthorsen

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jun 09 '21 19:06 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

k8s-triage-robot avatar Jul 28 '21 17:07 k8s-triage-robot

/remove-lifecycle rotten

devigned avatar Jul 28 '21 17:07 devigned

/cc @mboersma

devigned avatar Jul 28 '21 17:07 devigned

👋 Just adding a data point if still needed. In order to use CAPI in production we would like to utilise MachinePools instead of MachineDeployments with EKS, as MachineDeployments rely on the management cluster as a target for Cluster Autoscaler to scale. We would like the worker clusters to be as independent operationally from the management clusters as possible. A second motivation for us to use MachinePools and have them autoscaled independently of CAPI is so we can utilise other specialised autoscalers like https://github.com/atlassian/escalator without having to add support for targeting CAPI as a backend.

Jacobious52 avatar Oct 21 '21 06:10 Jacobious52

@dthorsen from the discussion in Slack, I think in your fork you have the API change in CAPI rather than CAPA. Can you open up an issue there? We can continue with this one for tracking...

randomvariable avatar Nov 08 '21 19:11 randomvariable

/priority important-soon

randomvariable avatar Nov 08 '21 19:11 randomvariable

@randomvariable The field is set in CAPI, but the implementation is delegated to the providers. So I will have a PR to both CAPI and CAPA when I'm done. I'm starting on this now.

dthorsen avatar Nov 12 '21 17:11 dthorsen