
Autoscaler can't work if topology.workers.machineDeployments[0].replicas is set in ClusterClass

Open zhouke1991 opened this issue 3 years ago • 23 comments

What steps did you take and what happened:

  • Kubernetes version: v1.23.8
  • cluster-api version: v1.1.5
  • CAPA version: v1.2.0
  • cluster-autoscaler version: v1.20.0

I deployed an autoscaler deployment in the mgmt cluster, pointed at a workload cluster. When I apply a pod with large resource requests in the workload cluster, the pod stays in Pending status. In this scenario the autoscaler does act: the replicas of the MachineSet change to 2, but the ready count is always 1. I also checked the description of the MachineSet, and the MachineSet controller repeatedly created and deleted the machine.

What did you expect to happen: I found a workaround in https://github.com/kubernetes-sigs/cluster-api/issues/5442#issuecomment-1190713981, which says to simply unset the replicas of machineDeployments. But if I unset replicas, the default value replicas=1 is used instead. I want to set up a cluster with replicas=3 and still have the autoscaler work. How can I do that?

The cluster YAML looks like this:

spec:
  topology:
    workers:
      machineDeployments:
      - class: my-worker
        failureDomain: us-west-2a
        name: md-0
        replicas: 3

Anything else you would like to add:

(Screenshot of the MachineSet controller's actions attached.)

zhouke1991 avatar Sep 27 '22 10:09 zhouke1991

@zhouke1991: This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 27 '22 10:09 k8s-ci-robot

A workaround for this is to follow the pattern in https://github.com/kubernetes-sigs/cluster-api/issues/link but set the replica field in the MachineDeployment, where you're setting the autoscaler annotations, to whatever initial value you'd prefer.

The issue here is that the cluster topology controller will try to manage the number of replicas if the value is set in the Cluster object. If the annotations etc. are set the autoscaler will try to do the same, so you end up with two controllers setting the same field which won't work.

If you set the replicas field in the MachineDeployment when you create your cluster, future control will be left up to the autoscaler only.

killianmuldoon avatar Sep 27 '22 10:09 killianmuldoon

A workaround for this is to follow the pattern in https://github.com/kubernetes-sigs/cluster-api/issues/link but set the replica field in the MachineDeployment, where you're setting the autoscaler annotations, to whatever initial value you'd prefer.

The issue here is that the cluster topology controller will try to manage the number of replicas if the value is set in the Cluster object. If the annotations etc. are set the autoscaler will try to do the same, so you end up with two controllers setting the same field which won't work.

If you set the replicas field in the MachineDeployment when you create your cluster, future control will be left up to the autoscaler only.

@killianmuldoon Thanks for replying. Is the above link unavailable? I got a blank page from the issues search. Do you mean that we can set replicas=3 directly on the MachineDeployment when creating a cluster, while also setting the autoscaler annotations?

cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "6"
cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3"

Then we can apply both the Cluster and MachineDeployment YAML on the mgmt cluster. So this issue can't be avoided if I provide the Cluster YAML only?

zhouke1991 avatar Sep 27 '22 13:09 zhouke1991

Apologies - the correct link is: https://github.com/kubernetes-sigs/cluster-api/issues/5442#issuecomment-1190713981

killianmuldoon avatar Sep 27 '22 13:09 killianmuldoon

I think almost. I think the idea was that the annotations should be set via Cluster.spec.topology...

Unfortunately that's currently not supported, but there is an issue for it and we're working on it: https://github.com/kubernetes-sigs/cluster-api/issues/7006

sbueringer avatar Sep 27 '22 13:09 sbueringer

The current sequence for working around this issue is to:

  1. Create the cluster object
  2. The Cluster topology controller will create a MachineDeployment with no autoscaler annotations and a default replica number (i.e. 1)
  3. Edit the MachineDeployment to add the annotations and the desired replica count

As @sbueringer said, there is an open issue for setting the annotations through ClusterClass. I don't think that will solve the issue of setting the initial replica field, however, as the annotations can only manage min and max size.

killianmuldoon avatar Sep 27 '22 13:09 killianmuldoon

I would expect that it's fine if the replica count is initially 1, as the autoscaler should scale up to the min-size automatically (? just guessing)

sbueringer avatar Sep 27 '22 13:09 sbueringer

I think almost. I think the idea was that the annotations should be set via Cluster.spec.topology...

Unfortunately that's currently not supported, but there is an issue for it and we're working on it: #7006

@sbueringer I think I also need the feature in #7006 if I want to create a workload cluster with autoscaler annotations (cluster.k8s.io/cluster-api-autoscaler-node-group-max-size) in the MachineDeployment on day 0. Otherwise, users have to edit the MachineDeployment to add the annotations manually.

zhouke1991 avatar Sep 28 '22 16:09 zhouke1991

@zhouke1991

Correct. We're working on it and I would currently expect this to land in Cluster API v1.3.0.

sbueringer avatar Sep 28 '22 17:09 sbueringer

If I got it right we can achieve the requested behavior (autoscaler controlling a MD, that should start with 3 replicas) by

  1. Setting replicas to nil in MachineDeployments
  2. Adding cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3" on the machine deployment

@elmiko to confirm if you have some time

WRT the second point: as of today this is not possible when using ClusterClass, but it is already being tracked in https://github.com/kubernetes-sigs/cluster-api/issues/7006, and it is somewhat related to the ongoing discussion on label propagation that we are trying to address in the 1.3 release

If all this is confirmed, I would suggest closing this issue as a duplicate and continuing the work in https://github.com/kubernetes-sigs/cluster-api/issues/7006

fabriziopandini avatar Sep 29 '22 12:09 fabriziopandini

If I got it right we can achieve the requested behavior (autoscaler controlling a MD, that should start with 3 replicas) by

1. Setting replicas to nil in MachineDeployments

2. Adding cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3" on the machine deployment

hmm, this is a good scenario and one we need to be careful about. if the minimum size for the node group is "3" then we need to make sure that the initial replicas is also set to "3" on the MachineDeployment. the autoscaler will not try to bring a node group to its minimum; these values are just boundaries that the autoscaler will not go beyond.

if possible, i think we need the cluster class logic to be smart enough that when it sees a nil for replicas and the autoscaler annotations are set, then it should use the minimum value as the replica count. alternatively we need to be very clear about how the replicas should be set, which seems like it could be confusing.

elmiko avatar Sep 29 '22 13:09 elmiko

if possible, i think we need the cluster class logic to be smart enough that when it sees a nil for replicas and the autoscaler annotations are set, then it should use the minimum value as the replica count. alternatively we need to be very clear about how the replicas should be set, which seems like it could be confusing.

That's a bit hard to do in the Cluster topology reconciler, as it usually either always reconciles a field to a specific value or not at all. But "always" doesn't work here, as it would continuously overwrite the autoscaler.

Q: Just saw that those labels are deprecated in the auto scaler (https://github.com/kubernetes/autoscaler/blob/e478ee29596f541bef3d783843532d4fe64d9a48/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_utils.go#L31-L32). Is there a replacement?

sbueringer avatar Sep 29 '22 13:09 sbueringer

the autoscaler will not try to bring a node group to its minimum

I don't have context on the reason behind current behaviour, but IMO this seems like ignoring cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3". @elmiko Is there room to fix this behavior?

fabriziopandini avatar Sep 29 '22 13:09 fabriziopandini

That's a bit hard to do in the Cluster topology reconciler, as it usually either always reconciles a field to a specific value or not at all. But "always" doesn't work here, as it would continuously overwrite the autoscaler.

yeah... that does sound complicated. it makes me wonder if we need to add something to that API to allow for the autoscaler? (i'm not super fond of that idea as it's very specific to one cluster addon, but i feel compelled to ask)

Q: Just saw that those labels are deprecated in the auto scaler (https://github.com/kubernetes/autoscaler/blob/e478ee29596f541bef3d783843532d4fe64d9a48/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_utils.go#L31-L32). Is there a replacement?

unfortunately this is another error on my part, when merging the scale from zero support i should have removed those constants as we use a more dynamic method to determine the annotation values. i will remove them.

I don't have context on the reason behind current behaviour, but IMO this seems like ignoring cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3". @elmiko Is there room to fix this behavior?

this is long standing behavior for the autoscaler, i can offer a link to the autoscaler FAQ, and my thoughts on the topic:

i believe that the autoscaler is designed to work in a very simple and predictable manner, if it observes pending pods it will attempt to make more nodes, if nodes are under utilized it will remove them. creating more nodes to reach the minimum or reducing nodes when over the maximum are behaviors that the autoscaler has not historically performed. i'm guessing this would be a large change for the autoscaler users and could also have strange effects in places where people might manually adjust the size of their node groups. i am certainly willing to bring this up with the autoscaling sig, but i have a feeling they would not want to change this behavior.

elmiko avatar Sep 29 '22 17:09 elmiko

unfortunately this is another error on my part, when merging the scale from zero support i should have removed those constants as we use a more dynamic method to determine the annotation values. i will remove them.

Ah thx, so the annotations are not deprecated, just the local constants.

I'm not sure if it's acceptable but we could adjust the behavior of the MachineDeployment webhook. As of today it always defaults replicas to 1 if replicas is not set. We could default to the value of the min-size annotation if the annotation is present.

Would work like this:

  • Cluster.spec.topology...replicas is not set, min-size annotation is set (only works after https://github.com/kubernetes-sigs/cluster-api/issues/7006)
  • Cluster topology controller creates MachineDeployment without replicas set
  • MachineDeployment webhook sets spec.replicas to the value of the min-size annotation, 1 if the annotation is not set
  • cluster-autoscaler takes over

Really not sure if that's not too hacky
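A minimal sketch of this adjusted webhook defaulting behavior (the function name is illustrative and the annotation key is copied from this thread; the real MachineDeployment webhook in cluster-api is more involved):

```go
package main

import (
	"fmt"
	"strconv"
)

// minSizeAnnotation is the autoscaler annotation key as quoted in this thread.
const minSizeAnnotation = "cluster.k8s.io/cluster-api-autoscaler-node-group-min-size"

// defaultReplicasFromMinSize mimics the proposed change: if .spec.replicas is
// unset, default to the min-size annotation value when present, otherwise to 1.
func defaultReplicasFromMinSize(annotations map[string]string, replicas *int32) int32 {
	if replicas != nil {
		// Replicas explicitly set: leave the value alone.
		return *replicas
	}
	if v, ok := annotations[minSizeAnnotation]; ok {
		if n, err := strconv.ParseInt(v, 10, 32); err == nil {
			return int32(n)
		}
	}
	// No annotation (or unparseable value): fall back to today's default.
	return 1
}

func main() {
	anns := map[string]string{minSizeAnnotation: "3"}
	fmt.Println(defaultReplicasFromMinSize(anns, nil))                // annotation present -> 3
	fmt.Println(defaultReplicasFromMinSize(map[string]string{}, nil)) // no annotation -> 1
}
```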

sbueringer avatar Sep 29 '22 17:09 sbueringer

Ah thx, so the annotations are not deprecated, just the local constants.

yeah, and https://github.com/kubernetes/autoscaler/pull/5222 =)

Would work like this:

* Cluster.spec.topology...replicas is not set, min-size annotation is set (only works after [Labels and annotations for MachineDeployments and KubeadmControlPlane created by topology controller #7006](https://github.com/kubernetes-sigs/cluster-api/issues/7006))

* Cluster topology controller creates MachineDeployment without replicas set

* MachineDeployment webhook sets spec.replicas to the value of the min-size annotation, 1 if the annotation is not set

* cluster-autoscaler takes over

Really not sure if that's not too hacky

that sounds like it would work, i'm not sure about how hacky it might be.

i guess another option might be to create explicit values in the Cluster API for the autoscaler behavior, but i'm not sure that would be any more acceptable than what you have proposed.

elmiko avatar Sep 29 '22 18:09 elmiko

@elmiko thanks for the detailed explanation

I think that @sbueringer's idea makes sense. The one downside is that we are embedding in CAPI something autoscaler-specific when we should instead be agnostic, but I think we can make an exception in this case

@vincepri @enxebre opinions?

fabriziopandini avatar Sep 29 '22 18:09 fabriziopandini

I think that @sbueringer's idea makes sense. The one downside is that we are embedding in CAPI something autoscaler-specific when we should instead be agnostic, but I think we can make an exception in this case

+1, i think the desire to make this agnostic is good especially since there might be situations in the future where people want to run different autoscalers (eg someone mentioned karpenter with cluster-api).

elmiko avatar Sep 29 '22 19:09 elmiko

One way to do this is to use a different additional annotation (something that just expresses "default replicas")

sbueringer avatar Sep 30 '22 11:09 sbueringer

If we're adding an annotation, would it be better to add something even more generic - e.g. external-scaling - to allow users to set replicas in the ClusterClass but only apply the replica count on MD creation?

That way the flow feels more natural to me i.e. replicas are set in the replica field but scaling is no longer handled by the topology controller. It feels confusing to have two places to set the replica count.

killianmuldoon avatar Sep 30 '22 11:09 killianmuldoon

The problem is that if we only set it on create, the subsequent update would remove the field again, which again leads to the default value being set.

This would only work if the object is created/modified in exactly this order:

  1. Cluster topology controller creates the MD with .spec.replicas being set (and thus owns .spec.replicas)
  2. autoscaler updates the MD and overwrites the .spec.replicas field (and thus takes ownership)
  3. Next Cluster topology controller reconcile updates the MD without setting .spec.replicas (and thus autoscaler keeps ownership)

If between 1. and 3. the autoscaler doesn't take ownership, 3. would clear the field, which leads to the MD webhook setting 1 as the default again.

sbueringer avatar Sep 30 '22 11:09 sbueringer

If between 1. and 3. the autoscaler doesn't take ownership, 3. would clear the field, which leads to the MD webhook setting 1 as the default again.

I think there are a few ways to work around that in the implementation, e.g. the webhook could check for the annotation and not default if it's present. I'm wondering, though, what the cleanest version of the API for autoscaling is, assuming we want one way to support multiple types of external scaling.

killianmuldoon avatar Sep 30 '22 11:09 killianmuldoon

i wonder if it makes sense to have some sort of field or annotation that shows the MD is being managed by something else?

elmiko avatar Sep 30 '22 20:09 elmiko

/assign

abhijit-dev82 avatar Dec 07 '22 17:12 abhijit-dev82

/triage accepted

fabriziopandini avatar Dec 28 '22 11:12 fabriziopandini

Summary of the comments above:

  • As soon as replicas is set in Cluster...machineDeployments[].replicas the Cluster topology controller will enforce this value on the MachineDeployment
  • The autoscaler can only take over control of the replicas field if replicas is not set in the Cluster.
  • When the Cluster topology controller creates a new MachineDeployment replicas is defaulted to 1.
  • When the replica field is unset in the Cluster, the Cluster topology controller will unset the replica field in the MachineDeployment and this will default replicas to 1 as well.
  • The autoscaler only takes over control of the replicas field if the current MD replicas value is in the (min-size,max-size) range.

Use cases:

1. A new MD is created and replicas should be managed by the autoscaler

When a new MD is created it will be created with the min and max size annotations to enable the autoscaler.

Today this doesn't always work, as replicas is always defaulted to 1, and 1 might not be in the (min-size,max-size) range (and because of that the autoscaler doesn't act).

2. An existing MD which initially wasn't controlled by the autoscaler should later be controlled by the autoscaler

A MD is created and initially the replica field is controlled by the Cluster topology reconciler (it is enforcing the replicas value from Cluster). The control of the replicas field of this MD should now be handed over to the autoscaler.

Today this doesn't work well, because when the topology reconciler unsets the replicas field it is defaulted to 1. This is disruptive if the MD currently has more than 1 replica (and 1 might also not be in the (min-size,max-size) range).

To summarize, I think we should provide a way to control to which value the replicas field is defaulted.

I would propose the following behavior for the defaulting webhook:

  • If (new) MD replicas is set => keep the current value
  • If machinedeployment.cluster.x-k8s.io/default-replicas is set => use the annotation value
  • If autoscaler min-size and max-size annotations are set:
    • If it's a new MD => use min-size
    • If old MD replicas is < min-size => use min-size
    • If old MD replicas is > max-size => use max-size
    • If old MD replicas is >= min-size && <= max-size => keep the old MD replicas value
  • Otherwise => use 1

(POC: #7990)

I think this is a good trade-off: we explicitly support the autoscaler as well as we can by steering the replicas value towards the (min-size,max-size) range, while still allowing folks to set a specific default value with the machinedeployment.cluster.x-k8s.io/default-replicas annotation if necessary. This annotation could be used independently of the autoscaler as well.
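The proposed defaulting rules can be sketched as a small, self-contained function. Names here are illustrative only; the actual webhook lives in the cluster-api codebase and the POC in #7990 may differ:

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation keys as discussed in this thread.
const (
	defaultReplicasAnnotation = "machinedeployment.cluster.x-k8s.io/default-replicas"
	minSizeAnnotation         = "cluster.k8s.io/cluster-api-autoscaler-node-group-min-size"
	maxSizeAnnotation         = "cluster.k8s.io/cluster-api-autoscaler-node-group-max-size"
)

// defaultReplicas applies the proposed defaulting rules.
// newReplicas is nil when .spec.replicas is unset on the incoming object;
// oldReplicas is nil when the MachineDeployment is being created.
func defaultReplicas(annotations map[string]string, newReplicas, oldReplicas *int32) int32 {
	// 1. Replicas explicitly set => keep the current value.
	if newReplicas != nil {
		return *newReplicas
	}
	// 2. default-replicas annotation set => use the annotation value.
	if v, ok := annotations[defaultReplicasAnnotation]; ok {
		if n, err := strconv.ParseInt(v, 10, 32); err == nil {
			return int32(n)
		}
	}
	// 3. Autoscaler min-size and max-size annotations set => steer
	//    replicas into the [min-size, max-size] range.
	minV, okMin := parseAnnotation(annotations, minSizeAnnotation)
	maxV, okMax := parseAnnotation(annotations, maxSizeAnnotation)
	if okMin && okMax {
		switch {
		case oldReplicas == nil: // new MachineDeployment
			return minV
		case *oldReplicas < minV:
			return minV
		case *oldReplicas > maxV:
			return maxV
		default: // old value already within range: keep it
			return *oldReplicas
		}
	}
	// 4. Otherwise => use 1.
	return 1
}

// parseAnnotation reads an int32 annotation value, reporting whether it was
// present and parseable.
func parseAnnotation(annotations map[string]string, key string) (int32, bool) {
	v, ok := annotations[key]
	if !ok {
		return 0, false
	}
	n, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return 0, false
	}
	return int32(n), true
}

func main() {
	anns := map[string]string{minSizeAnnotation: "3", maxSizeAnnotation: "6"}
	fmt.Println(defaultReplicas(anns, nil, nil)) // new MD -> min-size (3)
	old := int32(8)
	fmt.Println(defaultReplicas(anns, nil, &old)) // above max -> max-size (6)
}
```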

cc @enxebre @fabriziopandini

sbueringer avatar Jan 25 '23 11:01 sbueringer

Thanks for the summary! I'm +1 to the proposed solution.

While documenting this we should somehow surface that:

  • We are providing a smoother UX for clusters using the K8s autoscaler because we are planning to use it for Cluster API scale testing and the autoscaler currently doesn't manage MD with replica number out of the range defined by min/max size; machinedeployment.cluster.x-k8s.io/default-replicas provides a solution for other autoscalers if they have a similar issue.
  • The UX shortcut above (inferring replicas from min/max size) works if the K8s autoscaler is using the default cluster API node group, which we assume is the most common use case; also in this case, machinedeployment.cluster.x-k8s.io/default-replicas provides a solution for people using custom node group names (or at least this is my suggestion for the first iteration)

/retitle MachineDeployment.spec.replicas defaulting should take into account autoscaler min/max size if defined

fabriziopandini avatar Jan 25 '23 13:01 fabriziopandini

excellent summary and suggestion @sbueringer , i tend to agree with @fabriziopandini and i'm +1 as well. but, i also agree about:

  • If old MD replicas is < min-size => use min-size
  • If old MD replicas is > max-size => use max-size

We are providing a smoother UX for clusters using the K8s autoscaler because we are planning to use it for Cluster API scale testing and the autoscaler currently doesn't manage MD with replica number out of the range defined by min/max size; machinedeployment.cluster.x-k8s.io/default-replicas provides a solution for other autoscalers if they have a similar issue.

we should be careful about how we release and document this because it will be a different behavior than users currently expect from the autoscaler. but, since this behavior is originating from the cluster-api side of the tooling, i think it will be ok as long as we are clear about it. we should probably add a note to the autoscaler docs as well.

elmiko avatar Jan 25 '23 14:01 elmiko

Let's bring this up in the office hours for better visibility

fabriziopandini avatar Jan 25 '23 14:01 fabriziopandini

Does the scope of this effort include enforcing cluster-autoscaler min/max when the replicas are updated?

For example, if the active CA min value is set to 3 and a user manually sets the replica count to 1, then (IMO) CAPI should reject that new desired configuration. Otherwise we are just adding operational thrashing (cluster-autoscaler will observe that and then "re-enforce" the minimum by scaling back up to 3).

@elmiko does that sound right?

jackfrancis avatar Jan 25 '23 18:01 jackfrancis

Does the scope of this effort include enforcing cluster-autoscaler min/max when the replicas are updated?

No.

cluster-autoscaler will observe that and then "re-enforce" the minimum by scaling back up to 3

I think it would not. If the replica field is explicitly set to something outside of the (min-size,max-size) range the autoscaler doesn't act (which is expected behavior of the autoscaler)

sbueringer avatar Jan 26 '23 12:01 sbueringer