Support deallocate mode for Azure vmss when shutting down
Which component are you using?:
Cluster autoscaler for Azure
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Azure does not support customizing the node image, but it does allow stopping (deallocating) nodes in a node pool. This would drastically speed up the startup of our nodes using sysbox (which is currently installed by a DaemonSet at startup). https://github.com/nestybox/sysbox/blob/master/docs/user-guide/install-k8s.md
Describe the solution you'd like.:
- We should be able to configure a given VMSS with a tag/label indicating that its nodes should be deallocated instead of deleted (delete should remain the default).
- In the scale_down code, this property should determine whether we call DeleteInstances or DeallocateInstances for a VMSS.
- In the scale_up code, deallocated nodes should be started first, before scaling up the VMSS.
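The three bullets above could be wired together roughly as follows. This is only a sketch: `ScaleDownMode`, `scaleDownModeTag`, and `modeForVMSS` are made-up names for illustration, not the Azure provider's actual types.

```go
package main

import "fmt"

// ScaleDownMode mirrors the proposed per-VMSS setting: delete instances
// outright (the default) or deallocate them so they can be restarted later.
type ScaleDownMode int

const (
	ScaleDownDelete ScaleDownMode = iota
	ScaleDownDeallocate
)

// scaleDownModeTag is a hypothetical tag name; the real implementation
// would use whatever convention the Azure provider settles on.
const scaleDownModeTag = "cluster-autoscaler-scale-down-mode"

// modeForVMSS reads the proposed tag from a VMSS's tag map, defaulting
// to Delete when the tag is absent or unrecognized.
func modeForVMSS(tags map[string]string) ScaleDownMode {
	if tags[scaleDownModeTag] == "deallocate" {
		return ScaleDownDeallocate
	}
	return ScaleDownDelete
}

func main() {
	tagged := map[string]string{scaleDownModeTag: "deallocate"}
	fmt.Println(modeForVMSS(tagged) == ScaleDownDeallocate) // true
	fmt.Println(modeForVMSS(nil) == ScaleDownDelete)        // true
}
```

The scale-down path would then branch on the returned mode to call either DeleteInstances or DeallocateInstances, and the scale-up path would first restart deallocated instances before increasing the VMSS capacity.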
Describe any alternative solutions you've considered.:
- We first looked at customizing the node image (this was our preferred option), but this is not allowed by Azure; only the CAPI project and the legacy aks-engine project support it.
Additional context.: In the Azure docs about scale-down mode, it is mentioned as their solution to speed up initial node creation. https://learn.microsoft.com/en-us/azure/aks/scale-down-mode
We already support deallocate mode in the AKS managed autoscaler. As you linked, one can configure the scale-down-mode in the Agentpool API to make their node pool use deallocate mode.
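For reference, the scale-down mode mentioned above can be set per node pool via the Azure CLI; a sketch with placeholder resource names:

```shell
# Placeholder names; --scale-down-mode Deallocate makes AKS deallocate
# instances on scale-down instead of deleting them.
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --scale-down-mode Deallocate
```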
Ideally, we would like to bring these changes upstream, but there is a lot involved in our deallocate mode feature, and it's specific to our cloud provider.
I believe we do start the deallocated instances first, and we do everything else besides letting you specify an individual VMSS to deallocate: we specify the deletion mode (and startup mode) via the node pool's scale-down-mode.
Are you using an unmanaged cluster autoscaler deployed to your nodes, or are you using the AKS managed autoscaler?
Hi Bryce,
From the Azure docs it indeed seemed that it should work. Is this fork open source, so that I can take a look?
We run an unmanaged cluster autoscaler because of some issues with the managed one:
- need to support scale down to 0 for all our nodepools: https://github.com/Azure/AKS/issues/2976
- support all configuration options for the cluster autoscaler
- allows us to easily migrate to new nodepools when upgrading to newer Kubernetes versions
Can you share the configuration options you set? If they are shared with other customers, maybe we can bring them back to the managed AKS autoscaler and the nodepool API. No promises, just if there is enough overlap with other customers.
We support scale from zero with our managed autoscaler as well. And we have an auto-upgrader for easy migration of nodepools on new Kubernetes versions, which I have found useful for upgrading my own personal clusters.
We do not support scale from zero or deallocate mode upstream, but both are supported in managed AKS. We would like to contribute some of these features upstream, but in a way where upstream will accept these changes and it won't be too disruptive for other cloud providers.
Could you outline
- Flags you use in your autoscaler configuration
- What are the migration pain points when using the auto-upgrader?
Hi Bryce,
Most of our effort related to the cluster-autoscaler on AKS dates from last year. Back then we had too many issues at our customers due to taints causing scale-up to fail. Also, back then the managed cluster-autoscaler did not allow most of the configuration options (e.g. scale-down-unneeded, scale-down-utilization-threshold, ...). As far as I know, you now support all the necessary configuration options.
Our main remaining issue is that we cannot specify user-defined taints to ignore (apart from the remediator taint), which impacts the cluster-autoscaler scaling up from 0. We have our own taints, which we could change to include the ignoreTaintsPrefix, but we also use, for example, sysbox on certain nodepools, which has its own taints. For those it is not feasible to fork the project just to rename the taint.
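For context, recent upstream cluster-autoscaler releases expose an --ignore-taint flag for this scenario, which an unmanaged deployment can set directly; a sketch with an illustrative (made-up) taint key:

```shell
# Illustrative flags for an unmanaged cluster-autoscaler deployment.
# --ignore-taint tells the autoscaler to disregard the named taint when
# building node templates, so scale-up from zero still works for node
# pools whose nodes carry that taint. The taint key below is an example.
./cluster-autoscaler \
  --cloud-provider=azure \
  --ignore-taint=example.com/sysbox-runtime
```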
We indeed do not use the provided auto-upgrader, mainly because of two issues, both of which come down to minimizing the potential impact on our customers.
- Certain components we deploy at customers are tied to specific Kubernetes versions, which is why we want to keep control of upgrading those, after testing them on our internal clusters before rolling them out to our customers.
- Maybe even more important is reducing the impact of the rollout at our customers. Since we primarily run batch jobs at our customers, we cannot just drain all the nodes, as these jobs would then fail, which is not what we want. It is not completely clear to me how the auto-upgrade works internally, which is why I do not want to rely on it. We want to:
- cordon old nodes
- create new nodes
- delete the old nodes only when they are unneeded
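Assuming the old pool's nodes carry a selectable label (agentpool=oldpool here is purely illustrative), the steps above could be driven along these lines:

```shell
# Cordon the old pool's nodes so new pods schedule onto the new pool;
# kubectl cordon accepts a label selector via -l.
kubectl cordon -l agentpool=oldpool

# Running batch jobs finish in place; once the cordoned nodes become
# unneeded, the cluster autoscaler removes them on its own.
```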
A last suggestion, or question: it is not at all clear to me what improvements Azure makes to the internal components. For example, I always assumed you run a fork of the cluster-autoscaler, but where its functionality differs from the upstream repository is not at all clear to me. I find it a pity that you take this open-source component, which has been developed by the community, fork it, and do not make it available to the community.
/assign Bryce-Soghigian
Thanks for providing this perspective. We are committed to reducing the difference between our managed autoscaler and the upstream autoscaler as much as possible, but it is a gradual effort, ongoing with each new Kubernetes version.
@nclaeys I'm working on implementing VMSS deallocate for the azure provider here:
https://github.com/kubernetes/autoscaler/pull/6202
I'd love your feedback on the implementation and UX!
/assign
@jackfrancis Sorry for the late response. From a user perspective this looks great. This implements the functionality that I would like to see when using vmss. Thanks for contributing this :+1: !
@nclaeys I'm planning to demo this and propose to the community next Monday Jan 22 at 16:00 Poland time zone (details here) if you'd like to attend and add feedback
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten