[Feature] Expose `--startup-taints` (`--ignore-taints`) option in autoscaler profiles

Open hterik opened this issue 3 years ago • 39 comments

Is your feature request related to a problem? Please describe. We need to start nodes with custom taints so that required DaemonSets can start before any other pods are scheduled onto the node. Today, however, putting the taint on the NodePool excludes the pool from scale-up: the autoscaler's node template concludes that the pod can never run on the node because of the taint, even though it eventually can once the DaemonSets have initialized the node.

Describe the solution you'd like: The Kubernetes cluster autoscaler has an option called --ignore-taints that enables the use case above. It would be good if it were exposed in the AKS autoscaler profile. https://learn.microsoft.com/en-us/azure-stack/aks-hci/work-with-autoscaler-profiles

Describe alternatives you've considered: As a workaround, there is also an annotation prefix one can use: ignore-taint.cluster-autoscaler.kubernetes.io/
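
To illustrate why the prefix helps, here is a minimal Python sketch of the scale-up decision described above. It is a simplified model, not the cluster autoscaler's actual code. It assumes the prefix is applied to the taint key, which is how upstream cluster-autoscaler matches it, and the taint name `example.com/initializing` and the helper names are hypothetical.

```python
# Simplified model of the scale-up check; NOT the cluster-autoscaler's real code.
# Assumption: the prefix from this issue is matched against the taint *key*.
IGNORED_PREFIXES = ("ignore-taint.cluster-autoscaler.kubernetes.io/",)


def template_taints(node_pool_taints, ignored_prefixes=IGNORED_PREFIXES):
    """Return the taints the scale-up simulation would still consider."""
    return [t for t in node_pool_taints if not t["key"].startswith(ignored_prefixes)]


def tolerates(pod_tolerations, taint):
    """Reduced toleration match: a missing key or effect acts as a wildcard."""
    for tol in pod_tolerations:
        key_ok = tol.get("key") in (None, taint["key"])
        effect_ok = tol.get("effect") in (None, taint["effect"])
        if key_ok and effect_ok:
            return True
    return False


def would_scale_up(pod_tolerations, node_pool_taints):
    """The pool is a scale-up candidate only if every remaining taint is tolerated."""
    return all(tolerates(pod_tolerations, t) for t in template_taints(node_pool_taints))


# Hypothetical taints: one plain, one carrying the ignore-taint prefix.
plain = [{"key": "example.com/initializing", "value": "true", "effect": "NoSchedule"}]
prefixed = [{
    "key": "ignore-taint.cluster-autoscaler.kubernetes.io/initializing",
    "value": "true",
    "effect": "NoSchedule",
}]

print(would_scale_up([], plain))     # pod without tolerations: pool is excluded
print(would_scale_up([], prefixed))  # prefixed taint is dropped from the template
```

In this model, a pod without tolerations blocks scale-up for the plainly tainted pool, while the prefixed taint is filtered out before the check runs.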

hterik avatar Oct 17 '22 05:10 hterik

Hello @hterik

To better understand your request:

  • Does the DaemonSet that is configured on the tainted NodePool have a toleration?
  • Does the pod to be scheduled have a toleration as well?

I don't understand the purpose of what you want to do. If you need to schedule pods onto a node pool only after a DaemonSet pod has started, maybe you can use an init container that curls a health-check endpoint on the DaemonSet pod.

Thanks in advance

carvido1 avatar Oct 18 '22 21:10 carvido1

Yes, the DaemonSet that fully initializes the node requires a toleration for the taint. Other pods should not have the toleration. If initializing the DaemonSet takes very long, it may be better to schedule the pod on an old node, if such resources become available first. Otherwise the pod will be scheduled on the new node and wait very long for the DaemonSet to start up completely. In our case it's not just starting the DaemonSet, but also downloading and baking a huge dataset into a hostPath that worker pods use, which can take 10-60 minutes. See https://github.com/kubernetes/autoscaler/issues/5251 for a more elaborate description.
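
As a rough sketch of that setup, the Python below builds a DaemonSet manifest whose pods tolerate the startup taint, while worker pods would simply omit the toleration. The DaemonSet name, image, and taint key are hypothetical placeholders, and the step that removes the taint once the dataset is ready is left out.

```python
import json

# Hypothetical startup taint placed on new nodes in the pool.
STARTUP_TAINT = {
    "key": "ignore-taint.cluster-autoscaler.kubernetes.io/dataset-not-ready",
    "value": "true",
    "effect": "NoSchedule",
}


def initializer_daemonset(name, image, taint):
    """Build a DaemonSet manifest whose pods tolerate the given taint."""
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Only the initializer tolerates the taint; worker pods do not.
                    "tolerations": [{
                        "key": taint["key"],
                        "operator": "Exists",
                        "effect": taint["effect"],
                    }],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


# Name and image are placeholders, not values from this issue.
manifest = initializer_daemonset(
    "node-initializer", "example.com/node-initializer:latest", STARTUP_TAINT
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied like any other manifest, since Kubernetes accepts JSON as well as YAML.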

hterik avatar Oct 19 '22 04:10 hterik

Action required from @Azure/aks-pm

ghost avatar Apr 22 '23 16:04 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar May 07 '23 18:05 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar May 23 '23 00:05 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jun 07 '23 06:06 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jun 22 '23 12:06 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jul 07 '23 12:07 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jul 22 '23 18:07 ghost

Any progress on this? Because AKS forcibly taints their spot nodes, being able to ignore taints when scaling up would be nice.

aidandj avatar Jul 31 '23 20:07 aidandj

ignore-taints has been renamed to startup-taints in upstream cluster-autoscaler; see https://github.com/kubernetes/autoscaler/pull/6132 and https://github.com/kubernetes/autoscaler/pull/6218. The need for exposing this option in AKS remains.

hterik avatar Mar 25 '24 13:03 hterik

Issue needing attention of @Azure/aks-leads

@kevinkrp93 could you provide an update please?

sjwaight avatar Jun 05 '25 00:06 sjwaight

Currently, the workaround is to annotate with status-taint.cluster-autoscaler.kubernetes.io or startup-taint.cluster-autoscaler.kubernetes.io. The cluster-autoscaler flag will not be available in the near future. Adding it to the backlog.
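
For reference, a small Python sketch of how the prefixes mentioned in this thread could be told apart. It assumes they are matched against the taint key, and that the older ignore-taint prefix is handled like a startup taint, in line with the rename noted earlier in the thread. The mapping and the `taint_kind` helper are illustrative, not an AKS or cluster-autoscaler API.

```python
# Illustrative mapping only; assumes prefixes are matched against taint keys.
PREFIX_KINDS = {
    "startup-taint.cluster-autoscaler.kubernetes.io/": "startup",
    "status-taint.cluster-autoscaler.kubernetes.io/": "status",
    # Assumption: the legacy prefix behaves like a startup taint after the rename.
    "ignore-taint.cluster-autoscaler.kubernetes.io/": "startup",
}


def taint_kind(key):
    """Return 'startup', 'status', or None for an ordinary taint key."""
    for prefix, kind in PREFIX_KINDS.items():
        if key.startswith(prefix):
            return kind
    return None


# Hypothetical taint keys used only to exercise the helper.
for key in (
    "startup-taint.cluster-autoscaler.kubernetes.io/dataset-not-ready",
    "status-taint.cluster-autoscaler.kubernetes.io/maintenance",
    "ignore-taint.cluster-autoscaler.kubernetes.io/dataset-not-ready",
    "example.com/dedicated",
):
    print(key, "->", taint_kind(key))
```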

kevinkrp93 avatar Jun 05 '25 02:06 kevinkrp93

This issue has been automatically marked as stale because it has not had any activity for 180 days. It will be closed if no further activity occurs within 7 days of this comment. @kevinkrp93

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @hterik feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question, issue, or suggestion.