machine-controller-manager
Label gpu nodes
How to categorize this issue?
/area auto-scaling /kind enhancement /priority 3
What would you like to be added: Label GPU nodes with the autoscaler label "worker.gardener.cloud/accelerator".
Why is this needed: Several reasons:
- Cluster Autoscaler (CA) can then scale down such nodes efficiently.
- CA will treat nodes with the GPU label as NotReady (even if they are Ready in reality) until the nodes broadcast their gpu resources (that is, until the GPU drivers are installed). This helps in the case scaleDownUnneededTime=0, which customers often set in their clusters to help blue-green rollouts in the maintenance window. For details on the use case, refer to https://sap-ti.slack.com/archives/C9CEBQPGE/p1700465892059819
@himanshu-kun Label area/auotscaling does not exist.
The labels can be added by looking at the machineClass.nodeTemplate.capacity.gpu field. The only problem is that MCM currently doesn't reconcile a machine immediately on a machineClass event. That needs to be implemented, as also stated in the issue https://github.com/gardener/machine-controller-manager/issues/517
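To make the idea concrete, here is a minimal Go sketch of deriving the label from the node template capacity. It is not MCM code: `NodeTemplate` is a simplified, hypothetical stand-in for machineClass.nodeTemplate, the capacity is assumed to be a plain integer string rather than a Kubernetes quantity, and the label value is passed in by the caller because this issue does not say what it should be.

```go
package main

import "strconv"

// AcceleratorLabel is the label key proposed in this issue.
const AcceleratorLabel = "worker.gardener.cloud/accelerator"

// NodeTemplate is a hypothetical, simplified stand-in for
// machineClass.nodeTemplate. Capacity values are assumed to be
// plain integer strings, e.g. {"cpu": "8", "gpu": "1"}.
type NodeTemplate struct {
	Capacity map[string]string
}

// WithAcceleratorLabel returns a copy of labels. The accelerator label is
// added only when the template advertises at least one GPU. The label
// value is an assumption supplied by the caller.
func WithAcceleratorLabel(labels map[string]string, tmpl NodeTemplate, value string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	// A missing or unparsable "gpu" entry is treated as "no GPU".
	gpus, err := strconv.ParseInt(tmpl.Capacity["gpu"], 10, 64)
	if err != nil || gpus <= 0 {
		return out
	}
	out[AcceleratorLabel] = value
	return out
}
```

Calling `WithAcceleratorLabel(nodeLabels, tmpl, "true")` with a template whose capacity has `"gpu": "1"` would add the label, while a template without a gpu entry leaves the labels unchanged. The input map is copied rather than mutated so the helper can be called repeatedly during reconciliation without side effects.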
The autoscaler could segregate GPU nodes based on the GPU label that the implementation defines (the autoscaler learns about it through the interface method GPULabel()). It then calculates only GPU utilization for the GPU nodes and applies a different threshold defined for them.
Refer to https://github.com/gardener/autoscaler/blob/dacb105216e2fe6d49e801e8f36cdaf1b8f0a7da/cluster-autoscaler/core/scale_down.go#L638-L652