machine-controller-manager

Label gpu nodes

Open himanshu-kun opened this issue 3 years ago • 3 comments

How to categorize this issue?

/area auto-scaling /kind enhancement /priority 3

What would you like to be added: Label nodes that have GPUs with the autoscaler label "worker.gardener.cloud/accelerator".

Why is this needed: Several reasons:

  • Cluster Autoscaler (CA) can then scale down such nodes efficiently.
  • CA treats nodes that carry the GPU label as NotReady (even if they are actually Ready) until the nodes advertise their GPU resources, that is, until the GPU drivers are installed. This helps when scaleDownUnneededTime=0, which customers often set in their clusters to support blue-green rollouts in the maintenance window. For details on the use case, refer to https://sap-ti.slack.com/archives/C9CEBQPGE/p1700465892059819
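The readiness behaviour described in the second bullet can be sketched roughly as follows. This is a simplified model, not the autoscaler's actual code: the `Node` type, the `effectiveReady` helper, and the use of `nvidia.com/gpu` as the advertised resource name are assumptions made for illustration.

```go
package main

import "fmt"

const (
	gpuLabel    = "worker.gardener.cloud/accelerator" // label proposed in this issue
	gpuResource = "nvidia.com/gpu"                    // assumed resource name advertised once drivers are installed
)

// Node is a hypothetical, stripped-down stand-in for a Kubernetes node.
type Node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int64
	Ready       bool // readiness as reported by the kubelet
}

// effectiveReady models the idea that a GPU-labelled node only counts as
// ready once it advertises a non-zero amount of GPU resources.
func effectiveReady(n Node) bool {
	if !n.Ready {
		return false
	}
	if _, hasGPULabel := n.Labels[gpuLabel]; !hasGPULabel {
		return true // ordinary node: kubelet readiness is enough
	}
	return n.Allocatable[gpuResource] > 0
}

func main() {
	booting := Node{
		Name:   "gpu-node-1",
		Labels: map[string]string{gpuLabel: "true"},
		Ready:  true, // kubelet is Ready, but drivers are not installed yet
	}
	fmt.Println(booting.Name, "treated as ready:", effectiveReady(booting))

	booting.Allocatable = map[string]int64{gpuResource: 1}
	fmt.Println(booting.Name, "treated as ready:", effectiveReady(booting))
}
```

With scaleDownUnneededTime=0, a freshly joined GPU node that is not yet counted as ready would not be evaluated as an idle, ready node while its drivers are still installing.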

himanshu-kun avatar Jun 08 '22 14:06 himanshu-kun

@himanshu-kun Label area/auotscaling does not exist.

gardener-robot avatar Jun 08 '22 14:06 gardener-robot

The labels can be added by looking at the machineClass.nodeTemplate.capacity.gpu field. The only problem is that MCM currently doesn't reconcile machines immediately on a machineClass event. That needs to be implemented, as also stated in https://github.com/gardener/machine-controller-manager/issues/517
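A minimal sketch of that derivation, assuming a simplified `NodeTemplate` type with plain integer capacities (the real field holds Kubernetes resource quantities). The `labelsFor` helper and the label value `"true"` are hypothetical; this issue does not specify which value the label should carry.

```go
package main

import "fmt"

const gpuLabel = "worker.gardener.cloud/accelerator"

// NodeTemplate is a hypothetical, simplified view of machineClass.nodeTemplate.
type NodeTemplate struct {
	Capacity map[string]int64 // e.g. {"cpu": 8, "gpu": 1}
}

// labelsFor returns a copy of the existing node labels, adding the GPU label
// when the template declares a non-zero GPU capacity.
func labelsFor(t NodeTemplate, existing map[string]string) map[string]string {
	out := make(map[string]string, len(existing)+1)
	for k, v := range existing {
		out[k] = v
	}
	if t.Capacity["gpu"] > 0 {
		out[gpuLabel] = "true" // assumed value, for illustration only
	}
	return out
}

func main() {
	gpuTemplate := NodeTemplate{Capacity: map[string]int64{"cpu": 8, "gpu": 1}}
	fmt.Println(labelsFor(gpuTemplate, map[string]string{"pool": "worker-a"}))
}
```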

himanshu-kun avatar Jun 08 '22 14:06 himanshu-kun

The autoscaler can segregate GPU nodes based on the GPU label that the implementation defines (the autoscaler learns about it through the interface method GPULabel()). It then calculates only GPU utilization for the GPU nodes and applies a separate threshold to them. Refer to https://github.com/gardener/autoscaler/blob/dacb105216e2fe6d49e801e8f36cdaf1b8f0a7da/cluster-autoscaler/core/scale_down.go#L638-L652
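The idea can be illustrated with a rough sketch. It is not the code at the link above: the `Node` and `Pod` types, the helper names, the `nvidia.com/gpu` resource name, and the threshold parameters are assumptions chosen for the example.

```go
package main

import "fmt"

const (
	gpuLabel    = "worker.gardener.cloud/accelerator"
	gpuResource = "nvidia.com/gpu" // assumed GPU resource name
)

// Node and Pod are hypothetical, simplified stand-ins for the Kubernetes types.
type Node struct {
	Labels      map[string]string
	Allocatable map[string]float64
}

type Pod struct {
	Requests map[string]float64
}

// ratio returns requested/allocatable for one resource (0 if none is allocatable).
func ratio(n Node, pods []Pod, resource string) float64 {
	alloc := n.Allocatable[resource]
	if alloc == 0 {
		return 0
	}
	var requested float64
	for _, p := range pods {
		requested += p.Requests[resource]
	}
	return requested / alloc
}

// utilization looks only at GPU usage for GPU-labelled nodes, and at the
// larger of CPU and memory usage for all other nodes.
func utilization(n Node, pods []Pod) (float64, string) {
	if _, isGPU := n.Labels[gpuLabel]; isGPU {
		return ratio(n, pods, gpuResource), gpuResource
	}
	cpu, mem := ratio(n, pods, "cpu"), ratio(n, pods, "memory")
	if cpu >= mem {
		return cpu, "cpu"
	}
	return mem, "memory"
}

// scaleDownCandidate compares the utilization against the threshold that
// matches the node kind.
func scaleDownCandidate(n Node, pods []Pod, threshold, gpuThreshold float64) bool {
	util, resource := utilization(n, pods)
	if resource == gpuResource {
		return util < gpuThreshold
	}
	return util < threshold
}

func main() {
	node := Node{
		Labels:      map[string]string{gpuLabel: "true"},
		Allocatable: map[string]float64{"cpu": 8, gpuResource: 4},
	}
	pods := []Pod{{Requests: map[string]float64{"cpu": 7, gpuResource: 1}}}
	util, resource := utilization(node, pods)
	fmt.Println(resource, util, scaleDownCandidate(node, pods, 0.5, 0.5))
}
```

In this sketch the node is a scale-down candidate despite its high CPU requests, because only GPU utilization is considered once the label is present.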

himanshu-kun avatar Dec 06 '22 08:12 himanshu-kun