aws-virtual-gpu-device-plugin

Autoscaler support

Open dempti opened this issue 3 years ago • 5 comments

GPU sharing works perfectly fine, but when I try to scale pods based on GPU share, cluster-autoscaler is unable to scale up instances to meet the requirement and logs the following errors:

clusterautoscaler-aws-cluster-autoscaler-6dbcb4d4f7-fv5w7 aws-cluster-autoscaler I0908 02:56:58.534530       1 scale_up.go:288] Pod resnet-deployment-8978c7f89-2469s can't be scheduled on eks-clusterNodegroupclusterdefa-aTwrPbQ2r3sD-60bdce6a-2014-ffda-69e8-b6f67da592f2, predicate checking error: Insufficient k8s.amazonaws.com/vgpu; predicateName=NodeResourcesFit; reasons: Insufficient k8s.amazonaws.com/vgpu; debugInfo=
clusterautoscaler-aws-cluster-autoscaler-6696574c75-zf65d aws-cluster-autoscaler I0908 03:01:29.704188       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"resnet-deployment-8978c7f89-gtnf6", UID:"a147f1be-a9d7-45a0-bb72-cd26a783ef9c", APIVersion:"v1", ResourceVersion:"5616318", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up: 1 Insufficient k8s.amazonaws.com/vgpu

dempti avatar Sep 08 '21 03:09 dempti

Works for me. Here are the steps:

  1. Create a node group config file for the eksctl tool with GPU-capable instances.
  2. Add this label to the config file (in the labels section): k8s.amazonaws.com/accelerator: vgpu
  3. Add two tags in the tags section:
     k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: vgpu
     k8s.io/cluster-autoscaler/node-template/resources/k8s.amazonaws.com/vgpu: "2"
  4. Create the node group with --install-nvidia-plugin=false

The newly created nodes will be properly labeled for the vgpu plugin, and the autoscaler will know that this node group can provide the necessary resources when a pod requests them.
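For illustration, the steps above could be put together in an eksctl config file roughly like the sketch below. Only the label and the two tags come from this comment; the cluster name, region, node group name, instance type, and sizes are placeholder assumptions, and the use of an unmanaged `nodeGroups` entry is also an assumption rather than something confirmed in this thread.

```yaml
# Minimal sketch of an eksctl node group config (e.g. saved as nodegroup.yaml).
# Placeholder values: my-cluster, us-west-2, vgpu-nodes, g4dn.xlarge, min/max sizes.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # hypothetical cluster name
  region: us-west-2       # hypothetical region
nodeGroups:
  - name: vgpu-nodes      # hypothetical node group name
    instanceType: g4dn.xlarge   # any GPU-capable instance type
    minSize: 0
    maxSize: 3
    desiredCapacity: 0
    labels:
      # Step 2: label the nodes for the vgpu plugin
      k8s.amazonaws.com/accelerator: vgpu
    tags:
      # Step 3: node-template hints for cluster-autoscaler
      k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: vgpu
      k8s.io/cluster-autoscaler/node-template/resources/k8s.amazonaws.com/vgpu: "2"
# Step 4 (run separately):
#   eksctl create nodegroup --config-file=nodegroup.yaml --install-nvidia-plugin=false
```

The "2" in the resources tag is the value used in the steps above; it presumably needs to match the number of vgpu units each node actually advertises.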

Source (under Scaling from zero): https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html#ca-view-logs

alexpirogovski avatar Oct 07 '21 08:10 alexpirogovski

@alexpirogovski I believe the NVIDIA plugin isn't installed by default and we need to install it separately as a DaemonSet. In that case, is step 4 necessary? I was not able to get it to work with the 3 other additions you suggested.

admiral-srinjoy avatar Jan 25 '22 08:01 admiral-srinjoy

@admiral-srinjoy you can follow this issue for a solution: https://github.com/kubernetes/autoscaler/issues/4315

dempti avatar Jan 25 '22 08:01 dempti

> @alexpirogovski I believe the NVIDIA plugin isn't installed by default and we need to install it separately as a DaemonSet. In that case, is step 4 necessary? I was not able to get it to work with the 3 other additions you suggested.

@admiral-srinjoy AFAIR the NVIDIA plugin and aws-virtual-gpu-device-plugin are mutually exclusive.

alexpirogovski avatar Jan 25 '22 08:01 alexpirogovski

Thanks @dempti, this helps.

admiral-srinjoy avatar Jan 25 '22 09:01 admiral-srinjoy