
Specifying Specific GPU Models for Pods in Nodes with Multiple GPU Types

anencore94 opened this issue 1 year ago • 5 comments

Issue or feature description

I am currently working with a Kubernetes cluster where some nodes are equipped with multiple types of NVIDIA GPUs. For example, Node A has one A100 GPU and one V100 GPU. In such a setup, I am looking for a way to specify which GPU model should be allocated when a user creates a GPU-allocated pod.

From my understanding, in such cases, we would typically request a GPU in our pod specifications using resources.limits with nvidia.com/gpu: 1. However, this approach doesn't seem to provide a way to distinguish between different GPU models.
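
For example, a typical request today looks like this (the pod name, container name, and image are just placeholders for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example            # placeholder name
spec:
  containers:
  - name: app                  # placeholder container
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1      # requests "a GPU", but cannot say which model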

Is there a feature or method within the NVIDIA GPU Operator or Kubernetes ecosystem that allows for such specific GPU model selection during pod creation? If not, are there any best practices or recommended approaches to ensure a pod is scheduled with a specific type of GPU when multiple models are present in the same node?

Thank you for your time and assistance.

anencore94 avatar Jan 18 '24 08:01 anencore94

@anencore94 there is unfortunately no supported way of accomplishing this today with the device plugin API.

Dynamic Resource Allocation, a new API for requesting and allocating resources in Kubernetes, would allow us to naturally support such configurations, but it is currently an alpha feature.
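
To give a rough idea of the direction, a sketch of what a DRA-based request might look like, assuming the alpha resource.k8s.io/v1alpha2 API. The resource class name and the claim-parameters object below are hypothetical placeholders for whatever a future NVIDIA DRA driver exposes, not a shipped API:

# Hypothetical claim parameters: the group, kind, and fields are placeholders.
apiVersion: gpu.resource.example.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: a100-only
spec:
  productName: NVIDIA-A100-PCIE-40GB
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: a100-claim-template
spec:
  spec:
    resourceClassName: gpu.example.com   # placeholder resource class
    parametersRef:
      apiGroup: gpu.resource.example.com
      kind: GpuClaimParameters
      name: a100-only
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-example
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: a100-claim-template
  containers:
  - name: app
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu              # the container consumes the claim defined above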

cdesiniotis avatar Jan 25 '24 17:01 cdesiniotis

@cdesiniotis Thanks for sharing :). You mean implementing this feature on top of the Dynamic Resource Allocation API will take quite a long time, I guess..

anencore94 avatar Jan 30 '24 03:01 anencore94

I was able to pick the GPU by setting the NVIDIA_VISIBLE_DEVICES environment variable:

apiVersion: v1
kind: Pod
metadata:
  name: vllm-openai
  namespace: training
spec:
  runtimeClassName: nvidia
  containers:
  - name: vllm-openai
    image: "vllm/vllm-openai:latest"
    args: ["--model", "Qwen/Qwen1.5-14B-Chat"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES   # select the GPU by its zero-based index on the node
      value: "0"
    resources:
      limits:
        nvidia.com/gpu: 1

The value is the zero-based index of the GPU on the node.

These other vars may also work, but I have not tested them: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html
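
One related option from those docs: NVIDIA_VISIBLE_DEVICES also accepts GPU UUIDs (as printed by nvidia-smi -L), which avoids depending on enumeration order. The snippet below replaces the env block in the pod spec above; the UUID is a placeholder and I have not tested this variant either:

    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # placeholder UUID from nvidia-smi -L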

laszlocph avatar Mar 04 '24 19:03 laszlocph

@laszlocph Thanks for your case! However, I'd like to control it in a k8s-native way. 🥲

anencore94 avatar Mar 05 '24 04:03 anencore94

I do this via nodeSelector.

kubectl get nodes -L nvidia.com/gpu.count -L nvidia.com/gpu.product
NAME            STATUS   ROLES           AGE    VERSION   GPU.COUNT   GPU.PRODUCT
dell-mx740c-2   Ready    control-plane   3d8h   v1.26.3   1           NVIDIA-A100-PCIE-40GB
dell-mx740c-3   Ready    control-plane   3d8h   v1.26.3   2           Tesla-T4
dell-mx740c-7   Ready    <none>          3d8h   v1.26.3   2           Quadro-RTX-8000
dell-mx740c-8   Ready    <none>          3d8h   v1.26.3   2           NVIDIA-A100-PCIE-40GB

I can use gpu.product in the nodeSelector to ensure the pod lands on a node with the intended GPU type, like this:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-ver-740c-8
spec:
  restartPolicy: OnFailure
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-PCIE-40GB"
    nvidia.com/gpu.count: "2"
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"

jjaymick001 avatar Mar 25 '24 23:03 jjaymick001