Sometimes one GPU card is allocated to multiple pods
1. Issue or feature description
Is it possible that one GPU card is assigned to multiple pods? As far as I know, GPU sharing among multiple pods is not easy, but in our cluster many pods use the same GPU. This is not what we expect.
As shown in the following command output, GPU GPU-52e32369-aced-8688-5124-395e9c636a33 is allocated to two different pods.
We didn't run any program in the second pod, but we still see utilization on that GPU.
Check GPU status in the first pod
# kubectl exec -ti -n=g622ipmst108115 vru31ttwcc13-v9nxl -- nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-52e32369-aced-8688-5124-395e9c636a33)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-ebaee821-cd6a-8e1d-0cca-2c7d6f96dbea)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-23136e0d-ad68-0d22-75e9-5b8aba28aa2b)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-3e0613ed-1528-0b22-5fd9-68c76759264b)
# kubectl exec -ti -n=g622ipmst108115 vru31ttwcc13-v9nxl -- nvidia-smi
Tue Sep 10 13:04:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1B:00.0 Off | 0 |
| N/A 49C P0 294W / 300W | 15149MiB / 32480MiB | 94% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:1C:00.0 Off | 0 |
| N/A 43C P0 291W / 300W | 15149MiB / 32480MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 41C P0 276W / 300W | 15149MiB / 32480MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 44C P0 288W / 300W | 15149MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
Check GPU status in the second pod
# kubectl exec -ti -n=default jimmy-test-1 -- nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-52e32369-aced-8688-5124-395e9c636a33)
# kubectl exec -ti -n=default jimmy-test-1 -- nvidia-smi
Tue Sep 10 21:05:48 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1B:00.0 Off | 0 |
| N/A 49C P0 201W / 300W | 15149MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
Can you show me the original pod specs for these two different pods? Did the first one make an explicit request for 4 GPUs, or just leave the request blank? By default (for better or worse), if you use one of the NVIDIA-sponsored CUDA images (or, in fact, any Docker image that sets NVIDIA_VISIBLE_DEVICES=all), all GPUs are injected into the container whether or not an explicit request was made for them. It looks like that might be what is going on here.
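If that is what is happening, one possible mitigation is to explicitly override that variable in containers that do not request nvidia.com/gpu, so the runtime does not inject unallocated GPUs. A minimal sketch, assuming the base image sets NVIDIA_VISIBLE_DEVICES=all; the pod name and image below are placeholders, not taken from your cluster:
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-example                    # hypothetical pod name
spec:
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:10.1-base  # placeholder for any CUDA-based image
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "none"                       # or "void"; overrides the image default of "all"
    resources:
      limits:
        cpu: "1"
        memory: 1Gi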
@klueska we have 8 GPUs in each host, and we explicitly define the nvidia.com/gpu resource. The specs of the two pods are below.
We just upgraded our bare-metal cluster from v1.10.11 to v1.11.10; I'm not sure if that is related.
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: jimmy-test-label
  name: jimmy-test-1
  namespace: default
spec:
  containers:
  - image: ngc/nvidia/tensorflow-18.12-py3-v1:latest
    imagePullPolicy: Always
    name: jimmy
    ports:
    - containerPort: 22
      name: ssh
      protocol: TCP
    - containerPort: 8888
      name: jupyter
      protocol: TCP
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-z6sl2
      readOnly: true
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: app
  name: vdd9n
  namespace: g62ss2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: availability_zone
            operator: DoesNotExist
  containers:
  - image: ngc/nvidia/caffe2-18.08-py3-v1:latest
    imagePullPolicy: Always
    name: xxxxx
    ports:
    - containerPort: 22
      name: ssh
      protocol: TCP
    - containerPort: 8888
      name: jupyter
      protocol: TCP
    resources:
      limits:
        cpu: "16"
        memory: 360Gi
        nvidia.com/gpu: "4"
      requests:
        cpu: "16"
        memory: 360Gi
        nvidia.com/gpu: "4"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
Having the same issue. Any updates on this?
@RakeshRaj97 since this issue is quite old, could you provide information on the version of the device plugin you are running, as well as example pod specs?
I got this fixed. In my case the error was in the YAML config file, where I specified the resources field twice:
resources:
  limits:
    nvidia.com/gpu: 1
resources:
  limits:
    cpu: "16"
    memory: 30Gi
  requests:
    cpu: "16"
    memory: 30Gi
The parser picked the latter block and ignored the first, so the container never actually requested nvidia.com/gpu.
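For comparison, a corrected version (a sketch using the same values as above) merges everything into a single resources block so the GPU request is not silently dropped:
resources:
  limits:
    cpu: "16"
    memory: 30Gi
    nvidia.com/gpu: 1   # kept in the same block so it is actually applied
  requests:
    cpu: "16"
    memory: 30Gi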
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.