Sometimes one GPU card is allocated to multiple pods
1. Issue or feature description
Is it possible that one GPU card is assigned to multiple pods? As far as I know, GPU sharing among multiple pods is not easy, but in our cluster many pods use the same GPU. This is not what we expect.
As shown in the following command output, GPU GPU-52e32369-aced-8688-5124-395e9c636a33 is allocated to two different pods.
We didn't run any program in the second pod, but we still see utilization on that GPU.
Check GPU status in the first pod
# kubectl exec -ti -n=g622ipmst108115 vru31ttwcc13-v9nxl -- nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-52e32369-aced-8688-5124-395e9c636a33)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-ebaee821-cd6a-8e1d-0cca-2c7d6f96dbea)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-23136e0d-ad68-0d22-75e9-5b8aba28aa2b)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-3e0613ed-1528-0b22-5fd9-68c76759264b)
# kubectl exec -ti -n=g622ipmst108115 vru31ttwcc13-v9nxl -- nvidia-smi
Tue Sep 10 13:04:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1B:00.0 Off | 0 |
| N/A 49C P0 294W / 300W | 15149MiB / 32480MiB | 94% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:1C:00.0 Off | 0 |
| N/A 43C P0 291W / 300W | 15149MiB / 32480MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 41C P0 276W / 300W | 15149MiB / 32480MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 44C P0 288W / 300W | 15149MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
Check GPU status in the second pod
# kubectl exec -ti -n=default jimmy-test-1 -- nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-52e32369-aced-8688-5124-395e9c636a33)
# kubectl exec -ti -n=default jimmy-test-1 -- nvidia-smi
Tue Sep 10 21:05:48 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1B:00.0 Off | 0 |
| N/A 49C P0 201W / 300W | 15149MiB / 32480MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
Can you show me the original pod specs for these two different pods? Did the first one make an explicit request for 4 GPUs, or just leave the request blank? By default (for better or worse), if you use one of the NVIDIA-sponsored CUDA images (or, in fact, any Docker image that sets NVIDIA_VISIBLE_DEVICES=all), all GPUs are injected into the container whether or not an explicit request was made for them. It looks like that might be what is going on here.
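If that is what is happening, one possible mitigation is to explicitly override that variable in containers that do not request nvidia.com/gpu, so the runtime does not inject unallocated GPUs. A minimal sketch, assuming the base image sets NVIDIA_VISIBLE_DEVICES=all; the pod name and image below are placeholders, not taken from your cluster:
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-example                    # hypothetical pod name
spec:
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:10.1-base  # placeholder for any CUDA-based image
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "none"                       # or "void"; overrides the image default of "all"
    resources:
      limits:
        cpu: "1"
        memory: 1Gi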
@klueska we have 8 GPUs in each host, and we explicitly define the nvidia.com/gpu resource. The specs of the two pods are below.
We just upgraded our bare-metal cluster from v1.10.11 to v1.11.10; I'm not sure if that is related.
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: jimmy-test-label
  name: jimmy-test-1
  namespace: default
spec:
  containers:
  - image: ngc/nvidia/tensorflow-18.12-py3-v1:latest
    imagePullPolicy: Always
    name: jimmy
    ports:
    - containerPort: 22
      name: ssh
      protocol: TCP
    - containerPort: 8888
      name: jupyter
      protocol: TCP
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-z6sl2
      readOnly: true
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: app
  name: vdd9n
  namespace: g62ss2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: availability_zone
            operator: DoesNotExist
  containers:
  - image: ngc/nvidia/caffe2-18.08-py3-v1:latest
    imagePullPolicy: Always
    name: xxxxx
    ports:
    - containerPort: 22
      name: ssh
      protocol: TCP
    - containerPort: 8888
      name: jupyter
      protocol: TCP
    resources:
      limits:
        cpu: "16"
        memory: 360Gi
        nvidia.com/gpu: "4"
      requests:
        cpu: "16"
        memory: 360Gi
        nvidia.com/gpu: "4"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
Having the same issue. Any updates on this?
@RakeshRaj97 since this issue is quite old, could you provide information on the version of the device plugin you are running, as well as example pod specs?
I got this fixed. In my case the error was in the YAML config file, where I specified the resources field twice:
resources:
  limits:
    nvidia.com/gpu: 1
resources:
  limits:
    cpu: "16"
    memory: 30Gi
  requests:
    cpu: "16"
    memory: 30Gi
The parser picked the latter block and ignored the first, so the container never actually requested nvidia.com/gpu.
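For comparison, a corrected version (a sketch using the same values as above) merges everything into a single resources block so the GPU request is not silently dropped:
resources:
  limits:
    cpu: "16"
    memory: 30Gi
    nvidia.com/gpu: 1   # kept in the same block so it is actually applied
  requests:
    cpu: "16"
    memory: 30Gi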
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.