gpushare-device-plugin
NVIDIA_VISIBLE_DEVICES wrong value in OCI spec
Hello, I'm trying to use the gpushare device plugin only to expose a gpu-mem resource (in MiB) from a Kubernetes GPU node. All the NVIDIA pieces (drivers, nvidia-container-runtime, etc.) are installed, and everything works fine except for one thing.
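For context, here is roughly how the device plugin itself is started on the node so that gpu-mem is advertised in MiB (a sketch from memory of the upstream example DaemonSet, so treat the exact flag names as an assumption on my side):

command:
- gpushare-device-plugin-v2      # plugin binary from the upstream image
- -logtostderr
- --memory-unit=MiB              # assumption: this flag switches aliyun.com/gpu-mem reporting from GiB to MiB

For example, here is a test Pod: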
apiVersion: v1
kind: Pod
metadata:
  namespace: text-detector
  name: gpu-test-bald
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        aliyun.com/gpu-mem: "151"
      limits:
        aliyun.com/gpu-mem: "151"
  nodeName: gpu-node10
  tolerations:
  - operator: "Exists"
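The node advertises the expected capacity; below is the relevant (truncated) part of the node description, taken with:

$ kubectl describe node gpu-node10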
gpu-node10
...
Capacity:
aliyun.com/gpu_count: 1
aliyun.com/gpu-mem: 32768
...
root@gpu-node10:~# nvidia-smi
Tue Nov 8 11:32:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 32W / 250W | 24237MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 9268 C python3 1799MiB |
| 0 N/A N/A 12821 C python3 1883MiB |
| 0 N/A N/A 14311 C python3 2105MiB |
| 0 N/A N/A 16938 C python3 1401MiB |
| 0 N/A N/A 16939 C python3 1401MiB |
| 0 N/A N/A 29183 C python3 2215MiB |
| 0 N/A N/A 43383 C python3 1203MiB |
| 0 N/A N/A 52358 C python3 1939MiB |
| 0 N/A N/A 54439 C python3 1143MiB |
| 0 N/A N/A 54788 C python3 2123MiB |
| 0 N/A N/A 56272 C python3 1143MiB |
| 0 N/A N/A 56750 C python3 2089MiB |
| 0 N/A N/A 61595 C python3 2089MiB |
| 0 N/A N/A 71269 C python3 1694MiB |
+-----------------------------------------------------------------------------+
I've noticed that NVIDIA_VISIBLE_DEVICES somehow gets an unexpected value, which makes container creation fail, even though nvidia-smi above shows roughly 8,500 MiB (32768 - 24237) still free on the GPU. The relevant part of the Pod description:
Containers:
cuda-vector-add:
Container ID: docker://9eae154ebc7e662985e37777354e439d47eb0e7abb45d346be200101d64a3273
Image: registry.k.mycompany.com/experimental/cuda-vector-add:v0.1
Image ID: docker-pullable://registry.k.mycompany.com/experimental/cuda-vector-add@sha256:b09d5bc4243887012cc95be04f17e997bd73f52a16cae30ade28dd01bffa5e01
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
This exact error,
OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
appears because the NVIDIA_VISIBLE_DEVICES env var is given an unacceptable value:
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run"
Here is the container's full OCI runtime spec, where this value shows up:
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run", < ------ Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=151",
"ALIYUN_COM_GPU_MEM_CONTAINER=151",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PORT=443",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"KUBERNETES_SERVICE_PORT=443",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
"cwd": "/usr/local/cuda/samples/0_Simple/vectorAdd",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"oomScoreAdj": 1000
},
"root": {
"path": "/var/lib/docker/overlay2/5b9782752b5d79f2d3646b92e41511a3b959f3d2e7ed1c57c4e299dfb8cd6965/merged"
},
"hostname": "gpu-test-bald",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"ro",
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/termination-log",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/containers/cuda-vector-add/8473aa30",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/resolv.conf",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hostname",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/hostname",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hosts",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/etc-hosts",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/mounts/shm",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/var/run/secrets/kubernetes.io/serviceaccount",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/volumes/kubernetes.io~secret/default-token-thv9d",
"options": [
"rbind",
"ro",
"rprivate"
]
}
],
"hooks": {
"prestart": [
{
"path": "/usr/bin/nvidia-container-runtime-hook",
"args": [
"/usr/bin/nvidia-container-runtime-hook",
"prestart"
]
}
]
},
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 5,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 3,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 9,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 8,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 0,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 1,
"access": "rwm"
},
{
"allow": false,
"type": "c",
"major": 10,
"minor": 229,
"access": "rwm"
}
],
"memory": {
"disableOOMKiller": false
},
"cpu": {
"shares": 2,
"period": 100000
},
"blockIO": {
"weight": 0
}
},
"cgroupsPath": "kubepods-besteffort-pod685974b9_5eb0_11ed_bada_001eb9697543.slice:docker:664e21c310b62b2e1c3537388127812c7e2f482cb5cf40fa52280e3b62cf2646",
"namespaces": [
{
"type": "mount"
},
{
"type": "network",
"path": "/proc/27057/ns/net"
},
{
"type": "uts"
},
{
"type": "pid"
},
{
"type": "ipc",
"path": "/proc/27057/ns/ipc"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware"
],
"readonlyPaths": [
"/proc/asound",
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
Adding NVIDIA_VISIBLE_DEVICES=all to the Pod YAML fixes it, as described here:
apiVersion: v1
kind: Pod
metadata:
  namespace: text-detector
  name: gpu-test-bald
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
    imagePullPolicy: IfNotPresent
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    resources:
      requests:
        aliyun.com/gpu-mem: "153"
      limits:
        aliyun.com/gpu-mem: "153"
  nodeName: gpu-node10
  tolerations:
  - operator: "Exists"
The resulting OCI spec (excerpt):
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-153MiB-to-run", <----------Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=153",
"ALIYUN_COM_GPU_MEM_CONTAINER=153",
"NVIDIA_VISIBLE_DEVICES=all", <-------------------Here it is
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"KUBERNETES_SERVICE_PORT=443",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"KUBERNETES_PORT_443_TCP_PORT=443",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
...
Now the same Pod is created successfully and runs to completion:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-test-bald 0/1 Completed 0 3m40s 10.62.97.59 gpu-node10 <none> <none>
$ kubectl logs -f gpu-test-bald
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
So could you explain whether this behaviour of the NVIDIA_VISIBLE_DEVICES env var is correct? It seems like it is not.