gpushare-device-plugin
NVIDIA_VISIBLE_DEVICES wrong value in OCI spec
Hello, I'm trying to use the gpushare device plugin only to expose a gpu-mem resource (in MiB) from a Kubernetes GPU node. All the NVIDIA pieces (drivers, nvidia-container-runtime, etc.) are installed, and everything works fine except for one thing.
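For context, here is roughly how the device plugin itself is started on the node so that gpu-mem is advertised in MiB (a sketch from memory of the upstream example DaemonSet, so treat the exact flag names as an assumption on my side):

command:
- gpushare-device-plugin-v2      # plugin binary from the upstream image
- -logtostderr
- --memory-unit=MiB              # assumption: this flag switches aliyun.com/gpu-mem reporting from GiB to MiB

For example, here is a test Pod: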
apiVersion: v1
kind: Pod
metadata:
  namespace: text-detector
  name: gpu-test-bald
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        aliyun.com/gpu-mem: "151"
      limits:
        aliyun.com/gpu-mem: "151"
  nodeName: gpu-node10
  tolerations:
  - operator: "Exists"
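The node advertises the expected capacity; below is the relevant (truncated) part of the node description, taken with:

$ kubectl describe node gpu-node10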
gpu-node10
...
Capacity:
aliyun.com/gpu_count: 1
aliyun.com/gpu-mem: 32768
...
root@gpu-node10:~# nvidia-smi
Tue Nov 8 11:32:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 32W / 250W | 24237MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 9268 C python3 1799MiB |
| 0 N/A N/A 12821 C python3 1883MiB |
| 0 N/A N/A 14311 C python3 2105MiB |
| 0 N/A N/A 16938 C python3 1401MiB |
| 0 N/A N/A 16939 C python3 1401MiB |
| 0 N/A N/A 29183 C python3 2215MiB |
| 0 N/A N/A 43383 C python3 1203MiB |
| 0 N/A N/A 52358 C python3 1939MiB |
| 0 N/A N/A 54439 C python3 1143MiB |
| 0 N/A N/A 54788 C python3 2123MiB |
| 0 N/A N/A 56272 C python3 1143MiB |
| 0 N/A N/A 56750 C python3 2089MiB |
| 0 N/A N/A 61595 C python3 2089MiB |
| 0 N/A N/A 71269 C python3 1694MiB |
+-----------------------------------------------------------------------------+
I've noticed that NVIDIA_VISIBLE_DEVICES somehow gets an unexpected value, which makes container creation fail, even though nvidia-smi above shows roughly 8,500 MiB (32768 - 24237) still free on the GPU. The relevant part of the Pod description:
Containers:
cuda-vector-add:
Container ID: docker://9eae154ebc7e662985e37777354e439d47eb0e7abb45d346be200101d64a3273
Image: registry.k.mycompany.com/experimental/cuda-vector-add:v0.1
Image ID: docker-pullable://registry.k.mycompany.com/experimental/cuda-vector-add@sha256:b09d5bc4243887012cc95be04f17e997bd73f52a16cae30ade28dd01bffa5e01
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
This exact error,
OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
appears because the NVIDIA_VISIBLE_DEVICES env var is given an unacceptable value:
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run"
Here is the container's full OCI runtime spec, where this value shows up:
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run", < ------ Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=151",
"ALIYUN_COM_GPU_MEM_CONTAINER=151",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PORT=443",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"KUBERNETES_SERVICE_PORT=443",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
"cwd": "/usr/local/cuda/samples/0_Simple/vectorAdd",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"oomScoreAdj": 1000
},
"root": {
"path": "/var/lib/docker/overlay2/5b9782752b5d79f2d3646b92e41511a3b959f3d2e7ed1c57c4e299dfb8cd6965/merged"
},
"hostname": "gpu-test-bald",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"ro",
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/termination-log",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/containers/cuda-vector-add/8473aa30",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/resolv.conf",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hostname",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/hostname",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hosts",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/etc-hosts",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/mounts/shm",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/var/run/secrets/kubernetes.io/serviceaccount",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/volumes/kubernetes.io~secret/default-token-thv9d",
"options": [
"rbind",
"ro",
"rprivate"
]
}
],
"hooks": {
"prestart": [
{
"path": "/usr/bin/nvidia-container-runtime-hook",
"args": [
"/usr/bin/nvidia-container-runtime-hook",
"prestart"
]
}
]
},
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 5,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 3,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 9,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 8,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 0,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 1,
"access": "rwm"
},
{
"allow": false,
"type": "c",
"major": 10,
"minor": 229,
"access": "rwm"
}
],
"memory": {
"disableOOMKiller": false
},
"cpu": {
"shares": 2,
"period": 100000
},
"blockIO": {
"weight": 0
}
},
"cgroupsPath": "kubepods-besteffort-pod685974b9_5eb0_11ed_bada_001eb9697543.slice:docker:664e21c310b62b2e1c3537388127812c7e2f482cb5cf40fa52280e3b62cf2646",
"namespaces": [
{
"type": "mount"
},
{
"type": "network",
"path": "/proc/27057/ns/net"
},
{
"type": "uts"
},
{
"type": "pid"
},
{
"type": "ipc",
"path": "/proc/27057/ns/ipc"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware"
],
"readonlyPaths": [
"/proc/asound",
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
Adding NVIDIA_VISIBLE_DEVICES=all to the Pod YAML fixes it, as described here:
apiVersion: v1
kind: Pod
metadata:
  namespace: text-detector
  name: gpu-test-bald
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
    imagePullPolicy: IfNotPresent
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    resources:
      requests:
        aliyun.com/gpu-mem: "153"
      limits:
        aliyun.com/gpu-mem: "153"
  nodeName: gpu-node10
  tolerations:
  - operator: "Exists"
The resulting OCI spec (excerpt):
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-153MiB-to-run", <----------Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=153",
"ALIYUN_COM_GPU_MEM_CONTAINER=153",
"NVIDIA_VISIBLE_DEVICES=all", <-------------------Here it is
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"KUBERNETES_SERVICE_PORT=443",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"KUBERNETES_PORT_443_TCP_PORT=443",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
...
Now the same Pod is created successfully and runs to completion:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-test-bald 0/1 Completed 0 3m40s 10.62.97.59 gpu-node10 <none> <none>
$ kubectl logs -f gpu-test-bald
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
So could you explain whether this behaviour of the NVIDIA_VISIBLE_DEVICES env var is correct? It seems like it is not.