k8s-device-plugin

Resource type labelling is incomplete/incorrect

Open anaconda2196 opened this issue 3 years ago • 20 comments

Hi,

I am using nvidia-device-plugin v0.7.0, gpu-feature-discovery v0.4.1, and Kubernetes v1.20.2.

On my A100 GPU machine:


nvidia-docker version
NVIDIA Docker: 2.6.0

nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0     13       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      5       MIG 2g.10gb          1         0          0:2     |
+--------------------------------------------------------------------+
|   0      1       MIG 3g.20gb          2         0          0:3     |
+--------------------------------------------------------------------+

 cat /etc/docker/daemon.json
{
"log-driver":"json-file",
"log-opts": { "max-size" : "10m", "max-file" : "10" }
, "runtimes": { "nvidia": { "path": "/usr\/bin\/nvidia-container-runtime","runtimeArgs": []}}
, "default-runtime" : "nvidia" 
}
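
A quick way to confirm Docker picked up this configuration (a sketch; output formatting varies by Docker version):

# Should list the "nvidia" runtime and show "Default Runtime: nvidia"
docker info | grep -i runtime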

I was able to successfully deploy the gpu-feature-discovery pods as well as the nvidia-device-plugin pods.

kubectl -n kube-system get pods
NAME                                                          READY   STATUS    RESTARTS   AGE

gpu-feature-discovery-6rdc8                                   1/1     Running   0          24m

nfd-master-6dd87d999-4xlrw                                    1/1     Running   0          24m
nfd-worker-rrwms                                              1/1     Running   0          24m
nvidia-device-plugin-fsq2j                                    1/1     Running   0          24m
nvidiagpubeat-59jd5                                           1/1     Running   0          88m

The following labels have been applied on my A100 GPU node with MIG strategy mixed:

kubectl get node GPU_NODE -o yaml 

labels:
....
...
    nvidia.com/cuda.driver.major: "450"
    nvidia.com/cuda.driver.minor: "80"
    nvidia.com/cuda.driver.rev: "02"
    nvidia.com/cuda.runtime.major: "11"
    nvidia.com/cuda.runtime.minor: "0"
    nvidia.com/gfd.timestamp: "1626160093"
    nvidia.com/gpu.compute.major: "8"
    nvidia.com/gpu.compute.minor: "0"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.family: ampere
    nvidia.com/gpu.machine: ProLiant-DL380-Gen10
    nvidia.com/gpu.memory: "40537"
    nvidia.com/gpu.product: A100-PCIE-40GB
    nvidia.com/mig-1g.5gb.count: "1"
    nvidia.com/mig-1g.5gb.engines.copy: "1"
    nvidia.com/mig-1g.5gb.engines.decoder: "0"
    nvidia.com/mig-1g.5gb.engines.encoder: "0"
    nvidia.com/mig-1g.5gb.engines.jpeg: "0"
    nvidia.com/mig-1g.5gb.engines.ofa: "0"
    nvidia.com/mig-1g.5gb.memory: "4864"
    nvidia.com/mig-1g.5gb.multiprocessors: "14"
    nvidia.com/mig-1g.5gb.slices.ci: "1"
    nvidia.com/mig-1g.5gb.slices.gi: "1"
    nvidia.com/mig-2g.10gb.count: "1"
    nvidia.com/mig-2g.10gb.engines.copy: "2"
    nvidia.com/mig-2g.10gb.engines.decoder: "1"
    nvidia.com/mig-2g.10gb.engines.encoder: "0"
    nvidia.com/mig-2g.10gb.engines.jpeg: "0"
    nvidia.com/mig-2g.10gb.engines.ofa: "0"
    nvidia.com/mig-2g.10gb.memory: "9984"
    nvidia.com/mig-2g.10gb.multiprocessors: "28"
    nvidia.com/mig-2g.10gb.slices.ci: "2"
    nvidia.com/mig-2g.10gb.slices.gi: "2"
    nvidia.com/mig-3g.20gb.count: "1"
    nvidia.com/mig-3g.20gb.engines.copy: "3"
    nvidia.com/mig-3g.20gb.engines.decoder: "2"
    nvidia.com/mig-3g.20gb.engines.encoder: "0"
    nvidia.com/mig-3g.20gb.engines.jpeg: "0"
    nvidia.com/mig-3g.20gb.engines.ofa: "0"
    nvidia.com/mig-3g.20gb.memory: "20096"
    nvidia.com/mig-3g.20gb.multiprocessors: "42"
    nvidia.com/mig-3g.20gb.slices.ci: "3"
    nvidia.com/mig-3g.20gb.slices.gi: "3"
    nvidia.com/mig.strategy: mixed
...
...

The gpu-feature-discovery pod is working correctly; however, the problem is with the resource types advertised on my A100 GPU node.

I would expect to see the following:

kubectl describe node
...
Capacity:
nvidia.com/mig-1g.5gb:   1
nvidia.com/mig-2g.10gb:  1
nvidia.com/mig-3g.20gb:  1
...
Allocatable:
nvidia.com/mig-1g.5gb:   1
nvidia.com/mig-2g.10gb:  1
nvidia.com/mig-3g.20gb:  1

Instead, I am getting:

kubectl describe node
...
Capacity:
nvidia.com/gpu: 0
...
Allocatable:
nvidia.com/gpu: 0

...
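
A compact way to dump the extended resources the kubelet is advertising (a sketch; GPU_NODE is the placeholder node name used above, and the output format varies by kubectl version):

# Capacity and allocatable as reported in the node status
kubectl get node GPU_NODE -o jsonpath='{.status.capacity}'
kubectl get node GPU_NODE -o jsonpath='{.status.allocatable}'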


Also, I see the following when checking the nvidia-device-plugin logs.

kubectl -n kube-system logs nvidia-device-plugin-fsq2j
2021/07/14 22:45:43 Loading NVML
2021/07/14 22:45:43 Starting FS watcher.
2021/07/14 22:45:43 Starting OS watcher.
2021/07/14 22:45:43 Retreiving plugins.
2021/07/14 22:45:43 No devices found. Waiting indefinitely.

I have been going through the docs but have not been able to figure out what the issue is. Any help would be much appreciated.

Thank you

anaconda2196 avatar Jul 14 '21 23:07 anaconda2196

Looks like this is a similar issue to https://github.com/NVIDIA/k8s-device-plugin/issues/192. @klueska

anaconda2196 avatar Jul 14 '21 23:07 anaconda2196

With mig-strategy=single, I checked both versions, v0.7.0 and v0.9.0.

After upgrading the drivers:

nvidia-smi
Thu Jul 15 16:40:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                   On |
| N/A   56C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64
                                1.4.0-1                        @libnvidia-container
libnvidia-container1.x86_64     1.4.0-1                        @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1                        @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2                        @nvidia-container-runtime
nvidia-docker2.noarch           2.6.0-1                        @nvidia-docker   

nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 1g.5gb       19        7          4:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        8          5:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        9          6:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       11          0:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       12          1:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       13          2:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       14          3:1     |
+----------------------------------------------------+

nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0      7       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      8       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      9       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     11       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     12       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     13       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     14       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+

The gpu-feature-discovery pod runs correctly and applies the correct labels to the A100 GPU node, whether mig-strategy=single or mixed.

The problem is with the nvidia-device-plugin pod, which is crash-looping.

v0.9.0

kubectl -n kube-system logs nvidia-device-plugin-xgv7t
2021/07/15 23:49:47 Loading NVML
2021/07/15 23:49:47 Starting FS watcher.
2021/07/15 23:49:47 Starting OS watcher.
2021/07/15 23:49:47 Retreiving plugins.
2021/07/15 23:49:47 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc42016eec0, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc4202e2000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc4202e2000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc4201fbf50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751


v0.7.0

kubectl -n kube-system logs nvidia-device-plugin-jl85z
2021/07/15 23:57:37 Loading NVML
2021/07/15 23:57:37 Starting FS watcher.
2021/07/15 23:57:37 Starting OS watcher.
2021/07/15 23:57:37 Retreiving plugins.
2021/07/15 23:57:37 Shutdown of NVML returned: <nil>
panic: No MIG devices present on node

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0xfdbb58, 0x6, 0xa9a700, 0xfdbb58)
	/go/src/nvidia-device-plugin/mig-strategy.go:115 +0x43f
main.main()
	/go/src/nvidia-device-plugin/main.go:103 +0x413

anaconda2196 avatar Jul 16 '21 00:07 anaconda2196

Hi @anaconda2196, could you attach the node labels detected by gpu-feature-discovery? To keep it simple, let's consider the mig-strategy=single case.

elezar avatar Jul 16 '21 11:07 elezar

@anaconda2196 I noted from the nvidia-smi output that you have persistence mode disabled. Would it be possible to see what effect enabling persistence mode has on this?
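
For reference, a minimal sketch of enabling persistence mode on the host (requires root; -i selects the GPU index):

# Enable persistence mode on GPU 0
sudo nvidia-smi -i 0 -pm 1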

elezar avatar Jul 16 '21 11:07 elezar

Hi @elezar

I have enabled persistence mode on my A100 GPU node, but no luck; the same error persists.

nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Fri Jul 16 07:13:16 2021
Driver Version                            : 460.73.01
CUDA Version                              : 11.2

Attached GPUs                             : 1
GPU 00000000:86:00.0
    Product Name                          : A100-PCIE-40GB
    Product Brand                         : NVIDIA
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : Enabled
        Pending                           : Enabled
    MIG Device
        Index                             : 0
        GPU Instance ID                   : 7
        Compute Instance ID               : 0
        Device Attributes
            Shared
                Multiprocessor count      : 14
                Copy Engine count         : 1
                Encoder count             : 0
                Decoder count             : 0
                OFA count                 : 0
                JPG count                 : 0

nvidia-smi
Fri Jul 16 07:27:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:86:00.0 Off |                   On |
| N/A   47C    P0    33W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

node labels:

node-role.kubernetes.io/worker=
                    nvidia.com/cuda.driver.major=460
                    nvidia.com/cuda.driver.minor=73
                    nvidia.com/cuda.driver.rev=01
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1626445770
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=7
                    nvidia.com/gpu.engines.copy=1
                    nvidia.com/gpu.engines.decoder=0
                    nvidia.com/gpu.engines.encoder=0
                    nvidia.com/gpu.engines.jpeg=0
                    nvidia.com/gpu.engines.ofa=0
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=ProLiant-DL380-Gen10
                    nvidia.com/gpu.memory=4864
                    nvidia.com/gpu.multiprocessors=14
                    nvidia.com/gpu.product=A100-PCIE-40GB-MIG-1g.5gb
                    nvidia.com/gpu.slices.ci=1
                    nvidia.com/gpu.slices.gi=1
                    nvidia.com/mig.strategy=single
$ kubectl -n kube-system get pods
NAME                                                          READY   STATUS             RESTARTS   AGE

gpu-feature-discovery-mhd6c                                   1/1     Running            0          6m50s

nfd-master-6dd87d999-xmg7z                                    1/1     Running            0          6m50s
nfd-worker-qm77s                                              1/1     Running            0          6m50s
nvidia-device-plugin-sss4g                                    0/1     CrashLoopBackOff   6          6m53s

$ kubectl -n kube-system logs nvidia-device-plugin-sss4g
2021/07/16 14:35:32 Loading NVML
2021/07/16 14:35:32 Starting FS watcher.
2021/07/16 14:35:32 Starting OS watcher.
2021/07/16 14:35:32 Retreiving plugins.
2021/07/16 14:35:32 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc4201b4e80, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42018ec00, 0xae5a40, 0xc4201a4010, 0xc4201b6000, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42018ec00, 0xc4201b6000, 0x7, 0x7, 0x4567e0, 0xc420221f50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751

anaconda2196 avatar Jul 16 '21 14:07 anaconda2196

This is very strange behaviour. The underlying code to detect and enumerate the various MIG devices is shared between gpu-feature-discovery and the k8s-device-plugin. We’ve also not had similar reports to this from anyone else running the same versions of everything you’ve listed here.

Can you update the plugin podspec to ignore the failure on the plugin executable itself and then run a 'sleep forever'?

And once you’ve done that, exec into the container and run nvidia-smi.
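
One way to do that without editing the chart, as a sketch (this assumes the DaemonSet is named nvidia-device-plugin and the container nvidia-device-plugin-ctr, matching the pod description later in this thread):

# Override the container command so the pod stays up instead of crashing
kubectl -n kube-system patch daemonset nvidia-device-plugin --type=strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"nvidia-device-plugin-ctr","command":["sh","-c","sleep infinity"]}]}}}}'
# Then exec into the recreated pod and inspect the devices it can see
kubectl -n kube-system exec -it <plugin-pod> -- nvidia-smi -L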

klueska avatar Jul 16 '21 16:07 klueska

Hi @klueska @elezar, this is weird. Inside the pod, if I run nvidia-smi, it shows no MIG devices.

nvidia-device-plugin-qffmb                                    1/1     Running   0          16s

Abhisheks-MacBook-Pro:~ abhishekacharya$ kubectl -n kube-system logs nvidia-device-plugin-qffmb
Abhisheks-MacBook-Pro:~ abhishekacharya$ kubectl -n kube-system exec -it nvidia-device-plugin-qffmb -- bash
root@nvidia-device-plugin-qffmb:/# 
root@nvidia-device-plugin-qffmb:/# 
root@nvidia-device-plugin-qffmb:/# nvidia-smi
Fri Jul 16 17:51:35 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:86:00.0 Off |                   On |
| N/A   48C    P0    33W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

anaconda2196 avatar Jul 16 '21 17:07 anaconda2196

Can you show your podspec?

klueska avatar Jul 16 '21 18:07 klueska

kubectl -n kube-system describe pod nvidia-device-plugin-qffmb
Name:                 nvidia-device-plugin-qffmb
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 xxxx--NODE-NAME--xxx
Start Time:           Fri, 16 Jul 2021 10:50:53 -0700
Labels:               app.kubernetes.io/instance=nvidia-device-plugin
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=d9466b556
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/podIP: 10.192.0.227/32
                      cni.projectcalico.org/podIPs: 10.192.0.227/32
                      kubernetes.io/psp: hcp-psp-privileged
                      scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.192.0.227
IPs:
  IP:           10.192.0.227
Controlled By:  DaemonSet/nvidia-device-plugin
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  docker://4d93a5fe0c44b2452cd0f05beed87503e985e2dbd884735dda66165b4bd3ac71
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.9.0
    Image ID:      docker-pullable://nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      sleep infinity
    Args:
      --mig-strategy=single
      --pass-device-specs=true
      --fail-on-init-error=true
      --device-list-strategy=envvar
      --device-id-strategy=uuid
      --nvidia-driver-root=/
    State:          Running
      Started:      Fri, 16 Jul 2021 10:50:57 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-khb4p (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  default-token-khb4p:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-khb4p
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  24m   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-qffmb to xxxx--NODE-NAME--xxx
  Normal  Pulling    24m   kubelet            Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
  Normal  Pulled     24m   kubelet            Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0" in 2.299548038s
  Normal  Created    24m   kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    24m   kubelet            Started container nvidia-device-plugin-ctr

On the A100 GPU machine, the MIG devices already exist:

nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)


But I have noticed two things when I exec into the pod:

root@nvidia-device-plugin-qffmb:/var/lib/kubelet/device-plugins# ls
DEPRECATION  kubelet.sock  kubelet_internal_checkpoint

The /proc/driver/nvidia sub-directories are not fully mounted inside the container. On my A100 GPU machine:

[root@xxx]# ls /proc/driver/nvidia

capabilities  gpus  params  patches  registry  suspend  suspend_depth  version  warnings

 
Inside the container:

root@nvidia-device-plugin-qffmb:/# ls /proc/driver/nvidia      

gpus  params  registry  version


anaconda2196 avatar Jul 16 '21 18:07 anaconda2196

If libnvidia-container doesn't think you want access to MIG devices, it won't inject these proc files (or the dev nodes they point to). When it comes to the plugin, this typically happens if you set NVIDIA_VISIBLE_DEVICES to something other than 'all'. That's why I wanted to see your pod spec, but it doesn't seem you've overridden this.

Can you show us the value just to be sure.
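
A quick way to check from outside the container (a sketch; the pod name is the one used earlier in this thread):

kubectl -n kube-system exec nvidia-device-plugin-qffmb -- env | grep NVIDIA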

klueska avatar Jul 16 '21 18:07 klueska

values.yaml

legacyDaemonsetAPI: false
compatWithCPUManager: false
migStrategy: none
failOnInitError: true
deviceListStrategy: envvar
deviceIDStrategy: uuid
nvidiaDriverRoot: "/"

nameOverride: ""
fullnameOverride: ""
selectorLabelsOverride: {}

namespace: kube-system

imagePullSecrets: []
image:
  repository: nvcr.io/nvidia/k8s-device-plugin
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""

updateStrategy:
  type: RollingUpdate

podSecurityContext: {}
securityContext: {}

resources: {}
nodeSelector: {}
affinity: {}
tolerations:
  # This toleration is deprecated. Kept here for backward compatibility
  # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

my daemonset.yaml

spec:
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
      - image: {{ include "nvidia-device-plugin.fullimage" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        name: nvidia-device-plugin-ctr
        command: ["sh", "-c", "sleep infinity"]
        args:
        - "--mig-strategy={{ .Values.migStrategy }}"
        - "--pass-device-specs={{ .Values.compatWithCPUManager }}"
        - "--fail-on-init-error={{ .Values.failOnInitError }}"
        - "--device-list-strategy={{ .Values.deviceListStrategy }}"
        - "--device-id-strategy={{ .Values.deviceIDStrategy }}"
        - "--nvidia-driver-root={{ .Values.nvidiaDriverRoot }}"
        securityContext:
        {{- if ne (len .Values.securityContext) 0 }}
          {{- toYaml .Values.securityContext | nindent 10 }}
        {{- else if .Values.compatWithCPUManager }}
          privileged: true
        {{- else }}
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        {{- end }}
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
        {{- with .Values.resources }}
        resources:
          {{- toYaml . | nindent 10 }}
        {{- end }}
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}

anaconda2196 avatar Jul 16 '21 19:07 anaconda2196

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - MY-GPU-NODE
  containers:
  - args:
    - --mig-strategy=single
    - --pass-device-specs=true
    - --fail-on-init-error=true
    - --device-list-strategy=envvar
    - --device-id-strategy=uuid
    - --nvidia-driver-root=/
    command:
    - sh
    - -c
    - sleep infinity
    image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
    imagePullPolicy: Always
    name: nvidia-device-plugin-ctr
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet/device-plugins
      name: device-plugin
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-khb4p
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: mip-bd-dev80.mip.storage.hpecorp.net
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet/device-plugins
      type: ""
    name: device-plugin
  - name: default-token-khb4p
    secret:
      defaultMode: 420
      secretName: default-token-khb4p
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:53Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:57Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:57Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:53Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://4d93a5fe0c44b2452cd0f05beed87503e985e2dbd884735dda66165b4bd3ac71
    image: nvidia/k8s-device-plugin:v0.9.0
    imageID: docker-pullable://nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
    lastState: {}
    name: nvidia-device-plugin-ctr
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-07-16T17:50:57Z"
  hostIP: 16.0.13.204
  phase: Running
  podIP: 10.192.0.227
  podIPs:
  - ip: 10.192.0.227
  qosClass: BestEffort
  startTime: "2021-07-16T17:50:53Z"


kubectl -n kube-system exec -it nvidia-device-plugin-qffmb -- bash
root@nvidia-device-plugin-qffmb:/# echo $NVIDIA_DISABLE_REQUIRE
true

anaconda2196 avatar Jul 16 '21 19:07 anaconda2196

Sorry. I just meant exec into the container and show the value of NVIDIA_VISIBLE_DEVICES environment variable.

klueska avatar Jul 16 '21 19:07 klueska

root@nvidia-device-plugin-mr26n:/# echo $NVIDIA_VISIBLE_DEVICES
all

anaconda2196 avatar Jul 16 '21 19:07 anaconda2196

Hi @elezar @klueska, I finally found the solution: https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_MIG_User_Guide.pdf

“Toggling MIG mode requires the CAP_SYS_ADMIN capability. Other MIG management, such as creating and destroying instances, requires superuser by default, but can be delegated to non privileged users by adjusting permissions to MIG capabilities in /proc/” - page:12.

https://github.com/NVIDIA/nvidia-container-runtime

Setting the environment variable NVIDIA_MIG_CONFIG_DEVICES=all on the plugin container fixed it.
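
A minimal sketch of adding that variable to the plugin DaemonSet (it assumes the DaemonSet and container names shown earlier in this thread; the MIG user guide also mentions CAP_SYS_ADMIN, which in a pod spec corresponds to privileged: true or capabilities.add: ["SYS_ADMIN"]):

kubectl -n kube-system patch daemonset nvidia-device-plugin --type=strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"nvidia-device-plugin-ctr","env":[{"name":"NVIDIA_MIG_CONFIG_DEVICES","value":"all"}]}]}}}}'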

Thank you so much for your support.

PS - Currently I have only tested with mig-strategy=single; I still need to test with mixed and will update the ticket soon!
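
For the mixed case, the pods would request the profile-specific resource names rather than nvidia.com/gpu; a sketch of such a request, with the resource name taken from the labels shown earlier in this issue:

kubectl run --image=nvidia/cuda:11.0-base --restart=Never \
  --limits=nvidia.com/mig-1g.5gb=1 mig-mixed-example -- nvidia-smi -L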

anaconda2196 avatar Jul 19 '21 06:07 anaconda2196

Glad to hear it's working now, but you shouldn't need to set this on the plugin. I'm still curious what is different about your setup that requires this.

klueska avatar Jul 19 '21 07:07 klueska

Yes, there is a bit of confusion here. I removed my A100 GPU machine from my setup and installed a fresh OS (CentOS 7.9) on it. Then I followed this document - https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker

Basically, I created the MIG devices (mig-strategy=single) and installed docker, the nvidia-driver, and the nvidia-container-toolkit as described in the documentation.

yum list installed | grep nvidia
libnvidia-container-tools.x86_64   1.4.0-1                        @libnvidia-container
libnvidia-container1.x86_64        1.4.0-1                        @libnvidia-container
nvidia-container-runtime.x86_64    3.5.0-1                        @nvidia-container-runtime
nvidia-container-toolkit.x86_64    1.5.1-2                        @nvidia-container-runtime
nvidia-docker2.noarch              2.6.0-1                        @nvidia-docker

The strange behaviour is that if I simply run:

docker run --rm --gpus=all  nvidia/cuda:11.0-base nvidia-smi
Mon Jul 19 18:02:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                   On |
| N/A   73C    P0    56W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


But when I pass the extra capability and environment variable, it works:

docker run --rm --gpus=all --cap-add SYS_ADMIN -e NVIDIA_MIG_CONFIG_DEVICES="all"  nvidia/cuda:11.0-base nvidia-smi
Mon Jul 19 18:03:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                   On |
| N/A   73C    P0    57W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   4  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   6  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

So this is unclear relative to the provided docs, because in my current setup I have to provide this parameter and environment variable to be able to deploy the nvidia-device-plugin pod. Can you check this - https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker?

Thank you for your time in advance.

anaconda2196 avatar Jul 19 '21 18:07 anaconda2196

Hi @elezar @klueska

Here is one more issue I am facing. I am able to deploy the nvidia-device-plugin and gpu-feature-discovery pods successfully on the A100 GPU node with mig-strategy=single.

A100 GPU | mig-strategy=single | Centos7 | kubernetes version - 1.20.2 | Docker version 20.10.7

yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64
                                1.4.0-1                        @libnvidia-container
libnvidia-container1.x86_64     1.4.0-1                        @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1                        @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2                        @nvidia-container-runtime
nvidia-docker2.noarch           2.6.0-1                        @nvidia-docker   

# cat /etc/docker/daemon.json
{
"log-driver":"json-file",
"log-opts": { "max-size" : "10m", "max-file" : "10" }
, "runtimes": { "nvidia": { "path": "/usr\/bin\/nvidia-container-runtime","runtimeArgs": []}}
, "default-runtime" : "nvidia" 
}

# nvidia-docker version
NVIDIA Docker: 2.6.0
/usr/bin/nvidia-docker: line 34: /usr/bin/docker: Permission denied
/usr/bin/nvidia-docker: line 34: /usr/bin/docker: Success

# nvidia-smi
Fri Jul 23 09:39:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:86:00.0 Off |                   On |
| N/A   53C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)


kubectl -n kube-system get pods
NAME                                                          READY   STATUS    RESTARTS   AGE

gpu-feature-discovery-single-9mm7x                            1/1     Running   0          15h

nfd-master-dd9568c97-lt5th                                    1/1     Running   0          17h
nfd-worker-5tlhj                                              1/1     Running   0          15h
nfd-worker-ftr22                                              1/1     Running   0          17h
nvidia-device-plugin-single-ct2v6                             1/1     Running   0          15h
nvidiagpubeat-xrjsv                                           1/1     Running   0          15h

$ kubectl -n kube-system logs nvidia-device-plugin-single-ct2v6
2021/07/23 01:14:02 Loading NVML
2021/07/23 01:14:02 Starting FS watcher.
2021/07/23 01:14:02 Starting OS watcher.
2021/07/23 01:14:02 Retreiving plugins.
2021/07/23 01:14:02 Starting GRPC server for 'nvidia.com/gpu'
2021/07/23 01:14:02 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/07/23 01:14:02 Registered device plugin for 'nvidia.com/gpu' with Kubelet

Following - https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html

For single strategy - point #7 - Deploy 7 pods, each consuming one MIG device (then read their logs and delete them)
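
For reference, that documented step boils down to roughly the following (a sketch based on the guide linked above; the image tag and kubectl run flags may differ across versions):

for i in $(seq 7); do
  kubectl run --image=nvidia/cuda:11.0-base --restart=Never \
    --limits=nvidia.com/gpu=1 mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"
done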

But it is not detecting the MIG devices; the example pods' logs are empty:

for i in $(seq 7); do echo "mig-single-example-${i}"; kubectl logs mig-single-example-${i}; echo ""; done
mig-single-example-1

mig-single-example-2

mig-single-example-3

mig-single-example-4

mig-single-example-5

mig-single-example-6

mig-single-example-7

When I do kubectl describe pod, I see:

Error: failed to start container "mig-single-example-1": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0: unknown device: unknown
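
One thing that might help narrow this down is asking libnvidia-container directly on the host what it can enumerate (a sketch; output and available subcommands vary slightly by libnvidia-container version):

nvidia-container-cli info
nvidia-container-cli list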

Can you help me with that? Which library or configuration setup am I still missing?

However, the interesting fact is that if I run the pods with privileged: true and pass the env variable, it works, but I don't want to do that.

$for i in $(seq 7); do kubectl run --privileged=true --image=nvidia/cuda:11.0-base --env="NVIDIA_VISIBLE_DEVICES=all" --env="NVIDIA_MIG_CONFIG_DEVICES=all" --restart=Never --limits=nvidia.com/gpu=1 mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"; done
pod/mig-single-example-1 created
pod/mig-single-example-2 created
pod/mig-single-example-3 created
pod/mig-single-example-4 created
pod/mig-single-example-5 created
pod/mig-single-example-6 created
pod/mig-single-example-7 created

$ kubectl -n default get pods
NAME                   READY   STATUS    RESTARTS   AGE
mig-single-example-1   1/1     Running   0          17s
mig-single-example-2   1/1     Running   0          17s
mig-single-example-3   1/1     Running   0          16s
mig-single-example-4   1/1     Running   0          16s
mig-single-example-5   1/1     Running   0          15s
mig-single-example-6   1/1     Running   0          15s
mig-single-example-7   1/1     Running   0          14s

$ for i in $(seq 7); do
> echo "mig-single-example-${i}";
> kubectl logs mig-single-example-${i}
> echo "";
> done
mig-single-example-1
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)
...
..
......

anaconda2196 avatar Jul 23 '21 16:07 anaconda2196

@klueska @elezar Okay, it worked for me on SUSE (SLES 15 SP2) but did not work on CentOS. I checked and compared config.toml for both OSes; everything is the same except user = "root:video". On SUSE (SLES 15 SP2), user = "root:video" is uncommented in config.toml, while on CentOS it is commented out.

So, I manually changed config.toml and uncommented user = "root:video". Then it worked for CentOS too.
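
A minimal sketch of that manual change (it assumes the default nvidia-container-toolkit config path and the exact commented form of the line; a backup copy is written first):

sudo sed -i.bak 's|^#user = "root:video"|user = "root:video"|' /etc/nvidia-container-runtime/config.toml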

Also, I came across this - https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/config/config.toml.centos https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/config/config.toml.opensuse-leap

Can you help me understand why config.toml differs between the two OSes? Also, what is a possible solution so that I don't have to change config.toml manually on CentOS?

Thank you; I truly appreciate your advice and help.

anaconda2196 avatar Jul 26 '21 09:07 anaconda2196

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]