Installation failed: k8s-device-plugin (v0.9.0)
1. Issue or feature description
I could install k8s-device-plugin (v0.7.3) without problems, but when I try to upgrade to v0.9.0, the errors below occur.
2. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The k8s-device-plugin container logs
2021/06/07 07:38:35 Loading NVML
2021/06/07 07:38:35 Starting FS watcher.
2021/06/07 07:38:35 Starting OS watcher.
2021/06/07 07:38:35 Retreiving plugins.
2021/06/07 07:38:35 Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
2021/06/07 07:38:35 Shutdown of NVML returned:
panic: Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
goroutine 1 [running]:
log.Panicln(0xc42057b910, 0x2, 0x2)
/usr/local/go/src/log/log.go:340 +0xc0
main.check(0xadec60, 0xc420481000)
/go/src/nvidia-device-plugin/nvidia.go:61 +0x81
main.(*MigDeviceManager).Devices(0xc42000c500, 0x0, 0x0, 0x0)
/go/src/nvidia-device-plugin/nvidia.go:129 +0x287
main.start(0xc4202c0ec0, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:155 +0x5d1
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc420432000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc420432000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc42034df50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751
Additional information that might help better understand your environment and reproduce the bug:
Kubernetes version is below:
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
@Kwonho could you describe your setup a little bit more clearly? The code path for the error you are seeing should only be triggered if one (or more) of the devices on your system are configured with MIG mode enabled and a mig.strategy other than mig.strategy=none is configured.
If you are using this in "standalone" mode (i.e. without the GPU operator), it may be that the underlying NVIDIA Container Toolkit components also need to be updated.
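A quick way to check which container-stack components are currently installed (a rough sketch; use dpkg on Debian/Ubuntu/DGX OS, or rpm on RHEL-based systems):
# report the libnvidia-container version in use
$ nvidia-container-cli --version
# list the installed NVIDIA container packages (Debian/Ubuntu; use rpm -qa | grep nvidia on RHEL/CentOS)
$ dpkg -l | grep -E 'nvidia-docker|nvidia-container'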
@elezar I used a DGX-A100 for the MIG test. When I install v0.7.3 there is no problem, but when I install v0.9.0 the same way, the errors occur. If you need more information, please let me know. Thanks.
Are you using the GPU-operator? Or is this a standard device plugin install?
Did you update the NVIDIA Container Runtime components as part of updating to 0.9.0? Which versions of libnvidia-container, nvidia-container-runtime, nvidia-container-toolkit, and nvidia-docker2 are installed (if any)?
If I recall correctly, there was a change in libnvidia-container 1.4.0 that was required due to how the /proc/driver/nvidia folder was being managed by the driver. This may be what we're seeing here.
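As a quick sanity check, one can look at whether the MIG capability paths referenced in the panic actually exist on the host (paths taken from the error message above; what is present depends on the installed driver):
# list the MIG capability files the driver exposes for GPU 0
$ ls -R /proc/driver/nvidia/capabilities/gpu0/mig/
# the specific path from the panic message
$ cat /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access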
I am using the standard device plugin install (Helm or YAML).
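For context, a standard Helm install with an explicit MIG strategy looks roughly like the sketch below (the nvdp repo alias is arbitrary and migStrategy=single is just an example value; adjust for your cluster):
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
# install the plugin chart with the desired MIG strategy
$ helm install --version=0.9.0 --generate-name --set migStrategy=single nvdp/nvidia-device-plugin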
And the runtime component versions are below:
ii  libnvidia-container1:amd64  1.1.0-1
ii  nvidia-container-runtime    3.1.4-1
ii  nvidia-container-toolkit    1.0.6-1
ii  nvidia-docker2              2.2.2-1
Could you update nvidia-docker2 to 2.6.0? This should pull in the other dependencies.
I will create a ticket to track adding this requirement to the documentation.
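On a Debian-based system such as DGX OS, that upgrade would look roughly like the following (the exact versions pulled in depend on the configured nvidia-docker repositories):
# upgrade nvidia-docker2; this pulls in matching nvidia-container-toolkit/runtime and libnvidia-container packages
$ sudo apt-get update
$ sudo apt-get install --only-upgrade nvidia-docker2
# restart Docker so the updated runtime hook takes effect, then verify the versions
$ sudo systemctl restart docker
$ dpkg -l | grep -E 'nvidia-docker|nvidia-container'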
I am also facing this issue while deploying the NVIDIA device plugin v0.9.0.
A100 GPU with MIG enabled.
nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
MIG 3g.20gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/1/0)
MIG 2g.10gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/5/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
kubectl logs for the nvidia-device-plugin pod:
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684
goroutine 1 [running]:
main.(*migStrategyMixed).GetPlugins(0x1042638, 0x5, 0xae1140, 0x1042638)
/go/src/nvidia-device-plugin/mig-strategy.go:171 +0xa41
main.start(0xc4201a6e80, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42017b080, 0xae5a40, 0xc42019e010, 0xc4201a8000, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42017b080, 0xc4201a8000, 0x7, 0x7, 0x4567e0, 0xc420211f50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751
Hi @anaconda2196. Is there only a single device in the host?
Which versions of the CUDA driver and the NVIDIA Container Toolkit (nvidia-docker) do you have installed? See https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html#mig-support-in-kubernetes
Hi @elezar
k8s version - 1.20.2
If I try with mig strategy: single, I am facing the same issue, not only with nvidia-device-plugin v0.9.0 but also with v0.7.0 (see https://github.com/NVIDIA/k8s-device-plugin/issues/257).
yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64   1.4.0-1   @libnvidia-container
libnvidia-container1.x86_64 1.4.0-1 @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1 @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2 @nvidia-container-runtime
nvidia-docker2.noarch 2.6.0-1 @nvidia-docker
migstrategy - single
# nvidia-smi
Thu Jul 15 13:58:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:86:00.0 Off | On |
| N/A 53C P0 34W / 250W | 25MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 14 0 6 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)
# nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|====================================================|
| 0 MIG 1g.5gb 19 7 4:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 8 5:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 9 6:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 11 0:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 12 1:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 13 2:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 14 3:1 |
+----------------------------------------------------+
# nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 7 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 8 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 9 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 11 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 12 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 13 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 14 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
kubectl get node GPU-NODE -o yaml
It looks like the gpu-feature-discovery pod is running correctly and has assigned labels to the A100 GPU node.
labels:
...
nvidia.com/cuda.driver.major=450
nvidia.com/cuda.driver.minor=80
nvidia.com/cuda.driver.rev=02
nvidia.com/cuda.runtime.major=11
nvidia.com/cuda.runtime.minor=0
nvidia.com/gfd.timestamp=1626381819
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.count=7
nvidia.com/gpu.engines.copy=1
nvidia.com/gpu.engines.decoder=0
nvidia.com/gpu.engines.encoder=0
nvidia.com/gpu.engines.jpeg=0
nvidia.com/gpu.engines.ofa=0
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=ProLiant-DL380-Gen10
nvidia.com/gpu.memory=4864
nvidia.com/gpu.multiprocessors=14
nvidia.com/gpu.product=A100-PCIE-40GB-MIG-1g.5gb
nvidia.com/gpu.slices.ci=1
nvidia.com/gpu.slices.gi=1
nvidia.com/mig.strategy=single
...
kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-dfrwf 1/1 Running 0 6m18s
nfd-master-6dd87d999-spkqp 1/1 Running 0 6m33s
nfd-worker-w2wbf 1/1 Running 0 6m33s
nvidia-device-plugin-9p462 0/1 Error 6 6m37s
$ kubectl -n kube-system logs nvidia-device-plugin-9p462
2021/07/15 20:49:32 Loading NVML
2021/07/15 20:49:32 Starting FS watcher.
2021/07/15 20:49:32 Starting OS watcher.
2021/07/15 20:49:32 Retreiving plugins.
2021/07/15 20:49:32 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684
goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc4201acec0, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42017ad80, 0xae5a40, 0xc4201a4010, 0xc4201ae000, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42017ad80, 0xc4201ae000, 0x7, 0x7, 0x4567e0, 0xc420219f50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751
The problem is with the resource type on my A100 GPU node. I am getting:
kubectl describe node
...
Capacity:
nvidia.com/gpu: 0
...
Allocatable:
nvidia.com/gpu: 0
...
with migStrategy=single, checked with both versions: v0.7.0 and v0.9.0.
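One way to see what the plugin actually advertises on the node (with migStrategy=single MIG devices show up under nvidia.com/gpu, with mixed they appear as nvidia.com/mig-1g.5gb style resources); GPU-NODE is a placeholder:
# list every nvidia.com/* resource the node currently reports as allocatable
$ kubectl get node GPU-NODE -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/")))'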
After upgrading drivers
nvidia-smi
Thu Jul 15 16:40:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:86:00.0 Off | On |
| N/A 56C P0 35W / 250W | 13MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 14 0 6 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64   1.4.0-1   @libnvidia-container
libnvidia-container1.x86_64 1.4.0-1 @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1 @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2 @nvidia-container-runtime
nvidia-docker2.noarch 2.6.0-1 @nvidia-docker
nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|====================================================|
| 0 MIG 1g.5gb 19 7 4:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 8 5:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 9 6:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 11 0:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 12 1:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 13 2:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 14 3:1 |
+----------------------------------------------------+
nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 7 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 8 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 9 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 11 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 12 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 13 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 14 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
The gpu-feature-discovery pod is running correctly and applies the correct labels to the A100 GPU node whether migStrategy is single or mixed.
The problem is with the nvidia-device-plugin pod, which keeps crash-looping.
v0.9.0
kubectl -n kube-system logs nvidia-device-plugin-xgv7t
2021/07/15 23:49:47 Loading NVML
2021/07/15 23:49:47 Starting FS watcher.
2021/07/15 23:49:47 Starting OS watcher.
2021/07/15 23:49:47 Retreiving plugins.
2021/07/15 23:49:47 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684
goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc42016eec0, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc4202e2000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc4202e2000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc4201fbf50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751
v0.7.0
kubectl -n kube-system logs nvidia-device-plugin-jl85z
2021/07/15 23:57:37 Loading NVML
2021/07/15 23:57:37 Starting FS watcher.
2021/07/15 23:57:37 Starting OS watcher.
2021/07/15 23:57:37 Retreiving plugins.
2021/07/15 23:57:37 Shutdown of NVML returned: <nil>
panic: No MIG devices present on node
goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0xfdbb58, 0x6, 0xa9a700, 0xfdbb58)
/go/src/nvidia-device-plugin/mig-strategy.go:115 +0x43f
main.main()
/go/src/nvidia-device-plugin/main.go:103 +0x413
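Once the driver and container-runtime packages are in order, the plugin pods have to be restarted so they re-enumerate the MIG devices; a sketch, assuming the static-YAML deployment where the DaemonSet is named nvidia-device-plugin-daemonset (Helm installs use a release-specific name):
# restart the device plugin DaemonSet and watch the pods come back
$ kubectl -n kube-system rollout restart daemonset/nvidia-device-plugin-daemonset
$ kubectl -n kube-system get pods | grep nvidia-device-plugin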
Same here, crash-looping with v0.12.2:
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0
Looks like this is a race-condition issue. Having the label nvidia.com/mig.config set on the node in question should trigger the mig-manager, allowing the device plugin to succeed.
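For anyone using the GPU Operator's mig-manager, applying (or re-applying) that label looks roughly like this; the all-1g.5gb value is just an example profile from mig-parted's default config, and GPU-NODE is a placeholder:
# label the node with the desired MIG configuration so mig-manager (re)configures the GPU
$ kubectl label node GPU-NODE nvidia.com/mig.config=all-1g.5gb --overwrite
# mig-manager typically reports progress via a companion nvidia.com/mig.config.state label
$ kubectl get node GPU-NODE --show-labels | tr ',' '\n' | grep 'nvidia.com/mig.config'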
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.