
Installation failed: k8s-device-plugin (v0.9.0)

Kwonho opened this issue 4 years ago • 13 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

I could install k8s-device-plugin (v0.7.3), but when I try to upgrade to v0.9.0 the errors below occur.

2. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [ ] The k8s-device-plugin container logs:

    2021/06/07 07:38:35 Loading NVML
    2021/06/07 07:38:35 Starting FS watcher.
    2021/06/07 07:38:35 Starting OS watcher.
    2021/06/07 07:38:35 Retreiving plugins.
    2021/06/07 07:38:35 Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
    2021/06/07 07:38:35 Shutdown of NVML returned:
    panic: Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access

goroutine 1 [running]:
log.Panicln(0xc42057b910, 0x2, 0x2)
	/usr/local/go/src/log/log.go:340 +0xc0
main.check(0xadec60, 0xc420481000)
	/go/src/nvidia-device-plugin/nvidia.go:61 +0x81
main.(*MigDeviceManager).Devices(0xc42000c500, 0x0, 0x0, 0x0)
	/go/src/nvidia-device-plugin/nvidia.go:129 +0x287
main.start(0xc4202c0ec0, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:155 +0x5d1
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc420432000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc420432000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc42034df50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751

Additional information that might help better understand your environment and reproduce the bug:

Kubernetes version is below:

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

Kwonho avatar Jun 07 '21 07:06 Kwonho

@Kwonho could you describe your setup a little bit more clearly? The code path for the error you are seeing should only be triggered if one (or more) of the devices on your system are configured with MIG mode enabled and a mig.strategy other than mig.strategy=none is configured.

If you are using this in "standalone" mode (i.e. without the GPU operator), it may be that the underlying NVIDIA Container Toolkit components also need to be updated.
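For reference, a minimal sketch of a standalone Helm install that selects the strategy explicitly (the repository URL and the migStrategy value follow the plugin's README for the 0.9.x chart; verify the exact values against the chart version you deploy):

# Add the device plugin Helm repo and install v0.9.0 with an explicit MIG strategy
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.9.0 \
  --set migStrategy=single   # or "mixed"; "none" disables MIG handling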

elezar avatar Jun 07 '21 12:06 elezar

@elezar I used a DGX A100 for the MIG test. When I install v0.7.3 there is no problem, but when I install v0.9.0 the same way, the errors above occur. If you need more information, please let me know. Thanks.

Kwonho avatar Jun 07 '21 13:06 Kwonho

Are you using the GPU-operator? Or is this a standard device plugin install?

Did you update the NVIDIA Container Runtime components as part of updating to 0.9.0? Which versions of libnvidia-container, nvidia-container-runtime, nvidia-container-toolkit, and nvidia-docker2 are installed (if any)?

elezar avatar Jun 07 '21 13:06 elezar

If I recall correctly, there was a change in libnvidia-container 1.4.0 that was required due to how the /proc/driver/nvidia folder was being managed by the driver. This may be what we're seeing here.
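If you want to confirm what is actually installed on the node, a quick check along these lines lists the relevant packages (pick the variant matching your distribution):

# Debian/Ubuntu
dpkg -l | grep -E 'libnvidia-container|nvidia-container|nvidia-docker'
# RHEL/CentOS
rpm -qa | grep -E 'libnvidia-container|nvidia-container|nvidia-docker'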

elezar avatar Jun 07 '21 13:06 elezar

I am using a standard device plugin install (Helm or YAML).

The runtime component versions are below:

ii libnvidia-container1:amd64 1.1.0-1
ii nvidia-container-runtime 3.1.4-1
ii nvidia-container-toolkit 1.0.6-1
ii nvidia-docker2 2.2.2-1

Kwonho avatar Jun 07 '21 13:06 Kwonho

Could you update nvidia-docker2 to 2.6.0? This should pull in the other dependencies.

I will create a ticket to track adding this requirement to the documentation.
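On a Debian-based system such as a DGX, something along these lines should perform the upgrade and pull in the matching libnvidia-container and nvidia-container-toolkit packages (the exact version pinning syntax is an assumption and varies per distribution):

sudo apt-get update
# Upgrade to the latest packaged release...
sudo apt-get install --only-upgrade nvidia-docker2
# ...or pin a specific version, e.g.: sudo apt-get install nvidia-docker2=2.6.0-1
sudo systemctl restart docker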

elezar avatar Jun 08 '21 05:06 elezar

I am also facing this issue while deploying the NVIDIA device plugin v0.9.0.

A100 GPU with MIG enabled:

nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/1/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/5/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)

nvidia-device-plugin pod logs (kubectl logs):

panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategyMixed).GetPlugins(0x1042638, 0x5, 0xae1140, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:171 +0xa41
main.start(0xc4201a6e80, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42017b080, 0xae5a40, 0xc42019e010, 0xc4201a8000, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42017b080, 0xc4201a8000, 0x7, 0x7, 0x4567e0, 0xc420211f50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751

anaconda2196 avatar Jul 15 '21 02:07 anaconda2196

Hi @anaconda2196. Is there only a single device in the host?

Which version of the CUDA driver and CUDA Container Toolkit (nvidia-docker) do you have installed? See https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html#mig-support-in-kubernetes

elezar avatar Jul 15 '21 08:07 elezar

Hi @elezar

k8s version - 1.20.2

If I try with mig strategy: single, I face the same issue not only with device plugin v0.9.0 but also with v0.7.0; see https://github.com/NVIDIA/k8s-device-plugin/issues/257.

yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64
                                1.4.0-1                        @libnvidia-container
libnvidia-container1.x86_64     1.4.0-1                        @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1                        @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2                        @nvidia-container-runtime
nvidia-docker2.noarch           2.6.0-1                        @nvidia-docker   


migstrategy - single

# nvidia-smi
Thu Jul 15 13:58:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                   On |
| N/A   53C    P0    34W / 250W |     25MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)


# nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 1g.5gb       19        7          4:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        8          5:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        9          6:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       11          0:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       12          1:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       13          2:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       14          3:1     |
+----------------------------------------------------+

# nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0      7       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      8       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      9       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     11       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     12       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     13       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     14       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+

kubectl get node GPU-NODE -o yaml
It looks like the gpu-feature-discovery pod is running correctly and has assigned labels to the A100 GPU node.

labels:
...


nvidia.com/cuda.driver.major=450
                    nvidia.com/cuda.driver.minor=80
                    nvidia.com/cuda.driver.rev=02
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=0
                    nvidia.com/gfd.timestamp=1626381819
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=7
                    nvidia.com/gpu.engines.copy=1
                    nvidia.com/gpu.engines.decoder=0
                    nvidia.com/gpu.engines.encoder=0
                    nvidia.com/gpu.engines.jpeg=0
                    nvidia.com/gpu.engines.ofa=0
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=ProLiant-DL380-Gen10
                    nvidia.com/gpu.memory=4864
                    nvidia.com/gpu.multiprocessors=14
                    nvidia.com/gpu.product=A100-PCIE-40GB-MIG-1g.5gb
                    nvidia.com/gpu.slices.ci=1
                    nvidia.com/gpu.slices.gi=1
                    nvidia.com/mig.strategy=single
...

kubectl -n kube-system get pods
NAME                                                          READY   STATUS    RESTARTS   AGE

gpu-feature-discovery-dfrwf                                   1/1     Running   0          6m18s

nfd-master-6dd87d999-spkqp                                    1/1     Running   0          6m33s
nfd-worker-w2wbf                                              1/1     Running   0          6m33s
nvidia-device-plugin-9p462                                    0/1     Error     6          6m37s

$ kubectl -n kube-system logs nvidia-device-plugin-9p462
2021/07/15 20:49:32 Loading NVML
2021/07/15 20:49:32 Starting FS watcher.
2021/07/15 20:49:32 Starting OS watcher.
2021/07/15 20:49:32 Retreiving plugins.
2021/07/15 20:49:32 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc4201acec0, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42017ad80, 0xae5a40, 0xc4201a4010, 0xc4201ae000, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42017ad80, 0xc4201ae000, 0x7, 0x7, 0x4567e0, 0xc420219f50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751

The problem is with the resource type on my A100 GPU node.

I am getting

kubectl describe node
...
Capacity:
nvidia.com/gpu: 0
...
Allocatable:
nvidia.com/gpu: 0

...
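For comparison, with migStrategy=single a healthy node is expected to advertise the MIG slices under the plain nvidia.com/gpu resource (e.g. nvidia.com/gpu: 7 for the seven 1g.5gb instances above), while migStrategy=mixed exposes them as nvidia.com/mig-1g.5gb. A quick way to check (node name is a placeholder):

kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'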


anaconda2196 avatar Jul 15 '21 21:07 anaconda2196

With migStrategy=single, I checked both versions, v0.7.0 and v0.9.0.

After upgrading the drivers:

nvidia-smi
Thu Jul 15 16:40:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                   On |
| N/A   56C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64
                                1.4.0-1                        @libnvidia-container
libnvidia-container1.x86_64     1.4.0-1                        @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1                        @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2                        @nvidia-container-runtime
nvidia-docker2.noarch           2.6.0-1                        @nvidia-docker   

nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 1g.5gb       19        7          4:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        8          5:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        9          6:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       11          0:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       12          1:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       13          2:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       14          3:1     |
+----------------------------------------------------+

nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0      7       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      8       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      9       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     11       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     12       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     13       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     14       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+

The gpu-feature-discovery pod runs correctly and applies the correct labels to the A100 GPU node whether migStrategy is single or mixed.

The problem is with the nvidia-device-plugin pod, which is crash-looping.

v0.9.0

kubectl -n kube-system logs nvidia-device-plugin-xgv7t
2021/07/15 23:49:47 Loading NVML
2021/07/15 23:49:47 Starting FS watcher.
2021/07/15 23:49:47 Starting OS watcher.
2021/07/15 23:49:47 Retreiving plugins.
2021/07/15 23:49:47 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc42016eec0, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc4202e2000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc4202e2000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc4201fbf50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751


v0.7.0

kubectl -n kube-system logs nvidia-device-plugin-jl85z
2021/07/15 23:57:37 Loading NVML
2021/07/15 23:57:37 Starting FS watcher.
2021/07/15 23:57:37 Starting OS watcher.
2021/07/15 23:57:37 Retreiving plugins.
2021/07/15 23:57:37 Shutdown of NVML returned: <nil>
panic: No MIG devices present on node

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0xfdbb58, 0x6, 0xa9a700, 0xfdbb58)
	/go/src/nvidia-device-plugin/mig-strategy.go:115 +0x43f
main.main()
	/go/src/nvidia-device-plugin/main.go:103 +0x413

anaconda2196 avatar Jul 16 '21 00:07 anaconda2196

Same here, crashlooping with 0.12.2

panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0

dimm0 avatar Jun 23 '22 02:06 dimm0

Looks like this is a race condition. Having the label nvidia.com/mig.config set on the node in question should trigger the mig-manager, allowing the device plugin to succeed.
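A minimal sketch of what that looks like in practice, assuming the node name and the MIG profile are placeholders and that the mig-manager (deployed via the GPU operator / mig-parted tooling) is watching the label:

# Request a MIG configuration; the mig-manager reconfigures the GPU and
# restarts the device plugin once the node reaches the desired state
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.5gb --overwrite
# Watch the mig-related labels to confirm the manager has acted
kubectl get node <gpu-node> --show-labels | tr ',' '\n' | grep 'nvidia.com/mig'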

abn avatar Aug 24 '22 16:08 abn

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]