k8s-device-plugin
Resource type labelling is incomplete/incorrect
Hi,
I am using nvidia-device-plugin v0.7.0, gpu-feature-discovery v0.4.1, and Kubernetes 1.20.2 on my A100 GPU machine:
nvidia-docker version
NVIDIA Docker: 2.6.0
nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 13 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 5 MIG 2g.10gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 0 1 MIG 3g.20gb 2 0 0:3 |
+--------------------------------------------------------------------+
cat /etc/docker/daemon.json
{
"log-driver":"json-file",
"log-opts": { "max-size" : "10m", "max-file" : "10" }
, "runtimes": { "nvidia": { "path": "/usr\/bin\/nvidia-container-runtime","runtimeArgs": []}}
, "default-runtime" : "nvidia"
}
I was able to successfully deploy the gpu-feature-discovery pods as well as the nvidia-device-plugin pods.
kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-6rdc8 1/1 Running 0 24m
nfd-master-6dd87d999-4xlrw 1/1 Running 0 24m
nfd-worker-rrwms 1/1 Running 0 24m
nvidia-device-plugin-fsq2j 1/1 Running 0 24m
nvidiagpubeat-59jd5 1/1 Running 0 88m
The following labels have been applied for MIG strategy mixed on my A100 GPU node:
kubectl get node GPU_NODE -o yaml
labels:
....
...
nvidia.com/cuda.driver.major: "450"
nvidia.com/cuda.driver.minor: "80"
nvidia.com/cuda.driver.rev: "02"
nvidia.com/cuda.runtime.major: "11"
nvidia.com/cuda.runtime.minor: "0"
nvidia.com/gfd.timestamp: "1626160093"
nvidia.com/gpu.compute.major: "8"
nvidia.com/gpu.compute.minor: "0"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.family: ampere
nvidia.com/gpu.machine: ProLiant-DL380-Gen10
nvidia.com/gpu.memory: "40537"
nvidia.com/gpu.product: A100-PCIE-40GB
nvidia.com/mig-1g.5gb.count: "1"
nvidia.com/mig-1g.5gb.engines.copy: "1"
nvidia.com/mig-1g.5gb.engines.decoder: "0"
nvidia.com/mig-1g.5gb.engines.encoder: "0"
nvidia.com/mig-1g.5gb.engines.jpeg: "0"
nvidia.com/mig-1g.5gb.engines.ofa: "0"
nvidia.com/mig-1g.5gb.memory: "4864"
nvidia.com/mig-1g.5gb.multiprocessors: "14"
nvidia.com/mig-1g.5gb.slices.ci: "1"
nvidia.com/mig-1g.5gb.slices.gi: "1"
nvidia.com/mig-2g.10gb.count: "1"
nvidia.com/mig-2g.10gb.engines.copy: "2"
nvidia.com/mig-2g.10gb.engines.decoder: "1"
nvidia.com/mig-2g.10gb.engines.encoder: "0"
nvidia.com/mig-2g.10gb.engines.jpeg: "0"
nvidia.com/mig-2g.10gb.engines.ofa: "0"
nvidia.com/mig-2g.10gb.memory: "9984"
nvidia.com/mig-2g.10gb.multiprocessors: "28"
nvidia.com/mig-2g.10gb.slices.ci: "2"
nvidia.com/mig-2g.10gb.slices.gi: "2"
nvidia.com/mig-3g.20gb.count: "1"
nvidia.com/mig-3g.20gb.engines.copy: "3"
nvidia.com/mig-3g.20gb.engines.decoder: "2"
nvidia.com/mig-3g.20gb.engines.encoder: "0"
nvidia.com/mig-3g.20gb.engines.jpeg: "0"
nvidia.com/mig-3g.20gb.engines.ofa: "0"
nvidia.com/mig-3g.20gb.memory: "20096"
nvidia.com/mig-3g.20gb.multiprocessors: "42"
nvidia.com/mig-3g.20gb.slices.ci: "3"
nvidia.com/mig-3g.20gb.slices.gi: "3"
nvidia.com/mig.strategy: mixed
...
...
The gpu-feature-discovery pod is working correctly. However, the problem is with the resource types on my A100 GPU node. I would expect to see the following:
kubectl describe node
...
Capacity:
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
...
Allocatable:
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
Instead, I am getting:
kubectl describe node
...
Capacity:
nvidia.com/gpu: 0
...
Allocatable:
nvidia.com/gpu: 0
...
I am also getting the below error when checking the nvidia-device-plugin logs:
kubectl -n kube-system logs nvidia-device-plugin-fsq2j
2021/07/14 22:45:43 Loading NVML
2021/07/14 22:45:43 Starting FS watcher.
2021/07/14 22:45:43 Starting OS watcher.
2021/07/14 22:45:43 Retreiving plugins.
2021/07/14 22:45:43 No devices found. Waiting indefinitely.
I have been going through the docs but am not able to figure out what the issue is. Any help would be much appreciated.
Thank you
Looks like this is similar to the issue in https://github.com/NVIDIA/k8s-device-plugin/issues/192. @klueska
With mig-strategy=single I checked with both versions, v0.7.0 and v0.9.0.
After upgrading the drivers:
nvidia-smi
Thu Jul 15 16:40:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:86:00.0 Off | On |
| N/A 56C P0 35W / 250W | 13MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 14 0 6 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64 1.4.0-1 @libnvidia-container
libnvidia-container1.x86_64 1.4.0-1 @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1 @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2 @nvidia-container-runtime
nvidia-docker2.noarch 2.6.0-1 @nvidia-docker
nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|====================================================|
| 0 MIG 1g.5gb 19 7 4:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 8 5:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 9 6:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 11 0:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 12 1:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 13 2:1 |
+----------------------------------------------------+
| 0 MIG 1g.5gb 19 14 3:1 |
+----------------------------------------------------+
nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 7 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 8 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 9 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 11 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 12 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 13 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 14 MIG 1g.5gb 0 0 0:1 |
+--------------------------------------------------------------------+
The gpu-feature-discovery pod runs correctly and applies the correct labels to the A100 GPU node whether mig-strategy is single or mixed.
The problem is with the nvidia-device-plugin pod, which is in CrashLoopBackOff.
v0.9.0
kubectl -n kube-system logs nvidia-device-plugin-xgv7t
2021/07/15 23:49:47 Loading NVML
2021/07/15 23:49:47 Starting FS watcher.
2021/07/15 23:49:47 Starting OS watcher.
2021/07/15 23:49:47 Retreiving plugins.
2021/07/15 23:49:47 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684
goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc42016eec0, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc4202e2000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc4202e2000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc4201fbf50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751
v0.7.0
kubectl -n kube-system logs nvidia-device-plugin-jl85z
2021/07/15 23:57:37 Loading NVML
2021/07/15 23:57:37 Starting FS watcher.
2021/07/15 23:57:37 Starting OS watcher.
2021/07/15 23:57:37 Retreiving plugins.
2021/07/15 23:57:37 Shutdown of NVML returned: <nil>
panic: No MIG devices present on node
goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0xfdbb58, 0x6, 0xa9a700, 0xfdbb58)
/go/src/nvidia-device-plugin/mig-strategy.go:115 +0x43f
main.main()
/go/src/nvidia-device-plugin/main.go:103 +0x413
Hi @anaconda2196, could you attach the node labels detected by gpu-feature-discovery? To keep it simple, let's consider the mig-strategy=single case.
@anaconda2196 I noted from the nvidia-smi output that you have persistence mode disabled. Would it be possible to see what effect enabling persistence mode has on this?
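For reference, persistence mode can typically be enabled with something like the command below (a sketch; the -i 0 GPU index is an assumption for a single-GPU node):

# Enable persistence mode on GPU 0 (needs root)
sudo nvidia-smi -i 0 -pm 1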
Hi @elezar
I have enabled persistence mode on my A100 GPU node, but no luck; I still get the same error.
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Fri Jul 16 07:13:16 2021
Driver Version : 460.73.01
CUDA Version : 11.2
Attached GPUs : 1
GPU 00000000:86:00.0
Product Name : A100-PCIE-40GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Enabled
Pending : Enabled
MIG Device
Index : 0
GPU Instance ID : 7
Compute Instance ID : 0
Device Attributes
Shared
Multiprocessor count : 14
Copy Engine count : 1
Encoder count : 0
Decoder count : 0
OFA count : 0
JPG count : 0
nvidia-smi
Fri Jul 16 07:27:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:86:00.0 Off | On |
| N/A 47C P0 33W / 250W | 13MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 14 0 6 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
node labels:
node-role.kubernetes.io/worker=
nvidia.com/cuda.driver.major=460
nvidia.com/cuda.driver.minor=73
nvidia.com/cuda.driver.rev=01
nvidia.com/cuda.runtime.major=11
nvidia.com/cuda.runtime.minor=2
nvidia.com/gfd.timestamp=1626445770
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.count=7
nvidia.com/gpu.engines.copy=1
nvidia.com/gpu.engines.decoder=0
nvidia.com/gpu.engines.encoder=0
nvidia.com/gpu.engines.jpeg=0
nvidia.com/gpu.engines.ofa=0
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=ProLiant-DL380-Gen10
nvidia.com/gpu.memory=4864
nvidia.com/gpu.multiprocessors=14
nvidia.com/gpu.product=A100-PCIE-40GB-MIG-1g.5gb
nvidia.com/gpu.slices.ci=1
nvidia.com/gpu.slices.gi=1
nvidia.com/mig.strategy=single
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-mhd6c 1/1 Running 0 6m50s
nfd-master-6dd87d999-xmg7z 1/1 Running 0 6m50s
nfd-worker-qm77s 1/1 Running 0 6m50s
nvidia-device-plugin-sss4g 0/1 CrashLoopBackOff 6 6m53s
$ kubectl -n kube-system logs nvidia-device-plugin-sss4g
2021/07/16 14:35:32 Loading NVML
2021/07/16 14:35:32 Starting FS watcher.
2021/07/16 14:35:32 Starting OS watcher.
2021/07/16 14:35:32 Retreiving plugins.
2021/07/16 14:35:32 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684
goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc4201b4e80, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42018ec00, 0xae5a40, 0xc4201a4010, 0xc4201b6000, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42018ec00, 0xc4201b6000, 0x7, 0x7, 0x4567e0, 0xc420221f50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751
This is very strange behaviour. The underlying code to detect and enumerate the various MIG devices is shared between gpu-feature-discovery and the k8s-device-plugin. We’ve also not had similar reports to this from anyone else running the same versions of everything you’ve listed here.
Can you update the plugin podspec to ignore the failure of the plugin executable itself and instead run a 'sleep forever'?
And once you’ve done that, exec into the container and run nvidia-smi.
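One rough way to do this (a sketch, assuming the daemonset is named nvidia-device-plugin) is to patch the container command so it just sleeps instead of starting the plugin:

kubectl -n kube-system patch daemonset nvidia-device-plugin --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sh", "-c", "sleep infinity"]}]'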
Hi @klueska @elezar, this is weird. Inside the pod, if I run nvidia-smi it shows no MIG devices.
nvidia-device-plugin-qffmb 1/1 Running 0 16s
Abhisheks-MacBook-Pro:~ abhishekacharya$ kubectl -n kube-system logs nvidia-device-plugin-qffmb
Abhisheks-MacBook-Pro:~ abhishekacharya$ kubectl -n kube-system exec -it nvidia-device-plugin-qffmb -- bash
root@nvidia-device-plugin-qffmb:/#
root@nvidia-device-plugin-qffmb:/#
root@nvidia-device-plugin-qffmb:/# nvidia-smi
Fri Jul 16 17:51:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:86:00.0 Off | On |
| N/A 48C P0 33W / 250W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Can you show your podspec?
kubectl -n kube-system describe pod nvidia-device-plugin-qffmb
Name: nvidia-device-plugin-qffmb
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: xxxx--NODE-NAME--xxx
Start Time: Fri, 16 Jul 2021 10:50:53 -0700
Labels: app.kubernetes.io/instance=nvidia-device-plugin
app.kubernetes.io/name=nvidia-device-plugin
controller-revision-hash=d9466b556
pod-template-generation=1
Annotations: cni.projectcalico.org/podIP: 10.192.0.227/32
cni.projectcalico.org/podIPs: 10.192.0.227/32
kubernetes.io/psp: hcp-psp-privileged
scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 10.192.0.227
IPs:
IP: 10.192.0.227
Controlled By: DaemonSet/nvidia-device-plugin
Containers:
nvidia-device-plugin-ctr:
Container ID: docker://4d93a5fe0c44b2452cd0f05beed87503e985e2dbd884735dda66165b4bd3ac71
Image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
Image ID: docker-pullable://nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
Port: <none>
Host Port: <none>
Command:
sh
-c
sleep infinity
Args:
--mig-strategy=single
--pass-device-specs=true
--fail-on-init-error=true
--device-list-strategy=envvar
--device-id-strategy=uuid
--nvidia-driver-root=/
State: Running
Started: Fri, 16 Jul 2021 10:50:57 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-khb4p (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
default-token-khb4p:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-khb4p
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned kube-system/nvidia-device-plugin-qffmb to xxxx--NODE-NAME--xxx
Normal Pulling 24m kubelet Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
Normal Pulled 24m kubelet Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0" in 2.299548038s
Normal Created 24m kubelet Created container nvidia-device-plugin-ctr
Normal Started 24m kubelet Started container nvidia-device-plugin-ctr
On the A100 GPU machine, the MIG devices are already enabled:
nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)
But I have noticed two things when I exec into the pod:
root@nvidia-device-plugin-qffmb:/var/lib/kubelet/device-plugins# ls
DEPRECATION kubelet.sock kubelet_internal_checkpoint
The /proc/driver/nvidia sub-directories are not all mounted inside the container. On my A100 GPU machine:
[root@xxx]# ls /proc/driver/nvidia
capabilities gpus params patches registry suspend suspend_depth version warnings
Inside the container:
root@nvidia-device-plugin-qffmb:/# ls /proc/driver/nvidia
gpus params registry version
If libnvidia-container doesn't think you want access to MIG devices, it won't inject these proc files (or the dev nodes they point to). When it comes to the plugin, this typically happens if you set NVIDIA_VISIBLE_DEVICES to something other than 'all'. That's why I wanted to see your pod spec, but it doesn't seem you've overridden this.
Can you show us the value, just to be sure?
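Something like this should print it (using the pod name from above):

kubectl -n kube-system exec nvidia-device-plugin-qffmb -- printenv NVIDIA_VISIBLE_DEVICES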
values.yaml
legacyDaemonsetAPI: false
compatWithCPUManager: false
migStrategy: none
failOnInitError: true
deviceListStrategy: envvar
deviceIDStrategy: uuid
nvidiaDriverRoot: "/"
nameOverride: ""
fullnameOverride: ""
selectorLabelsOverride: {}
namespace: kube-system
imagePullSecrets: []
image:
  repository: nvcr.io/nvidia/k8s-device-plugin
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""
updateStrategy:
  type: RollingUpdate
podSecurityContext: {}
securityContext: {}
resources: {}
nodeSelector: {}
affinity: {}
tolerations:
  # This toleration is deprecated. Kept here for backward compatibility
  # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
my daemonset.yaml
spec:
  # Mark this pod as a critical add-on; when enabled, the critical add-on
  # scheduler reserves resources for critical add-on pods so that they can
  # be rescheduled after a failure.
  # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
  priorityClassName: "system-node-critical"
  {{- with .Values.imagePullSecrets }}
  imagePullSecrets:
    {{- toYaml . | nindent 8 }}
  {{- end }}
  securityContext:
    {{- toYaml .Values.podSecurityContext | nindent 8 }}
  containers:
  - image: {{ include "nvidia-device-plugin.fullimage" . }}
    imagePullPolicy: {{ .Values.image.pullPolicy }}
    name: nvidia-device-plugin-ctr
    command: ["sh", "-c", "sleep infinity"]
    args:
    - "--mig-strategy={{ .Values.migStrategy }}"
    - "--pass-device-specs={{ .Values.compatWithCPUManager }}"
    - "--fail-on-init-error={{ .Values.failOnInitError }}"
    - "--device-list-strategy={{ .Values.deviceListStrategy }}"
    - "--device-id-strategy={{ .Values.deviceIDStrategy }}"
    - "--nvidia-driver-root={{ .Values.nvidiaDriverRoot }}"
    securityContext:
      {{- if ne (len .Values.securityContext) 0 }}
      {{- toYaml .Values.securityContext | nindent 10 }}
      {{- else if .Values.compatWithCPUManager }}
      privileged: true
      {{- else }}
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      {{- end }}
    volumeMounts:
    - name: device-plugin
      mountPath: /var/lib/kubelet/device-plugins
    {{- with .Values.resources }}
    resources:
      {{- toYaml . | nindent 10 }}
    {{- end }}
  volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins
  {{- with .Values.nodeSelector }}
  nodeSelector:
    {{- toYaml . | nindent 8 }}
  {{- end }}
  {{- with .Values.affinity }}
  affinity:
    {{- toYaml . | nindent 8 }}
  {{- end }}
  {{- with .Values.tolerations }}
  tolerations:
    {{- toYaml . | nindent 8 }}
  {{- end }}
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - MY-GPU-NODE
  containers:
  - args:
    - --mig-strategy=single
    - --pass-device-specs=true
    - --fail-on-init-error=true
    - --device-list-strategy=envvar
    - --device-id-strategy=uuid
    - --nvidia-driver-root=/
    command:
    - sh
    - -c
    - sleep infinity
    image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
    imagePullPolicy: Always
    name: nvidia-device-plugin-ctr
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet/device-plugins
      name: device-plugin
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-khb4p
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: mip-bd-dev80.mip.storage.hpecorp.net
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet/device-plugins
      type: ""
    name: device-plugin
  - name: default-token-khb4p
    secret:
      defaultMode: 420
      secretName: default-token-khb4p
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:53Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:57Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:57Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-07-16T17:50:53Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://4d93a5fe0c44b2452cd0f05beed87503e985e2dbd884735dda66165b4bd3ac71
    image: nvidia/k8s-device-plugin:v0.9.0
    imageID: docker-pullable://nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
    lastState: {}
    name: nvidia-device-plugin-ctr
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-07-16T17:50:57Z"
  hostIP: 16.0.13.204
  phase: Running
  podIP: 10.192.0.227
  podIPs:
  - ip: 10.192.0.227
  qosClass: BestEffort
  startTime: "2021-07-16T17:50:53Z"
kubectl -n kube-system exec -it nvidia-device-plugin-qffmb -- bash
root@nvidia-device-plugin-qffmb:/# echo $NVIDIA_DISABLE_REQUIRE
true
Sorry, I just meant exec into the container and show the value of the NVIDIA_VISIBLE_DEVICES environment variable.
root@nvidia-device-plugin-mr26n:/# echo $NVIDIA_VISIBLE_DEVICES
all
Hi @elezar @klueska Finally, I got the solution: https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_MIG_User_Guide.pdf
"Toggling MIG mode requires the CAP_SYS_ADMIN capability. Other MIG management, such as creating and destroying instances, requires superuser by default, but can be delegated to non privileged users by adjusting permissions to MIG capabilities in /proc/" (page 12).
https://github.com/NVIDIA/nvidia-container-runtime
I set the environment variable NVIDIA_MIG_CONFIG_DEVICES=all and the plugin works now.
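For example, roughly like this on the plugin daemonset (the daemonset name nvidia-device-plugin is taken from the pod's "Controlled By" field above):

kubectl -n kube-system set env daemonset/nvidia-device-plugin NVIDIA_MIG_CONFIG_DEVICES=all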
Thank you so much for your support.
PS - Currently I have only tested with mig-strategy=single; I still need to test with mixed and will update the ticket soon!
Glad to hear it's working now, but you shouldn't need to set this on the plugin. I'm still curious what is different about your setup that requires this.
Yes, there is a little bit of confusion, because I removed my A100 GPU machine from my setup and installed a fresh OS (CentOS 7.9) on it. Then I followed this document: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
Basically, I created the MIG devices (mig-strategy=single) and installed docker, the nvidia driver, and nvidia-container-toolkit as described in the documentation.
yum list installed | grep nvidia
libnvidia-container-tools.x86_64 1.4.0-1 @libnvidia-container
libnvidia-container1.x86_64 1.4.0-1 @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1 @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2 @nvidia-container-runtime
nvidia-docker2.noarch 2.6.0-1 @nvidia-docker
The strange behaviour is that if I simply run:
docker run --rm --gpus=all nvidia/cuda:11.0-base nvidia-smi
Mon Jul 19 18:02:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:86:00.0 Off | On |
| N/A 73C P0 56W / 250W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But when I pass the extra parameters and env variable, it works:
docker run --rm --gpus=all --cap-add SYS_ADMIN -e NVIDIA_MIG_CONFIG_DEVICES="all" nvidia/cuda:11.0-base nvidia-smi
Mon Jul 19 18:03:44 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:86:00.0 Off | On |
| N/A 73C P0 57W / 250W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 3 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 4 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 5 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 6 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
So now it is unclear to me given the provided docs, because in my current setup I have to provide this parameter and env variable to be able to deploy the nvidia-device-plugin pod. Can you check this: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker?
Thank you for your time in advance.
Hi @elezar @klueska
Here is one more issue I am facing. I am able to deploy the nvidia-device-plugin and gpu-feature-discovery pods successfully on the A100 GPU node with mig-strategy=single.
A100 GPU | mig-strategy=single | Centos7 | kubernetes version - 1.20.2 | Docker version 20.10.7
yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64 1.4.0-1 @libnvidia-container
libnvidia-container1.x86_64 1.4.0-1 @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1 @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2 @nvidia-container-runtime
nvidia-docker2.noarch 2.6.0-1 @nvidia-docker
# cat /etc/docker/daemon.json
{
"log-driver":"json-file",
"log-opts": { "max-size" : "10m", "max-file" : "10" }
, "runtimes": { "nvidia": { "path": "/usr\/bin\/nvidia-container-runtime","runtimeArgs": []}}
, "default-runtime" : "nvidia"
}
# nvidia-docker version
NVIDIA Docker: 2.6.0
/usr/bin/nvidia-docker: line 34: /usr/bin/docker: Permission denied
/usr/bin/nvidia-docker: line 34: /usr/bin/docker: Success
# nvidia-smi
Fri Jul 23 09:39:14 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:86:00.0 Off | On |
| N/A 53C P0 35W / 250W | 13MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 14 0 6 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)
kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-single-9mm7x 1/1 Running 0 15h
nfd-master-dd9568c97-lt5th 1/1 Running 0 17h
nfd-worker-5tlhj 1/1 Running 0 15h
nfd-worker-ftr22 1/1 Running 0 17h
nvidia-device-plugin-single-ct2v6 1/1 Running 0 15h
nvidiagpubeat-xrjsv 1/1 Running 0 15h
$ kubectl -n kube-system logs nvidia-device-plugin-single-ct2v6
2021/07/23 01:14:02 Loading NVML
2021/07/23 01:14:02 Starting FS watcher.
2021/07/23 01:14:02 Starting OS watcher.
2021/07/23 01:14:02 Retreiving plugins.
2021/07/23 01:14:02 Starting GRPC server for 'nvidia.com/gpu'
2021/07/23 01:14:02 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/07/23 01:14:02 Registered device plugin for 'nvidia.com/gpu' with Kubelet
Following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html: for the single strategy, point #7 is to deploy 7 pods, each consuming one MIG device (then read their logs and delete them).
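For reference, that loop looks roughly like the privileged one I used further below, but without --privileged and the extra env variables:

for i in $(seq 7); do
  kubectl run --image=nvidia/cuda:11.0-base --restart=Never --limits=nvidia.com/gpu=1 \
    mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"
done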
However, the pods are not detecting the MIG devices:
for i in $(seq 7); do echo "mig-single-example-${i}"; kubectl logs mig-single-example-${i}; echo ""; done
mig-single-example-1
mig-single-example-2
mig-single-example-3
mig-single-example-4
mig-single-example-5
mig-single-example-6
mig-single-example-7
When I do kubectl describe pod, I see:
Error: failed to start container "mig-single-example-1": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0: unknown device: unknown
Can you help me with that? Is there a library or some configuration I am still missing?
However, the interesting fact is that if I pass privileged true and the env variables, then it works, but I don't want to do that.
$for i in $(seq 7); do kubectl run --privileged=true --image=nvidia/cuda:11.0-base --env="NVIDIA_VISIBLE_DEVICES=all" --env="NVIDIA_MIG_CONFIG_DEVICES=all" --restart=Never --limits=nvidia.com/gpu=1 mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"; done
pod/mig-single-example-1 created
pod/mig-single-example-2 created
pod/mig-single-example-3 created
pod/mig-single-example-4 created
pod/mig-single-example-5 created
pod/mig-single-example-6 created
pod/mig-single-example-7 created
$ kubectl -n default get pods
NAME READY STATUS RESTARTS AGE
mig-single-example-1 1/1 Running 0 17s
mig-single-example-2 1/1 Running 0 17s
mig-single-example-3 1/1 Running 0 16s
mig-single-example-4 1/1 Running 0 16s
mig-single-example-5 1/1 Running 0 15s
mig-single-example-6 1/1 Running 0 15s
mig-single-example-7 1/1 Running 0 14s
$ for i in $(seq 7); do
> echo "mig-single-example-${i}";
> kubectl logs mig-single-example-${i}
> echo "";
> done
mig-single-example-1
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)
...
..
......
@klueska @elezar
Okay, it worked for me on SUSE (SLES 15 SP2) but didn't work on CentOS.
I checked and compared config.toml on both OSes; everything is the same except the user = "root:video" line. On SUSE (SLES 15 SP2) the user = "root:video" line in config.toml is uncommented, while on CentOS it is commented out.
So I manually changed config.toml and uncommented user = "root:video". Then it worked for CentOS too.
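As a rough sketch, the change I made on CentOS amounts to something like this (assuming the default config location /etc/nvidia-container-runtime/config.toml and that the line is commented out with a leading #):

# uncomment the user = "root:video" line in the nvidia-container-runtime config
sudo sed -i 's|^#user = "root:video"|user = "root:video"|' /etc/nvidia-container-runtime/config.toml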
Also, I came across these: https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/config/config.toml.centos and https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/config/config.toml.opensuse-leap
Can you help me understand why there is this difference between the two OSes in config.toml? Also, what is the possible solution so that I don't have to change config.toml manually on CentOS?
Thank you; I truly appreciate your advice and help.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.