gpu-operator
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
We are using a GKE cluster (v1.23.8-gke.1900) with NVIDIA multi-instance A100 GPU nodes and want to install the NVIDIA GPU Operator on this cluster.
The default container runtime in our case is containerd, so we changed the default runtime to nvidia by adding the config below to /etc/containerd/config.toml and restarting the containerd service:
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
Then we restarted the containerd daemon:
systemctl restart containerd
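Before deploying the operator, it is worth confirming that containerd actually picked up the new runtime handler. A minimal sanity check, assuming crictl is installed on the node (on GKE this means SSHing to the node first):

# Does the merged containerd config contain the nvidia runtime entry?
sudo containerd config dump | grep -A 3 'runtimes.nvidia'
# Does the CRI endpoint report a "nvidia" runtime handler?
sudo crictl info | grep -A 5 '"runtimes"'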
We followed the steps below to deploy the GPU Operator. Note: the NVIDIA drivers and device plugin are already present by default in the kube-system namespace.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update

helm install gpu-operator nvidia/gpu-operator --set operator.defaultRuntime=containerd --namespace kube-system --create-namespace --set driver.enabled=false --set mig.strategy=single
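For reference, the same options can be kept in a values file instead of repeated --set flags (a sketch; the keys simply mirror the flags above, and the file name is arbitrary):

cat > gpu-operator-values.yaml <<'EOF'
operator:
  defaultRuntime: containerd
driver:
  enabled: false
mig:
  strategy: single
EOF
helm install gpu-operator nvidia/gpu-operator --namespace kube-system --create-namespace -f gpu-operator-values.yaml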
Now when I run kubectl get all -n kube-system, none of the pods listed below come up:
pod/gpu-feature-discovery-5lxwl 0/1 Init:0/1 0 85m
pod/gpu-feature-discovery-6cjmm 0/1 Init:0/1 0 85m
pod/gpu-feature-discovery-bc5c9 0/1 Init:0/1 0 85m
pod/gpu-feature-discovery-chhng 0/1 Init:0/1 0 85m
pod/nvidia-container-toolkit-daemonset-47g2p 0/1 Init:0/1 0 85m
pod/nvidia-container-toolkit-daemonset-r6s67 0/1 Init:0/1 0 85m
pod/nvidia-container-toolkit-daemonset-svksk 0/1 Init:0/1 0 85m
pod/nvidia-container-toolkit-daemonset-z5m4r 0/1 Init:0/1 0 85m
pod/nvidia-dcgm-exporter-49t58 0/1 Init:0/1 0 85m
pod/nvidia-dcgm-exporter-k6wbg 0/1 Init:0/1 0 85m
pod/nvidia-dcgm-exporter-m8jrq 0/1 Init:0/1 0 85m
pod/nvidia-dcgm-exporter-p5tl9 0/1 Init:0/1 0 85m
pod/nvidia-device-plugin-daemonset-2jnnn 0/1 Init:0/1 0 85m
pod/nvidia-device-plugin-daemonset-hwmlp 0/1 Init:0/1 0 85m
pod/nvidia-device-plugin-daemonset-qnv6n 0/1 Init:0/1 0 85m
pod/nvidia-device-plugin-daemonset-zxvgh 0/1 Init:0/1 0 85m
When I describe one of the dcgm pods with kubectl describe pod nvidia-dcgm-exporter-49t58 -n kube-system, it shows the following error:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 4m22s (x394 over 89m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Please suggest whether we are missing something and how we can resolve this issue quickly.
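For context, that error is raised by containerd's CRI plugin when a pod requests a RuntimeClass whose handler is not declared in the containerd config. Two quick checks that narrow it down (a sketch, assuming kubectl access and SSH to the affected node):

# Which handler does the RuntimeClass point at?
kubectl get runtimeclass nvidia -o yaml
# Which RuntimeClass is the failing pod requesting?
kubectl get pod nvidia-dcgm-exporter-49t58 -n kube-system -o jsonpath='{.spec.runtimeClassName}'
# On the node: does the running containerd know about that handler?
sudo crictl info | grep -i nvidia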
@searce-aditya I see from the list of pods that you are deploying the nvidia-container-toolkit-daemonset; however, your comments seem to indicate that the driver and the NVIDIA Container Toolkit are installed on the host.
Could you add --set toolkit.enabled=false to the options you use when deploying the operator?
Hello @elezar, we tried adding --set toolkit.enabled=false while deploying the operator and are still facing the same issue.
Is the driver ready on the node? Can you share the output of nvidia-smi from the node?
I have a different setup but am facing exactly the same error. I tried chart versions 1.11.1 and 22.9.0 with the same results:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set version=22.9.0 --set driver.enabled=false --set operator.defaultRuntime=containerd
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:05:00.0 On | N/A |
| 0% 35C P8 13W / 170W | 101MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1577 G /usr/lib/xorg/Xorg 70MiB |
| 0 N/A N/A 1762 G /usr/bin/gnome-shell 28MiB |
+-----------------------------------------------------------------------------+
I tried with 22.9.0 following the [Platform Support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html) matrix
I'm running into the same problem on an NVIDIA DGX Station A100 running microk8s 1.25.2, following the process outlined in the GPU Operator manual for DGX systems.
The container environment works as expected in Docker.
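For reference, a Docker-side smoke test along these lines (the image tag is only an example) exercises the host driver and runtime hook independently of Kubernetes:

docker run --rm --runtime=nvidia nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi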
I'm experiencing the same issue running K8s v1.21.14 and containerd. I have tried all the suggestions in this issue and it has not been resolved.
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
This is not an error and is expected until the drivers are loaded and nvidia-container-runtime is set up. With driver containers it takes about 3-4 minutes to set up and load the drivers and toolkit, after which these errors should go away. The reason these errors appear is that we run some operands (device-plugin, gfd, dcgm, etc.) with runtimeClass set to nvidia, and this causes the above error until the drivers/toolkit are ready.
Please ignore these errors and let us know the state of all pods running with the GPU Operator. If any pod is still in the init phase, we need to debug that component. Make sure to understand the caveats with pre-installed drivers and container toolkit in some cases, as described here.
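In other words, the operands are scheduled with runtimeClassName: nvidia, and the RuntimeClass the operator creates maps that name to a containerd handler of the same name, roughly like this (a sketch for illustration, not the operator's exact manifest):

kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match a [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.<name>] entry in the containerd config
EOF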
It's definitely not a timing issue; these pods have been in this state for quite a while.
kube-system calico-node-lkd9l 1/1 Running 0 2d23h
kube-system calico-kube-controllers-97b47d84d-2kghb 1/1 Running 0 2d23h
kube-system coredns-d489fb88-wnxnh 1/1 Running 0 2d23h
kube-system dashboard-metrics-scraper-64bcc67c9c-thq65 1/1 Running 0 2d23h
kube-system kubernetes-dashboard-74b66d7f9c-lkjcg 1/1 Running 0 2d23h
kube-system hostpath-provisioner-85ccc46f96-sqgnl 1/1 Running 0 2d23h
kube-system metrics-server-6b6844c455-w56rj 1/1 Running 0 2d23h
gpu-operator gpu-operator-1666083289-node-feature-discovery-worker-gq9fw 1/1 Running 0 2d23h
gpu-operator gpu-operator-5dc6b8989b-dflzh 1/1 Running 0 2d23h
gpu-operator gpu-operator-1666083289-node-feature-discovery-master-6f49r85q7 1/1 Running 0 2d23h
gpu-operator nvidia-operator-validator-z8bvv 0/1 Init:0/4 0 2d23h
gpu-operator nvidia-device-plugin-daemonset-c84bd 0/1 Init:0/1 0 2d23h
gpu-operator nvidia-dcgm-exporter-fn2cs 0/1 Init:0/1 0 2d23h
gpu-operator gpu-feature-discovery-4px4v 0/1 Init:0/1 0 2d23h
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:01:00.0 Off | 0 |
| N/A 41C P0 56W / 275W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:47:00.0 Off | 0 |
| N/A 41C P0 56W / 275W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:81:00.0 Off | 0 |
| N/A 41C P0 58W / 275W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display On | 00000000:C1:00.0 Off | N/A |
| 34% 43C P8 N/A / 50W | 1MiB / 3911MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C2:00.0 Off | 0 |
| N/A 42C P0 63W / 275W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The containerd and Docker configs have been set up as instructed in the documentation. Initially I tried to install the operator via the microk8s GPU add-on but ran into issues. Following the recommended Helm 3 installation, all expected pods were created, but some of them got stuck in the init phase.
Docker config:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Containerd:
#temp disabled
#disabled_plugins = ["cri"]
#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0
#[grpc]
# address = "/run/containerd/containerd.sock"
# uid = 0
# gid = 0
#[debug]
# address = "/run/containerd/debug.sock"
# uid = 0
# gid = 0
# level = "info"
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
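One thing worth double-checking on microk8s specifically (an assumption on my side, not something confirmed in this thread): the kubelet may be talking to the containerd instance bundled with the snap rather than the system containerd edited above, in which case the nvidia runtime has to be declared in whichever config that instance actually loads. A rough way to see which containerd and which config are in play (socket paths are assumptions):

# Which containerd processes are running, and with which config?
ps -ef | grep [c]ontainerd
# System containerd socket vs. the snap's bundled containerd socket
ls -l /run/containerd/containerd.sock /var/snap/microk8s/common/run/containerd.sock 2>/dev/null
# Ask the instance the kubelet uses whether it knows the "nvidia" handler
sudo crictl info | grep -i nvidia

If the handler only exists in one of those configs, that would explain why the RuntimeClass lookup fails even though plain Docker works.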
More details on the feature-discovery pod:
Name: gpu-feature-discovery-4px4v
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-gpu-feature-discovery
Node: dgxadmin-dgx-station-a100-920-23487-2530-000/10.36.40.65
Start Time: Tue, 18 Oct 2022 10:55:23 +0200
Labels: app=gpu-feature-discovery
app.kubernetes.io/part-of=nvidia-gpu
controller-revision-hash=5ffb7c7b8b
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/gpu-feature-discovery
Init Containers:
toolkit-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wqqm (ro)
Containers:
gpu-feature-discovery:
Container ID:
Image: nvcr.io/nvidia/gpu-feature-discovery:v0.6.2-ubi8
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
GFD_SLEEP_INTERVAL: 60s
GFD_FAIL_ON_INIT_ERROR: true
GFD_MIG_STRATEGY: single
NVIDIA_MIG_MONITOR_DEVICES: all
Mounts:
/etc/kubernetes/node-feature-discovery/features.d from output-dir (rw)
/sys/class/dmi/id/product_name from dmi-product-name (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wqqm (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
output-dir:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/node-feature-discovery/features.d
HostPathType:
dmi-product-name:
Type: HostPath (bare host directory volume)
Path: /sys/class/dmi/id/product_name
HostPathType:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: Directory
kube-api-access-5wqqm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.gpu-feature-discovery=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 4m5s (x19852 over 2d23h) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Logs of the feature-discovery container are unavailable due to the init status.
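Since the sandbox never starts, there is nothing to read from the containers themselves, but the marker directory that the toolkit-validation init container polls (per the pod spec above) can be checked directly on the host:

# On the node: the init container waits for files under this directory to appear
ls -l /run/nvidia/validations/ 2>/dev/null || echo "no validation markers present yet"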
@LukasIAO Can you add the following debug options under /etc/nvidia-container-runtime/config.toml? Can you also confirm that root is set to / here? With this, if you restart any of the operand pods (operator-validator, for example), we should see logs in /var/log/nvidia-container-runtime.log and /var/log/nvidia-container-cli.log. That will help us confirm whether the runtime hook is invoked by docker or containerd.
disable-require = false
[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/"
[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
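To regenerate the logs after changing the config, deleting one of the operand pods so its DaemonSet recreates it is usually enough (a sketch using the labels shown elsewhere in this thread):

kubectl delete pod -n gpu-operator -l app=nvidia-operator-validator
# then, on the node, watch for the hook/runtime being invoked
sudo tail -f /var/log/nvidia-container-runtime.log /var/log/nvidia-container-cli.log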
Hi @shivamerla, thank you very much for your help. I've added/enabled the recommended lines in the config. A few variables differed in my installation; I left those at their default values. I assume they are configured correctly, since DGX OS ships with the runtime already installed.
Current config:
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
"docker-runc",
"runc",
]
mode = "auto"
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
Updated config:
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#temp change to '/'
root = "/"
#disabled by default
path = "/usr/bin/nvidia-container-cli"
environment = []
#disabled by default
debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"
[nvidia-container-runtime]
#disabled by default
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
"docker-runc",
"runc",
]
mode = "auto"
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
I've restarted the daemonset pod, which is still running into the same issue.
Name: nvidia-device-plugin-daemonset-tltdv
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-device-plugin
Node: dgxadmin-dgx-station-a100-920-23487-2530-000/10.36.40.65
Start Time: Mon, 24 Oct 2022 10:57:01 +0200
Labels: app=nvidia-device-plugin-daemonset
controller-revision-hash=784ffbf4f9
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
toolkit-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2dpht (ro)
Containers:
nvidia-device-plugin:
Container ID:
Image: nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8
Image ID:
Port: <none>
Host Port: <none>
Command:
bash
-c
Args:
[[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
PASS_DEVICE_SPECS: true
FAIL_ON_INIT_ERROR: true
DEVICE_LIST_STRATEGY: envvar
DEVICE_ID_STRATEGY: uuid
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
MIG_STRATEGY: single
NVIDIA_MIG_MONITOR_DEVICES: all
Mounts:
/run/nvidia from run-nvidia (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2dpht (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: Directory
kube-api-access-2dpht:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.device-plugin=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 4m24s (x26 over 9m50s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Restarting the pod unfortunately did not create any nvidia-container-runtime or cli logs. However, here are the logs for the gpu-operator as well as the gpu-operator namespace; perhaps they can be of some help.
/var/log/gpu-operator.log:
log_file: /var/log/gpu-manager.log
last_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
new_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
can't access /opt/amdgpu-pro/bin/amdgpu-pro-px
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-515srv
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-515
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-510srv
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-510
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-470srv
Found nvidia.ko module in /lib/modules/5.4.0-125-generic/kernel/nvidia-470srv/nvidia.ko
Looking for amdgpu modules in /lib/modules/5.4.0-125-generic/kernel
Looking for amdgpu modules in /lib/modules/5.4.0-125-generic/updates/dkms
Is nvidia loaded? yes
Was nvidia unloaded? no
Is nvidia blacklisted? no
Is intel loaded? no
Is radeon loaded? no
Is radeon blacklisted? no
Is amdgpu loaded? no
Is amdgpu blacklisted? no
Is amdgpu versioned? no
Is amdgpu pro stack? no
Is nouveau loaded? no
Is nouveau blacklisted? yes
Is nvidia kernel module available? yes
Is amdgpu kernel module available? no
Vendor/Device Id: 1a03:2000
BusID "PCI:70@0:0:0"
Is boot vga? yes
Vendor/Device Id: 10de:20b0
BusID "PCI:129@0:0:0"
can't open /sys/bus/pci/devices/0000:81:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:81:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:81:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for "0x20b0"? no
Vendor/Device Id: 10de:20b0
BusID "PCI:71@0:0:0"
can't open /sys/bus/pci/devices/0000:47:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:47:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:47:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for "0x20b0"? no
Vendor/Device Id: 10de:1fb0
BusID "PCI:193@0:0:0"
Is boot vga? no
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x1fb0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:c1:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for "0x1fb0"? no
Vendor/Device Id: 10de:20b0
BusID "PCI:1@0:0:0"
can't open /sys/bus/pci/devices/0000:01:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:01:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status: Disabled by default
Is nvidia runtime pm enabled for "0x20b0"? no
Vendor/Device Id: 10de:20b0
BusID "PCI:194@0:0:0"
can't open /sys/bus/pci/devices/0000:c2:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:c2:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:c2:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for "0x20b0"? no
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Does it require offloading? no
last cards number = 6
Has amd? no
Has intel? no
Has nvidia? yes
How many cards? 6
Has the system changed? No
Takes 0ms to wait for nvidia udev rules completed.
Unsupported discrete card vendor: 10de
Nothing to do
/var/log/containers/gpu-operator*.logs:
2022-10-24T11:19:44.889924275+02:00 stderr F 1.666603184889807e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-node-status-exporter", "status": "disabled"}
2022-10-24T11:19:44.897367095+02:00 stderr F 1.666603184897253e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-vgpu-manager", "status": "disabled"}
2022-10-24T11:19:44.903645824+02:00 stderr F 1.6666031849035316e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-vgpu-device-manager", "status": "disabled"}
2022-10-24T11:19:44.911491405+02:00 stderr F 1.6666031849113731e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-sandbox-validation", "status": "disabled"}
2022-10-24T11:19:44.920587781+02:00 stderr F 1.6666031849204967e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-vfio-manager", "status": "disabled"}
2022-10-24T11:19:44.928322513+02:00 stderr F 1.6666031849282417e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-sandbox-device-plugin", "status": "disabled"}
2022-10-24T11:19:44.928330588+02:00 stderr F 1.666603184928259e+09 INFO controllers.ClusterPolicy ClusterPolicy isn't ready {"states not ready": ["state-operator-validation", "state-device-plugin", "state-dcgm-exporter", "gpu-feature-discovery"]}
2022-10-24T11:19:49.929276951+02:00 stderr F 1.6666031899291391e+09 INFO controllers.ClusterPolicy Sandbox workloads {"Enabled": false, "DefaultWorkload": "container"}
2022-10-24T11:19:49.929296869+02:00 stderr F 1.666603189929195e+09 INFO controllers.ClusterPolicy GPU workload configuration{"NodeName": "dgxadmin-dgx-station-a100-920-23487-2530-000", "GpuWorkloadConfig": "container"}
2022-10-24T11:19:49.929301377+02:00 stderr F 1.6666031899292035e+09 INFO controllers.ClusterPolicy Checking GPU state labels on the node {"NodeName": "dgxadmin-dgx-station-a100-920-23487-2530-000"}
2022-10-24T11:19:49.929305134+02:00 stderr F 1.6666031899292104e+09 INFO controllers.ClusterPolicy Number of nodes with GPU label {"NodeCount": 1}
2022-10-24T11:19:49.929308751+02:00 stderr F 1.6666031899292326e+09 INFO controllers.ClusterPolicy Using container runtime: containerd
2022-10-24T11:19:49.929341132+02:00 stderr F 1.6666031899292479e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RuntimeClass": "nvidia"}
2022-10-24T11:19:49.932388012+02:00 stderr F 1.6666031899322655e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "pre-requisites", "status": "ready"}
2022-10-24T11:19:49.932422487+02:00 stderr F 1.666603189932333e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Service": "gpu-operator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.935075583+02:00 stderr F 1.666603189934942e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-operator-metrics", "status": "ready"}
2022-10-24T11:19:49.944488527+02:00 stderr F 1.6666031899443653e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-driver", "status": "disabled"}
2022-10-24T11:19:49.950519708+02:00 stderr F 1.6666031899504201e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-container-toolkit", "status": "disabled"}
2022-10-24T11:19:49.952929514+02:00 stderr F 1.6666031899528463e+09 INFO controllers.ClusterPolicy Found Resource, skipping update {"ServiceAccount": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.955563323+02:00 stderr F 1.666603189955487e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Role": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.959463015+02:00 stderr F 1.6666031899593854e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRole": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.963319546+02:00 stderr F 1.6666031899632366e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RoleBinding": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.967390461+02:00 stderr F 1.6666031899673078e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRoleBinding": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.969474571+02:00 stderr F 1.666603189969397e+09 INFO controllers.ClusterPolicy DaemonSet identical, skipping update {"DaemonSet": "nvidia-operator-validator", "Namespace": "gpu-operator", "name": "nvidia-operator-validator"}
2022-10-24T11:19:49.969479841+02:00 stderr F 1.6666031899694111e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-operator-validator"}
2022-10-24T11:19:49.969482196+02:00 stderr F 1.6666031899694364e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-10-24T11:19:49.969484289+02:00 stderr F 1.6666031899694402e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 1}
2022-10-24T11:19:49.969486083+02:00 stderr F 1.6666031899694436e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-operator-validation", "status": "notReady"}
2022-10-24T11:19:49.971824964+02:00 stderr F 1.6666031899717581e+09 INFO controllers.ClusterPolicy Found Resource, skipping update {"ServiceAccount": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.974246151+02:00 stderr F 1.666603189974167e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Role": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.97832913+02:00 stderr F 1.6666031899782453e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRole": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.982385448+02:00 stderr F 1.666603189982306e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RoleBinding": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.98709994+02:00 stderr F 1.6666031899865522e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRoleBinding": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.989852654+02:00 stderr F 1.666603189989776e+09 INFO controllers.ClusterPolicy DaemonSet identical, skipping update {"DaemonSet": "nvidia-device-plugin-daemonset", "Namespace": "gpu-operator", "name": "nvidia-device-plugin-daemonset"}
2022-10-24T11:19:49.989857924+02:00 stderr F 1.6666031899898036e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-device-plugin-daemonset"}
2022-10-24T11:19:49.989889704+02:00 stderr F 1.6666031899898317e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-10-24T11:19:49.989891998+02:00 stderr F 1.6666031899898388e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 1}
2022-10-24T11:19:49.989894192+02:00 stderr F 1.666603189989843e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-device-plugin", "status": "notReady"}
2022-10-24T11:19:49.99555849+02:00 stderr F 1.6666031899954753e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-dcgm", "status": "disabled"}
2022-10-24T11:19:49.998001799+02:00 stderr F 1.6666031899979186e+09 INFO controllers.ClusterPolicy Found Resource, skipping update {"ServiceAccount": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.000277611+02:00 stderr F 1.666603190000206e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Role": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.004443246+02:00 stderr F 1.666603190004365e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RoleBinding": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.006322138+02:00 stderr F 1.666603190006245e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Service": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.008463596+02:00 stderr F 1.6666031900083907e+09 INFO controllers.ClusterPolicy DaemonSet identical, skipping update {"DaemonSet": "nvidia-dcgm-exporter", "Namespace": "gpu-operator", "name": "nvidia-dcgm-exporter"}
2022-10-24T11:19:50.00847625+02:00 stderr F 1.666603190008402e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-dcgm-exporter"}
2022-10-24T11:19:50.008479637+02:00 stderr F 1.6666031900084198e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-10-24T11:19:50.00848162+02:00 stderr F 1.6666031900084226e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 1}
2022-10-24T11:19:50.008484415+02:00 stderr F 1.6666031900084267e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-dcgm-exporter", "status": "notReady"}
2022-10-24T11:19:50.01115833+02:00 stderr F 1.6666031900110757e+09 INFO controllers.ClusterPolicy Found Resource, skipping update {"ServiceAccount": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.013800435+02:00 stderr F 1.6666031900137217e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Role": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.017510739+02:00 stderr F 1.6666031900174518e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRole": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.021401925+02:00 stderr F 1.666603190021325e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.025175158+02:00 stderr F 1.6666031900250971e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.027109224+02:00 stderr F 1.6666031900270455e+09 INFO controllers.ClusterPolicy DaemonSet identical, skipping update {"DaemonSet": "gpu-feature-discovery", "Namespace": "gpu-operator", "name": "gpu-feature-discovery"}
2022-10-24T11:19:50.027116348+02:00 stderr F 1.66660319002706e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=gpu-feature-discovery"}
2022-10-24T11:19:50.027118752+02:00 stderr F 1.6666031900270865e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-10-24T11:19:50.027121447+02:00 stderr F 1.666603190027092e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 1}
2022-10-24T11:19:50.027129413+02:00 stderr F 1.6666031900270965e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "gpu-feature-discovery", "status": "notReady"}
2022-10-24T11:19:50.029697307+02:00 stderr F 1.6666031900296195e+09 INFO controllers.ClusterPolicy Found Resource, skipping update {"ServiceAccount": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.031903999+02:00 stderr F 1.6666031900318484e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Role": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.035654919+02:00 stderr F 1.6666031900355768e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRole": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.039583005+02:00 stderr F 1.6666031900394998e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RoleBinding": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.043497384+02:00 stderr F 1.6666031900434043e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRoleBinding": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.047675753+02:00 stderr F 1.666603190047592e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ConfigMap": "default-mig-parted-config", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.051867748+02:00 stderr F 1.6666031900517852e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ConfigMap": "default-gpu-clients", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.053732303+02:00 stderr F 1.666603190053671e+09 INFO controllers.ClusterPolicy DaemonSet identical, skipping update {"DaemonSet": "nvidia-mig-manager", "Namespace": "gpu-operator", "name": "nvidia-mig-manager"}
2022-10-24T11:19:50.053737482+02:00 stderr F 1.666603190053683e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"LabelSelector": "app=nvidia-mig-manager"}
2022-10-24T11:19:50.053739546+02:00 stderr F 1.6666031900537012e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberOfDaemonSets": 1}
2022-10-24T11:19:50.05374156+02:00 stderr F 1.6666031900537043e+09 INFO controllers.ClusterPolicy DEBUG: DaemonSet {"NumberUnavailable": 0}
2022-10-24T11:19:50.053743965+02:00 stderr F 1.666603190053708e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-mig-manager", "status": "ready"}
2022-10-24T11:19:50.062922406+02:00 stderr F 1.666603190062841e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-node-status-exporter", "status": "disabled"}
2022-10-24T11:19:50.070873386+02:00 stderr F 1.6666031900707908e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-vgpu-manager", "status": "disabled"}
2022-10-24T11:19:50.077472792+02:00 stderr F 1.6666031900773983e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-vgpu-device-manager", "status": "disabled"}
2022-10-24T11:19:50.084970015+02:00 stderr F 1.6666031900848877e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-sandbox-validation", "status": "disabled"}
2022-10-24T11:19:50.093761595+02:00 stderr F 1.6666031900936785e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-vfio-manager", "status": "disabled"}
2022-10-24T11:19:50.101456491+02:00 stderr F 1.6666031901013722e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-sandbox-device-plugin", "status": "disabled"}
2022-10-24T11:19:50.101463524+02:00 stderr F 1.6666031901013896e+09 INFO controllers.ClusterPolicy ClusterPolicy isn't ready {"states not ready": ["state-operator-validation", "state-device-plugin", "state-dcgm-exporter", "gpu-feature-discovery"]}
ClusterPolicy isn't ready {"states not ready": ["state-operator-validation", "state-device-plugin", "state-dcgm-exporter", "gpu-feature-discovery"]}
Does this tell us anything useful?
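As a side note, the same per-component state is also reflected on the ClusterPolicy resource itself, which can be easier to scan than the raw container log (a sketch; the status layout may differ between operator versions):

kubectl get clusterpolicy
kubectl describe clusterpolicy | grep -A 20 'Status'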
@shivamerla Turns out an unrelated system reboot did lead to the creation of the /var/log/nvidia-container-runtime.log file.
.
.
.
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/ebb013110bd0849f6ead70f5251c0d607cdfae181bca8f6f7ed05adc33c3af14/config.json","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:16+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:16+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/84f417760864250d40425109d07c6d03dddecf355a022f2001057726782c049a/config.json","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:59+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:59+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/146ba8cd3689b35d81c18c33bce37f687f3174f947f2a59bba2669437b355172/config.json","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:36+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:36+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/a46c815e945c9c29ec298bb247e3cbb6845dcaddf9c5463ec51b58c3cbbe5d3f/config.json","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:28+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:28+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/a504d55e7a9f34970a6c5eb0f8a69e15aa64fa7e1b75b1da0556cff8fa1da4fe/config.json","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:58+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:58+01:00"}
The /var/log/nvidia-container-cli.log file is still missing, however. I hope it's still helpful.
@LukasIAO the other log file of interest may be /var/log/nvidia-container-toolkit.log.
Hi @elezar, thank you for taking the time.
Here is the toolkit.log:
I1109 19:36:56.714689 2558903 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I1109 19:36:56.714724 2558903 nvc.c:350] using root /
I1109 19:36:56.714729 2558903 nvc.c:351] using ldcache /etc/ld.so.cache
I1109 19:36:56.714734 2558903 nvc.c:352] using unprivileged user 65534:65534
I1109 19:36:56.714774 2558903 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1109 19:36:56.714846 2558903 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1109 19:36:56.716665 2558909 nvc.c:278] loading kernel module nvidia
I1109 19:36:56.716824 2558909 nvc.c:282] running mknod for /dev/nvidiactl
I1109 19:36:56.716851 2558909 nvc.c:286] running mknod for /dev/nvidia0
I1109 19:36:56.716867 2558909 nvc.c:286] running mknod for /dev/nvidia1
I1109 19:36:56.716882 2558909 nvc.c:286] running mknod for /dev/nvidia2
I1109 19:36:56.716896 2558909 nvc.c:286] running mknod for /dev/nvidia3
I1109 19:36:56.716911 2558909 nvc.c:286] running mknod for /dev/nvidia4
I1109 19:36:56.716926 2558909 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1109 19:36:56.722348 2558909 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I1109 19:36:56.722439 2558909 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I1109 19:36:56.725132 2558909 nvc.c:296] loading kernel module nvidia_uvm
I1109 19:36:56.725192 2558909 nvc.c:300] running mknod for /dev/nvidia-uvm
I1109 19:36:56.725255 2558909 nvc.c:305] loading kernel module nvidia_modeset
I1109 19:36:56.725310 2558909 nvc.c:309] running mknod for /dev/nvidia-modeset
I1109 19:36:56.725496 2558910 rpc.c:71] starting driver rpc service
I1109 19:36:56.729632 2558911 rpc.c:71] starting nvcgo rpc service
I1109 19:36:56.730197 2558903 nvc_container.c:240] configuring container with 'compute utility video supervised'
I1109 19:36:56.731494 2558903 nvc_container.c:262] setting pid to 2558855
I1109 19:36:56.731502 2558903 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged
I1109 19:36:56.731507 2558903 nvc_container.c:264] setting owner to 0:0
I1109 19:36:56.731512 2558903 nvc_container.c:265] setting bins directory to /usr/bin
I1109 19:36:56.731517 2558903 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I1109 19:36:56.731521 2558903 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I1109 19:36:56.731527 2558903 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I1109 19:36:56.731533 2558903 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I1109 19:36:56.731538 2558903 nvc_container.c:270] setting mount namespace to /proc/2558855/ns/mnt
I1109 19:36:56.731542 2558903 nvc_container.c:272] detected cgroupv1
I1109 19:36:56.731547 2558903 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/devices/docker/a504d55e7a9f34970a6c5eb0f8a69e15aa64fa7e1b75b1da0556cff8fa1da4fe
I1109 19:36:56.731553 2558903 nvc_info.c:766] requesting driver information with ''
I1109 19:36:56.732442 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.141.03
I1109 19:36:56.732486 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.141.03
I1109 19:36:56.732509 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.141.03
I1109 19:36:56.732534 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I1109 19:36:56.732570 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03
I1109 19:36:56.732607 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03
I1109 19:36:56.732635 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.470.141.03
I1109 19:36:56.732667 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.141.03
I1109 19:36:56.732689 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03
I1109 19:36:56.732720 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.141.03
I1109 19:36:56.732751 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.141.03
I1109 19:36:56.732795 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.141.03
I1109 19:36:56.732816 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.141.03
I1109 19:36:56.732838 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.141.03
I1109 19:36:56.732869 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03
I1109 19:36:56.732900 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.141.03
I1109 19:36:56.732921 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03
I1109 19:36:56.732944 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03
I1109 19:36:56.732972 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.141.03
I1109 19:36:56.732992 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03
I1109 19:36:56.733023 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03
I1109 19:36:56.733210 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
I1109 19:36:56.733292 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.141.03
I1109 19:36:56.733314 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.141.03
I1109 19:36:56.733335 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03
I1109 19:36:56.733359 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.141.03
W1109 19:36:56.733390 2558903 nvc_info.c:399] missing library libcudadebugger.so
W1109 19:36:56.733395 2558903 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1109 19:36:56.733400 2558903 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1109 19:36:56.733405 2558903 nvc_info.c:399] missing library libvdpau_nvidia.so
W1109 19:36:56.733409 2558903 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W1109 19:36:56.733414 2558903 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1109 19:36:56.733419 2558903 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1109 19:36:56.733423 2558903 nvc_info.c:403] missing compat32 library libcuda.so
W1109 19:36:56.733428 2558903 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1109 19:36:56.733433 2558903 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W1109 19:36:56.733437 2558903 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W1109 19:36:56.733442 2558903 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1109 19:36:56.733446 2558903 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1109 19:36:56.733451 2558903 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W1109 19:36:56.733456 2558903 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1109 19:36:56.733460 2558903 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1109 19:36:56.733465 2558903 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1109 19:36:56.733469 2558903 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W1109 19:36:56.733474 2558903 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W1109 19:36:56.733479 2558903 nvc_info.c:403] missing compat32 library libnvcuvid.so
W1109 19:36:56.733483 2558903 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W1109 19:36:56.733488 2558903 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W1109 19:36:56.733493 2558903 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W1109 19:36:56.733497 2558903 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W1109 19:36:56.733502 2558903 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W1109 19:36:56.733506 2558903 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1109 19:36:56.733511 2558903 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1109 19:36:56.733523 2558903 nvc_info.c:403] missing compat32 library libnvoptix.so
W1109 19:36:56.733527 2558903 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W1109 19:36:56.733532 2558903 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W1109 19:36:56.733537 2558903 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W1109 19:36:56.733541 2558903 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W1109 19:36:56.733546 2558903 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W1109 19:36:56.733550 2558903 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1109 19:36:56.733745 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1109 19:36:56.733758 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1109 19:36:56.733770 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1109 19:36:56.733788 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1109 19:36:56.733801 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1109 19:36:56.733830 2558903 nvc_info.c:425] missing binary nv-fabricmanager
I1109 19:36:56.733849 2558903 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/470.141.03/gsp.bin
I1109 19:36:56.733864 2558903 nvc_info.c:529] listing device /dev/nvidiactl
I1109 19:36:56.733869 2558903 nvc_info.c:529] listing device /dev/nvidia-uvm
I1109 19:36:56.733874 2558903 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1109 19:36:56.733879 2558903 nvc_info.c:529] listing device /dev/nvidia-modeset
I1109 19:36:56.733896 2558903 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W1109 19:36:56.733911 2558903 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1109 19:36:56.733921 2558903 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1109 19:36:56.733926 2558903 nvc_info.c:822] requesting device information with ''
I1109 19:36:56.740598 2558903 nvc_info.c:713] listing device /dev/nvidia4 (GPU-f03eaa43-86e1-e9e9-9611-a7ac07919c57 at 00000000:01:00.0)
I1109 19:36:56.747015 2558903 nvc_info.c:713] listing device /dev/nvidia3 (GPU-07f52695-96d8-e750-2a03-fd133cfc332e at 00000000:47:00.0)
I1109 19:36:56.753315 2558903 nvc_info.c:713] listing device /dev/nvidia2 (GPU-f8ef4daa-4956-327d-fa81-1f9168e99402 at 00000000:81:00.0)
I1109 19:36:56.759505 2558903 nvc_info.c:713] listing device /dev/nvidia1 (GPU-c725c00f-684f-0913-2178-b85c8589f26d at 00000000:c2:00.0)
I1109 19:36:56.759543 2558903 nvc_mount.c:366] mounting tmpfs at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nvidia
I1109 19:36:56.759937 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-smi
I1109 19:36:56.759990 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-debugdump
I1109 19:36:56.760025 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-persistenced
I1109 19:36:56.760059 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-cuda-mps-control
I1109 19:36:56.760091 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-cuda-mps-server
I1109 19:36:56.760250 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03
I1109 19:36:56.760294 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03
I1109 19:36:56.760339 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.470.141.03
I1109 19:36:56.760373 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
I1109 19:36:56.760409 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03
I1109 19:36:56.760448 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I1109 19:36:56.760482 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03
I1109 19:36:56.760518 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03
I1109 19:36:56.760556 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03
I1109 19:36:56.760591 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03
I1109 19:36:56.760625 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03
I1109 19:36:56.760645 2558903 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I1109 19:36:56.760668 2558903 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so -> libnvidia-opticalflow.so.1
I1109 19:36:56.761145 2558903 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/470.141.03/gsp.bin at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/lib/firmw
are/nvidia/470.141.03/gsp.bin with flags 0x7
I1109 19:36:56.761247 2558903 nvc_mount.c:261] mounting /run/nvidia-persistenced/socket at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/run/nvidia-persisten
ced/socket
I1109 19:36:56.761280 2558903 nvc_mount.c:230] mounting /dev/nvidiactl at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidiactl
I1109 19:36:56.761449 2558903 nvc_mount.c:230] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia-uvm
I1109 19:36:56.761531 2558903 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia-uvm-tools
I1109 19:36:56.761624 2558903 nvc_mount.c:230] mounting /dev/nvidia4 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia4
I1109 19:36:56.761696 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nv
idia/gpus/0000:01:00.0
I1109 19:36:56.761792 2558903 nvc_mount.c:230] mounting /dev/nvidia3 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia3
I1109 19:36:56.761837 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:47:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nv
idia/gpus/0000:47:00.0
I1109 19:36:56.761926 2558903 nvc_mount.c:230] mounting /dev/nvidia2 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia2
I1109 19:36:56.761969 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:81:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nv
idia/gpus/0000:81:00.0
I1109 19:36:56.762058 2558903 nvc_mount.c:230] mounting /dev/nvidia1 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia1
I1109 19:36:56.762098 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:c2:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nv
idia/gpus/0000:c2:00.0
I1109 19:36:56.762144 2558903 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged
I1109 19:36:56.800505 2558903 nvc.c:434] shutting down library context
I1109 19:36:56.800661 2558911 rpc.c:95] terminating nvcgo rpc service
I1109 19:36:56.801331 2558903 rpc.c:135] nvcgo rpc service terminated successfully
I1109 19:36:56.802516 2558910 rpc.c:95] terminating driver rpc service
I1109 19:36:56.802611 2558903 rpc.c:135] driver rpc service terminated successfully
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
You need to define a runtimeClass named nvidia with handler nvidia.
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  name: nvidia
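For reference, a RuntimeClass is a cluster-scoped resource, so the manifest above can be saved to any file and applied with kubectl. A minimal sketch (the file name runtimeclass-nvidia.yaml is only an example):

# Create the RuntimeClass from the manifest above
kubectl apply -f runtimeclass-nvidia.yaml
# Confirm the resource and its handler were registered
kubectl get runtimeclass nvidia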
@hholst80 in which file should I define a runtimeClass?
@captainsk7 the nvidia runtimeClass is created by default by the gpu-operator and set as the default runtime within the containerd config. The error failed to get sandbox runtime: no runtime for "nvidia" is configured is expected while the driver is still being installed, but it should recover after that. Can you provide more details about your config? (One way to collect these is sketched after the list below.)
- Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?
- OS version
- Containerd config (/etc/containerd/config.toml)
- Status of all pods under gpu-operator namespace
- Logs from init-containers of device-plugin and container-toolkit. (kubectl logs --all-containers -lapp=nvidia-device-plugin-daemonset -n gpu-operator)
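A rough sketch of commands that gather the details above (adjust the namespace if the operator was installed elsewhere, e.g. kube-system; the container-toolkit label is assumed to follow the same pattern as the device-plugin one):

# OS and containerd configuration on a GPU node
cat /etc/os-release
cat /etc/containerd/config.toml
# Pod status and init-container logs in the operator namespace
kubectl get pods -n gpu-operator
kubectl logs --all-containers -l app=nvidia-device-plugin-daemonset -n gpu-operator
kubectl logs --all-containers -l app=nvidia-container-toolkit-daemonset -n gpu-operator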
@shivamerla thanks for the reply.
I have created a multi-node k0s Kubernetes cluster using this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu
I'm getting the same error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.
1. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?
- on both worker nodes the drivers/container-toolkit are pre-installed.
- on the controller node they are not installed, because it is a non-GPU machine.
2. OS version: Ubuntu 20.04.5 LTS
3. Status of all pods under gpu-operator namespace
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-jc4wt 0/1 Init:0/1 0 18h
gpu-feature-discovery-r27zv 0/1 Init:0/1 0 18h
gpu-operator-1673351272-node-feature-discovery-master-65d8hl88v 1/1 Running 0 18h
gpu-operator-1673351272-node-feature-discovery-worker-8j72k 1/1 Running 0 18h
gpu-operator-1673351272-node-feature-discovery-worker-wj5gd 1/1 Running 0 18h
gpu-operator-95b545d6f-r2cnp 1/1 Running 0 18h
nvidia-container-toolkit-daemonset-lg79g 1/1 Running 0 18h
nvidia-container-toolkit-daemonset-q26kq 1/1 Running 0 18h
nvidia-dcgm-exporter-2vpwj 0/1 Init:0/1 0 18h
nvidia-dcgm-exporter-gx6dv 0/1 Init:0/1 0 18h
nvidia-device-plugin-daemonset-tbbgb 0/1 Init:0/1 0 18h
nvidia-device-plugin-daemonset-z29kx 0/1 Init:0/1 0 18h
nvidia-operator-validator-79s4j 0/1 Init:0/4 0 18h
nvidia-operator-validator-thbq2 0/1 Init:0/4 0 18h
4. Logs from init-containers
from device-plugin
Error from server (BadRequest): container "toolkit-validation" in pod "nvidia-device-plugin-daemonset-tbbgb" is waiting to start: PodInitializing
from container-toolkit
time="2023-01-10T11:57:43Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:57:43Z" level=info msg="Config version: 2"
time="2023-01-10T11:57:43Z" level=info msg="Updating config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully updated config"
time="2023-01-10T11:57:43Z" level=info msg="Flushing config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:57:43Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:57:43Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:57:43Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:57:43Z" level=info msg="Waiting for signal"
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
time="2023-01-10T11:51:53Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:51:53Z" level=info msg="Config version: 2"
time="2023-01-10T11:51:53Z" level=info msg="Updating config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully updated config"
time="2023-01-10T11:51:53Z" level=info msg="Flushing config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:51:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:51:53Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:51:53Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:51:53Z" level=info msg="Waiting for signal"
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1601 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1601 G /usr/lib/xorg/Xorg 9MiB |
| 1 N/A N/A 1736 G /usr/bin/gnome-shell 8MiB |
+-----------------------------------------------------------------------------+
@captainsk7 can you get the output of kubectl logs nvidia-operator-validator-79s4j -n gpu-operator -c driver-validation, and also double-check that the "nvidia" runtime settings are correct in /etc/containerd/config.toml?
@shivamerla the output of kubectl logs nvidia-operator-validator-79s4j -n gpu-operator -c driver-validation is
Error from server (BadRequest): container "driver-validation" in pod "nvidia-operator-validator-79s4j" is waiting to start: PodInitializing
and the "nvidia" runtime settings are:
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
The config looks good; somehow containerd might not be picking up this change. Did you confirm that the config file actually used by containerd (/etc/k0s/containerd.toml) is the one you changed? Please also try restarting the containerd service to confirm.
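One way to check both points, as a rough sketch (it assumes crictl is installed and pointed at the containerd socket; the exact service name depends on how k0s launches containerd):

# See which config file the running containerd was started with (look for a --config flag)
ps -ef | grep '[c]ontainerd'
# After restarting containerd, confirm the CRI plugin now lists the nvidia runtime
crictl info | grep -i -A 5 '"nvidia"'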
@shivamerla yes, the file (/etc/k0s/containerd.toml) is changed on both worker nodes
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
@shivamerla reply please
@captainsk7 We have to debug a couple of things.
- Whether nvidia-container-runtime is getting invoked at all. This can be debugged by following the steps here.
- Whether containerd is picking up the config file changes:
2.1. Confirm the correct config is set up and restart containerd.
2.2. Spin up additional sample pods with runtimeClass set to nvidia to verify that a container can start (a sketch follows this comment).
2.3. Create a complete containerd config file with "containerd config default > /etc/k0s/containerd.toml" and then install the GPU Operator, to confirm that all required fields are set. SystemdCgroup = true is required for K8s 1.25 and above.
I am not too familiar with k0s and it is not something we test internally, but the above steps will ensure the runtime is set up correctly.
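For step 2.2, a minimal test pod could look like the sketch below; the pod name and image tag are only examples. A CUDA base image is used because it typically sets NVIDIA_VISIBLE_DEVICES=all, which exercises the nvidia runtime directly without going through the device plugin:

# Launch a throwaway pod that explicitly requests the nvidia runtime class
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-runtime-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.3-base-ubuntu20.04
    command: ["nvidia-smi"]
EOF
# If the runtime is wired up correctly, the pod should reach Completed and print the GPU table
kubectl logs nvidia-runtime-test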
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
This is not really an error and is expected until the drivers are loaded and nvidia-container-runtime is set up. With driver containers, it takes about 3-4 minutes to set up and load the drivers and toolkit; after that these errors should go away. The reason they appear is that we run some operands (device-plugin, gfd, dcgm etc.) with runtimeClass set to nvidia, which causes the above error until the drivers/toolkit are ready.
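A simple way to watch for that recovery is to wait on the operand pods; a sketch, with the namespace and label taken from the commands earlier in this thread:

# Block until the device-plugin pods report Ready, or give up after 10 minutes
kubectl -n gpu-operator wait --for=condition=Ready pod -l app=nvidia-device-plugin-daemonset --timeout=600s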
I'd expect the gpu-operator to deploy the nvidia-driver-daemonset first, before everything else, and only once all of its pods reach the Running state should it deploy the other daemonsets. Otherwise a race condition occurs while the other daemonsets are up and running: failed to get sandbox runtime: no runtime for "nvidia" is configured.
cc @shivamerla
In our case, the cluster is up and running for a few days and everything with nvidia works. But suddenly, after a few days, we hit an issue where the pods don't see nvidia-smi. Only restarting the nvidia-driver-daemonset resolves the issue, and when we restart the nvidia-driver-daemonset it throws the above error on another nvidia- pod.
Did anyone face the same issue?
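As a side note, restarting the driver daemonset does not require deleting pods by hand; a hedged one-liner, assuming the default daemonset name from the comment above and the gpu-operator namespace:

# Trigger a rolling restart of the driver daemonset and wait for it to come back
kubectl -n gpu-operator rollout restart daemonset/nvidia-driver-daemonset
kubectl -n gpu-operator rollout status daemonset/nvidia-driver-daemonset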
Getting the same issue on a kubeadm setup. Need help.
I faced similar problems, but using microk8s on a single DGX machine with an H100. The problem for me was that the default cuda operators didn't support the H100 with the latest nvidia driver. I was pointed to the fix by this GitHub issue, and it worked for me.