k8s-device-plugin
Pod stuck 'pending', nvidia-device-plugin consuming 100% CPU
### 1. Issue or feature description
`nvidia-device-plugin` sits at 100% CPU when a new Pod with a GPU requirement is scheduled. The Pod is stuck as 'Pending', with no further failure or error, either from the container or from Kubernetes itself.
Commands on the host such as `nvidia-smi` work prior to scheduling a Pod with a GPU requirement. Once this behaviour is triggered, I'm no longer able to run such commands until the host is rebooted.
### 2. Steps to reproduce the issue
The Kubernetes cluster is K3s, version `v1.22.9+k3s1`.
The cluster has seven nodes: three servers, three workers, and a fourth worker with a pair of A100 GPUs. All nodes run Ubuntu 20.04 with kernel `5.4.0-109-generic`. They're virtual machines, with the GPU VM being provided with the GPUs via PCI pass-through (see the `nvidia-smi` output below).
The GPU node has `nvidia-container-toolkit` version `1.9.0-1` installed along with `nvidia-driver-470-server` version `470.103.01-0ubuntu0.20.04.1`.
Once the cluster is up, NFD is deployed with `kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.0"`.
With NFD in place, the device plugin is installed by templating the Helm chart and adding a `nodeSelector` with the PCI device label corresponding to the node that has the GPUs:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: release-name-nvidia-device-plugin
  namespace: kube-system
  labels:
    helm.sh/chart: nvidia-device-plugin-0.11.0
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.11.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
      app.kubernetes.io/instance: release-name
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
        app.kubernetes.io/instance: release-name
    spec:
      priorityClassName: "system-node-critical"
      nodeSelector:
        feature.node.kubernetes.io/pci-0302_10de.present: "true"
      runtimeClassName: nvidia
      securityContext: {}
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.11.0
          imagePullPolicy: IfNotPresent
          name: nvidia-device-plugin-ctr
          args:
            - "--mig-strategy=none"
            - "--pass-device-specs=false"
            - "--fail-on-init-error=true"
            - "--device-list-strategy=envvar"
            - "--device-id-strategy=uuid"
            - "--nvidia-driver-root=/"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
```
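The `nodeSelector` above keys off an NFD-generated PCI feature label of the form `pci-<class>_<vendor>.present`. As an illustrative sketch (not part of the deployment), the label used here decodes to the standard PCI IDs for a 3D controller from NVIDIA:

```python
# NFD PCI labels look like: feature.node.kubernetes.io/pci-<class>_<vendor>.present
LABEL = "feature.node.kubernetes.io/pci-0302_10de.present"

# Strip the namespace prefix and the ".present" suffix, leaving "pci-0302_10de".
pci_part = LABEL.split("/")[1].split(".")[0]
dev_class, vendor = pci_part[len("pci-"):].split("_")

# 0x0302 = display controller (0x03), subclass 3D controller (0x02); 0x10de = NVIDIA
print(dev_class, vendor)  # → 0302 10de
```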
Once the plugin has deployed, the node is successfully updated to reflect the available GPUs:
```
$ kubectl get node sandbox-worker-gpu-instance1 -o jsonpath="{.status.allocatable}"
{"cpu":"32","ephemeral-storage":"39369928059","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"263940608Ki","nvidia.com/gpu":"2","pods":"110"}
```
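The advertised capacity can also be checked programmatically; a small sketch, using the sample data from the output above (extended resources such as `nvidia.com/gpu` are reported as integer strings):

```python
import json

# Sample copied from the `kubectl get node ... -o jsonpath` output above.
allocatable = json.loads(
    '{"cpu":"32","ephemeral-storage":"39369928059","hugepages-1Gi":"0",'
    '"hugepages-2Mi":"0","memory":"263940608Ki","nvidia.com/gpu":"2","pods":"110"}'
)

# A missing key means the device plugin never registered the resource.
gpus = int(allocatable.get("nvidia.com/gpu", "0"))
print(f"advertised GPUs: {gpus}")  # → advertised GPUs: 2
```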
Attempting to deploy a test Pod that targets this node then triggers the problem:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 2
```
```
$ kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
cuda-vector-add   0/1     Pending   0          2m45s
```

```
$ kubectl describe pod cuda-vector-add
Name:         cuda-vector-add
Namespace:    default
Priority:     0
Node:         sandbox-worker-gpu-instance1/
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  cuda-vector-add:
    Image:      k8s.gcr.io/cuda-vector-add:v0.1
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  2
    Requests:
      nvidia.com/gpu:  2
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nnnzt (ro)
Conditions:
  Type           Status
  PodScheduled   True
Volumes:
  kube-api-access-nnnzt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m56s  default-scheduler  Successfully assigned default/cuda-vector-add to sandbox-worker-gpu-instance1
```

```
$ kubectl get events | head -3
LAST SEEN   TYPE     REASON      OBJECT                MESSAGE
51m         Normal   Scheduled   pod/cuda-vector-add   Successfully assigned default/cuda-vector-add to sandbox-worker-gpu-instance1
4m18s       Normal   Scheduled   pod/cuda-vector-add   Successfully assigned default/cuda-vector-add to sandbox-worker-gpu-instance1
```
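The telling combination in the describe output is phase `Pending` together with condition `PodScheduled: True`: scheduling succeeded, so the pod is stuck on the node side (kubelet, runtime, or device plugin) rather than in the scheduler. An illustrative triage helper, with the status dict modelled on the output above:

```python
def stuck_after_scheduling(status: dict) -> bool:
    """True when a pod was scheduled but never progressed past Pending,
    which points at the node (kubelet/runtime/device plugin) rather than
    at the scheduler."""
    scheduled = any(
        c["type"] == "PodScheduled" and c["status"] == "True"
        for c in status.get("conditions", [])
    )
    return status.get("phase") == "Pending" and scheduled

# Modelled on the `kubectl describe pod cuda-vector-add` output above.
status = {"phase": "Pending",
          "conditions": [{"type": "PodScheduled", "status": "True"}]}
print(stuck_after_scheduling(status))  # → True: look at the node, not the scheduler
```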
There are no additional logs from the `nvidia-device-plugin` container.
### 3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [X] The output of `nvidia-smi -a` on your host:
==============NVSMI LOG==============
Timestamp : Thu May 5 12:29:57 2022
Driver Version : 470.103.01
CUDA Version : 11.4
Attached GPUs : 2
GPU 00000000:00:05.0
Product Name : NVIDIA A100-SXM4-80GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560221017003
GPU UUID : GPU-6d777e6b-bc4d-212a-301c-82001966b4f0
Minor Number : 0
VBIOS Version : 92.00.36.00.10
MultiGPU Board : No
Board ID : 0x5
GPU Part Number : 692-2G506-0212-002
Module ID : 2
Inforom Version
Image Version : G506.0212.00.01
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x05
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:00:05.0
Sub System Id : 0x147F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81251 MiB
Used : 0 MiB
Free : 81251 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 37 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 54 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 73.30 W
Power Limit : 500.00 W
Default Power Limit : 500.00 W
Enforced Power Limit : 500.00 W
Min Power Limit : 100.00 W
Max Power Limit : 500.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1593 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1275 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1275 MHz
Memory : 1593 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 743.750 mV
Processes : None
GPU 00000000:00:06.0
Product Name : NVIDIA A100-SXM4-80GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560221017105
GPU UUID : GPU-1df39da7-ba1a-c950-de6d-394162582846
Minor Number : 1
VBIOS Version : 92.00.36.00.10
MultiGPU Board : No
Board ID : 0x6
GPU Part Number : 692-2G506-0212-002
Module ID : 3
Inforom Version
Image Version : G506.0212.00.01
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x06
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:00:06.0
Sub System Id : 0x147F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81251 MiB
Used : 0 MiB
Free : 81251 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 31 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 48 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 72.09 W
Power Limit : 500.00 W
Default Power Limit : 500.00 W
Enforced Power Limit : 500.00 W
Min Power Limit : 100.00 W
Max Power Limit : 500.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1593 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1275 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1275 MHz
Memory : 1593 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 750.000 mV
Processes : None
- [X] Your docker configuration file (e.g. `/etc/docker/daemon.json`)
The equivalent here is the containerd configuration:
```toml
[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/995f5a281daabc1838b33f2346f7c4976b95f449c703b6f1f55b981966eba456/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```
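For the `runtimeClassName: nvidia` reference in the DaemonSet to resolve to the containerd runtime defined above, a matching RuntimeClass object also needs to exist in the cluster. A minimal sketch (illustrative; whether one is already present depends on the distribution, and the `handler` must match the containerd runtime name):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # must match [plugins.cri.containerd.runtimes."nvidia"]
```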
- [X] The k8s-device-plugin container logs
```
2022/05/05 12:29:36 Loading NVML
2022/05/05 12:29:45 Starting FS watcher.
2022/05/05 12:29:45 Starting OS watcher.
2022/05/05 12:29:45 Retreiving plugins.
2022/05/05 12:29:45 Starting GRPC server for 'nvidia.com/gpu'
2022/05/05 12:29:45 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/05/05 12:29:45 Registered device plugin for 'nvidia.com/gpu' with Kubelet
```
- [X] The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)
May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: time="2022-05-05T12:29:18Z" level=info msg="Running kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --cni-bin-dir=/var/lib/rancher/k3s/data/995f5a281daabc1838b33f2346f7c4976b95f449c703b6f1f55b981966eba456/bin --cni-conf-dir=/var/lib/rancher/k3s/agent/etc/cni/net.d --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --container-runtime=remote --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=sandbox-worker-gpu-instance1 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key"
May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: Flag --cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
May 05 12:29:18 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:18.439787 1106 server.go:436] "Kubelet version" kubeletVersion="v1.22.9+k3s1"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.478082 1106 container_manager_linux.go:285] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.479668 1106 kubelet.go:418] "Attempting to sync node with API server"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.480113 1106 kubelet.go:279] "Adding static pod path" path="/var/lib/rancher/k3s/agent/pod-manifests"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.480135 1106 kubelet.go:290] "Adding apiserver pod source"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.485114 1106 server.go:1213] "Started kubelet"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.486909 1106 volume_manager.go:291] "Starting Kubelet Volume Manager"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: E0505 12:29:23.487046 1106 kubelet.go:1343] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.499744 1106 server.go:409] "Adding debug handlers to kubelet server"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.533108 1106 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv4
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.547350 1106 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.547407 1106 kubelet.go:2006] "Starting kubelet main sync loop"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: E0505 12:29:23.547483 1106 kubelet.go:2030] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.568188 1106 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.588528 1106 kubelet_network.go:76] "Updating Pod CIDR" originalPodCIDR="" newPodCIDR="10.42.6.0/24"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.590636 1106 kubelet_node_status.go:71] "Attempting to register node" node="sandbox-worker-gpu-instance1"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.603663 1106 kubelet_node_status.go:109] "Node was previously registered" node="sandbox-worker-gpu-instance1"
May 05 12:29:23 sandbox-worker-gpu-instance1 k3s[1106]: I0505 12:29:23.603744 1106 kubelet_node_status.go:74] "Successfully registered node" node="sandbox-worker-gpu-instance1"
- [X] NVIDIA container library logs (see troubleshooting)
I0505 13:17:23.667007 3793 rpc.c:71] starting nvcgo rpc service
I0505 13:17:23.668270 3723 nvc_container.c:240] configuring container with 'utility supervised'
I0505 13:17:23.669843 3723 nvc_container.c:262] setting pid to 3717
I0505 13:17:23.669861 3723 nvc_container.c:263] setting rootfs to /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs
I0505 13:17:23.669867 3723 nvc_container.c:264] setting owner to 0:0
I0505 13:17:23.669876 3723 nvc_container.c:265] setting bins directory to /usr/bin
I0505 13:17:23.669881 3723 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0505 13:17:23.669890 3723 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0505 13:17:23.669899 3723 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0505 13:17:23.669906 3723 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0505 13:17:23.669912 3723 nvc_container.c:270] setting mount namespace to /proc/3717/ns/mnt
I0505 13:17:23.669919 3723 nvc_container.c:272] detected cgroupv1
I0505 13:17:23.669926 3723 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/devices/kubepods/besteffort/pod9efe2b12-595a-4c81-aa50-591805794679/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282
I0505 13:17:23.669939 3723 nvc_info.c:765] requesting driver information with ''
I0505 13:17:23.671291 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.103.01
I0505 13:17:23.671348 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.103.01
I0505 13:17:23.671384 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.103.01
I0505 13:17:23.671420 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.103.01
I0505 13:17:23.671463 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.103.01
I0505 13:17:23.671505 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.103.01
I0505 13:17:23.671538 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.103.01
I0505 13:17:23.671572 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
I0505 13:17:23.671612 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.103.01
I0505 13:17:23.671654 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.103.01
I0505 13:17:23.671728 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.103.01
I0505 13:17:23.671761 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.103.01
I0505 13:17:23.671796 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.103.01
I0505 13:17:23.671842 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.103.01
I0505 13:17:23.671884 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.103.01
I0505 13:17:23.671920 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.103.01
I0505 13:17:23.671954 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01
I0505 13:17:23.671994 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.103.01
I0505 13:17:23.672030 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.103.01
I0505 13:17:23.672073 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.103.01
I0505 13:17:23.672172 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.103.01
I0505 13:17:23.672247 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.103.01
I0505 13:17:23.672286 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.103.01
I0505 13:17:23.672324 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.103.01
I0505 13:17:23.672359 3723 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.103.01
W0505 13:17:23.672381 3723 nvc_info.c:398] missing library libnvidia-nscq.so
W0505 13:17:23.672394 3723 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0505 13:17:23.672401 3723 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0505 13:17:23.672408 3723 nvc_info.c:398] missing library libvdpau_nvidia.so
W0505 13:17:23.672415 3723 nvc_info.c:402] missing compat32 library libnvidia-ml.so
W0505 13:17:23.672421 3723 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0505 13:17:23.672428 3723 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0505 13:17:23.672434 3723 nvc_info.c:402] missing compat32 library libcuda.so
W0505 13:17:23.672443 3723 nvc_info.c:402] missing compat32 library libnvidia-opencl.so
W0505 13:17:23.672450 3723 nvc_info.c:402] missing compat32 library libnvidia-ptxjitcompiler.so
W0505 13:17:23.672456 3723 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0505 13:17:23.672464 3723 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0505 13:17:23.672469 3723 nvc_info.c:402] missing compat32 library libnvidia-compiler.so
W0505 13:17:23.672474 3723 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0505 13:17:23.672480 3723 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0505 13:17:23.672486 3723 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0505 13:17:23.672490 3723 nvc_info.c:402] missing compat32 library libnvidia-encode.so
W0505 13:17:23.672496 3723 nvc_info.c:402] missing compat32 library libnvidia-opticalflow.so
W0505 13:17:23.672502 3723 nvc_info.c:402] missing compat32 library libnvcuvid.so
W0505 13:17:23.672508 3723 nvc_info.c:402] missing compat32 library libnvidia-eglcore.so
W0505 13:17:23.672513 3723 nvc_info.c:402] missing compat32 library libnvidia-glcore.so
W0505 13:17:23.672519 3723 nvc_info.c:402] missing compat32 library libnvidia-tls.so
W0505 13:17:23.672525 3723 nvc_info.c:402] missing compat32 library libnvidia-glsi.so
W0505 13:17:23.672529 3723 nvc_info.c:402] missing compat32 library libnvidia-fbc.so
W0505 13:17:23.672535 3723 nvc_info.c:402] missing compat32 library libnvidia-ifr.so
W0505 13:17:23.672541 3723 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0505 13:17:23.672548 3723 nvc_info.c:402] missing compat32 library libnvoptix.so
W0505 13:17:23.672554 3723 nvc_info.c:402] missing compat32 library libGLX_nvidia.so
W0505 13:17:23.672569 3723 nvc_info.c:402] missing compat32 library libEGL_nvidia.so
W0505 13:17:23.672575 3723 nvc_info.c:402] missing compat32 library libGLESv2_nvidia.so
W0505 13:17:23.672581 3723 nvc_info.c:402] missing compat32 library libGLESv1_CM_nvidia.so
W0505 13:17:23.672587 3723 nvc_info.c:402] missing compat32 library libnvidia-glvkspirv.so
W0505 13:17:23.672593 3723 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0505 13:17:23.672920 3723 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0505 13:17:23.672941 3723 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0505 13:17:23.672957 3723 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0505 13:17:23.672981 3723 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0505 13:17:23.672998 3723 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0505 13:17:23.673094 3723 nvc_info.c:424] missing binary nv-fabricmanager
I0505 13:17:23.673120 3723 nvc_info.c:342] listing firmware path /usr/lib/firmware/nvidia/470.103.01/gsp.bin
I0505 13:17:23.673145 3723 nvc_info.c:528] listing device /dev/nvidiactl
I0505 13:17:23.673157 3723 nvc_info.c:528] listing device /dev/nvidia-uvm
I0505 13:17:23.673164 3723 nvc_info.c:528] listing device /dev/nvidia-uvm-tools
I0505 13:17:23.673170 3723 nvc_info.c:528] listing device /dev/nvidia-modeset
I0505 13:17:23.673191 3723 nvc_info.c:342] listing ipc path /run/nvidia-persistenced/socket
W0505 13:17:23.673211 3723 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0505 13:17:23.673230 3723 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0505 13:17:23.673244 3723 nvc_info.c:821] requesting device information with ''
I0505 13:17:23.683518 3723 nvc_info.c:712] listing device /dev/nvidia0 (GPU-6d777e6b-bc4d-212a-301c-82001966b4f0 at 00000000:00:05.0)
I0505 13:17:23.689564 3723 nvc_info.c:712] listing device /dev/nvidia1 (GPU-1df39da7-ba1a-c950-de6d-394162582846 at 00000000:00:06.0)
I0505 13:17:23.689653 3723 nvc_mount.c:366] mounting tmpfs at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/proc/driver/nvidia
I0505 13:17:23.690086 3723 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/bin/nvidia-smi
I0505 13:17:23.690137 3723 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/bin/nvidia-debugdump
I0505 13:17:23.690178 3723 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/bin/nvidia-persistenced
I0505 13:17:23.690289 3723 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
I0505 13:17:23.690333 3723 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01
I0505 13:17:23.690448 3723 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/470.103.01/gsp.bin at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/lib/firmware/nvidia/470.103.01/gsp.bin with flags 0x7
I0505 13:17:23.690520 3723 nvc_mount.c:261] mounting /run/nvidia-persistenced/socket at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/run/nvidia-persistenced/socket
I0505 13:17:23.690565 3723 nvc_mount.c:230] mounting /dev/nvidiactl at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/dev/nvidiactl
I0505 13:17:23.690853 3723 nvc_mount.c:230] mounting /dev/nvidia0 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/dev/nvidia0
I0505 13:17:23.690914 3723 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:00:05.0 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/proc/driver/nvidia/gpus/0000:00:05.0
I0505 13:17:23.691008 3723 nvc_mount.c:230] mounting /dev/nvidia1 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/dev/nvidia1
I0505 13:17:23.691062 3723 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:00:06.0 at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs/proc/driver/nvidia/gpus/0000:00:06.0
I0505 13:17:23.691150 3723 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/65da24a0d6db7df0a5b26d7eb77d78afcc35315c0453b23bca0410d3fdc3d282/rootfs
I0505 13:17:23.720418 3723 nvc.c:430] shutting down library context
I0505 13:17:23.720546 3793 rpc.c:95] terminating nvcgo rpc service
I0505 13:17:23.721419 3723 rpc.c:135] nvcgo rpc service terminated successfully
I0505 13:17:24.227416 3730 rpc.c:95] terminating driver rpc service
I0505 13:17:24.227730 3723 rpc.c:135] driver rpc service terminated successfully
2022/05/05 12:52:07 Using bundle directory:
2022/05/05 12:52:07 Using OCI specification file path: config.json
2022/05/05 12:52:07 Looking for runtime binary 'docker-runc'
2022/05/05 12:52:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2022/05/05 12:52:07 Looking for runtime binary 'runc'
2022/05/05 12:52:07 Found runtime binary '/var/lib/rancher/k3s/data/995f5a281daabc1838b33f2346f7c4976b95f449c703b6f1f55b981966eba456/bin/runc'
2022/05/05 12:52:07 Running /usr/bin/nvidia-container-runtime
2022/05/05 12:52:07 No modification required
2022/05/05 12:52:07 Forwarding command to runtime
Did you manage to find a solution for this one, @yankcrime?
@TornjV Not directly, no. The problem eventually went away with some combination of OS and driver updates, and I haven't seen it recur since.
Same situation here, although not knowing why, or whether it could happen again, is a bit annoying 😄
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.