k8s-device-plugin
When the server restarts, the container has no permission to access the nvidia-uvm/nvidia-uvm-tools/nvidia-modeset devices
/kind bug
/kind cgroup
1. Issue or feature description
When the server restarts, the nvidia-device-plugin-daemonset does not restore the nvidia-uvm/nvidia-modeset devices in the restarted containers correctly
2. Steps to reproduce the issue
- Configure kubelet with the "static" CpuManagerPolicy and the "single-numa-node" TopologyManagerPolicy
root@k8s-t4-node:~$ cat /etc/kubernetes/kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"
failSwapOn: True
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: True
  x509:
    clientCAFile: /etc/kubernetes/ssl/ca.crt
authorization:
  mode: Webhook
staticPodPath: /etc/kubernetes/manifests
cgroupDriver: cgroupfs
maxPods: 110
address: 10.0.0.34
readOnlyPort: 10255
healthzPort: 10248
healthzBindAddress: 127.0.0.1
kubeletCgroups: /systemd/system.slice
clusterDomain: cluster.local
protectKernelDefaults: true
rotateCertificates: true
clusterDNS:
- 169.254.25.10
systemReserved:
  cpu: 2000m
  memory: 2G
kubeReserved:
  cpu: 2000m
  memory: 2G
cpuManagerPolicy: "static"
cpuManagerReconcilePeriod: 10s
topologyManagerPolicy: "single-numa-node"
resolvConf: "/etc/resolv.conf"
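As a sanity check, kubelet's CPU Manager checkpoint should report the static policy once the node has been reconfigured and restarted (a quick check assuming the default kubelet root dir; the path differs if --root-dir is customized):
root@k8s-t4-node:~$ cat /var/lib/kubelet/cpu_manager_state   # should show "policyName":"static"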
- Deploy the nvidia-device-plugin with the cpumanager-compatible manifest
root@k8s-master-node:~$ kubectl apply -f nvidia-device-plugin-compat-with-cpumanager.yml
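This manifest matters here because it is the variant that passes device specs to kubelet (the plugin's pass-device-specs option), so that device-cgroup rules survive CPU Manager container updates. A quick way to confirm the deployed DaemonSet actually has it enabled (DaemonSet name as in the stock manifest; adjust if renamed):
root@k8s-master-node:~$ kubectl -n kube-system get ds nvidia-device-plugin-daemonset -o yaml \
  | grep -iE 'pass.?device.?specs' -A1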
- Deploy a test pod that exercises the CPU Manager and Topology Manager, using the Deployment below
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-sv0-test
  namespace: default
spec:
  replicas: 8
  selector:
    matchLabels:
      appid: k8s-sv0-test
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        appid: k8s-sv0-test
      name: k8s-sv0-test
    spec:
      containers:
      - command:
        - /bin/bash
        - -c
        - while true; do echo $(date) && sleep 1; done
        image: nvidia/cuda:10.0-devel-centos7
        imagePullPolicy: IfNotPresent
        name: k8s-sv0-test-container-1
        resources:
          limits:
            cpu: "8"
            memory: 40Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 40Gi
            nvidia.com/gpu: "1"
        securityContext:
          capabilities:
            add:
            - ALL
          privileged: false
        volumeMounts:
        - mountPath: /mnt/tools
          name: tools
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      nodeSelector:
        accelerate-type: nvidia
      priorityClassName: default-priority
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /mnt/tools
          type: ""
        name: tools
- Check the GPU device status in the container
root@k8s-master-node:~$ kubectl get pods -l 'appid=k8s-sv0-test'
NAME                           READY   STATUS    RESTARTS   AGE
k8s-sv0-test-7c4bcdbc4-52c5g   1/1     Running   0          13h
k8s-sv0-test-7c4bcdbc4-5mx2g   1/1     Running   0          13h
k8s-sv0-test-7c4bcdbc4-7jhqb   1/1     Running   0          13h
k8s-sv0-test-7c4bcdbc4-j6mqh   1/1     Running   0          13h
k8s-sv0-test-7c4bcdbc4-jg5m9   1/1     Running   0          13h
k8s-sv0-test-7c4bcdbc4-p86lp   1/1     Running   0          13h
k8s-sv0-test-7c4bcdbc4-rt7ct   1/1     Running   0          13h
k8s-sv0-test-7c4bcdbc4-xzf5p   1/1     Running   0          13h
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-52c5g -- lsmod | grep nvidia
nvidia_uvm            966698  0
nvidia_drm             48653  0
nvidia_modeset       1177123  1 nvidia_drm
nvidia              19683928  156 nvidia_modeset,nvidia_uvm
drm_kms_helper        179394  2 ast,nvidia_drm
drm                   429744  5 ast,ttm,drm_kms_helper,nvidia_drm
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-52c5g -- ls -t -l /dev | grep nvidia
crw-rw-rw- 1 root root 229,   0 Jan 13 20:41 nvidia-uvm
crw-rw-rw- 1 root root 229,   1 Jan 13 20:41 nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 254 Jan 13 20:25 nvidia-modeset
crw-rw-rw- 1 root root 195,   1 Jan 13 14:56 nvidia1
crw-rw-rw- 1 root root 195, 255 Jan 13 14:56 nvidiactl
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-52c5g -- nvidia-smi topo -m
        GPU0    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NODE    NODE    0-23,48-71      0
mlx5_0  NODE     X      PIX
mlx5_1  NODE    PIX      X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-52c5g -- nvidia-smi --query-gpu=gpu_name,gpu_uuid,gpu_bus_id,vbios_version --format=csv
name, uuid, pci.bus_id, vbios_version
Tesla T4, GPU-aa2cef33-2b58-a1ce-fb6a-e8b9808663ae, 00000000:3E:00.0, 90.04.38.00.03
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-52c5g -- cat /sys/bus/pci/devices/0000\:3e\:00.0/numa_node
0
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-52c5g -- /mnt/tools/samples/0_Simple/matrixMul/matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Tesla T4" with compute capability 7.5

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 406.82 GFlop/s, Time= 0.322 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
root@k8s-t4-node:~$ docker ps | grep k8s-sv0-test-7c4bcdbc4-52c5g | grep -v pause
cc8153dc4cac   28875b98a13a   "/bin/sh -c while true…"   7 minutes ago   Up 7 minutes   k8s_k8s-sv0-test-container-1_k8s-sv0-test-52c5g-6h994_default_7f8f5d67-9bab-4e5d-a6bc-d9e4fa1a7bd1_0
root@k8s-t4-node:~$ docker inspect cc8153dc4cac | jq '.[].HostConfig.Devices' | tr -d '\n| '
[{"PathOnHost":"/dev/nvidiactl","PathInContainer":"/dev/nvidiactl","CgroupPermissions":"rw"},{"PathOnHost":"/dev/nvidia-uvm","PathInContainer":"/dev/nvidia-uvm","CgroupPermissions":"rw"},{"PathOnHost":"/dev/nvidia-uvm-tools","PathInContainer":"/dev/nvidia-uvm-tools","CgroupPermissions":"rw"},{"PathOnHost":"/dev/nvidia-modeset","PathInContainer":"/dev/nvidia-modeset","CgroupPermissions":"rw"},{"PathOnHost":"/dev/nvidia2","PathInContainer":"/dev/nvidia2","CgroupPermissions":"rw"}]
root@k8s-t4-node:~$ docker exec -i cc8153dc4cac /bin/bash -c "cat /proc/self/cgroup | grep device"
2:devices:/kubepods/pod7f8f5d67-9bab-4e5d-a6bc-d9e4fa1a7bd1/cc8153dc4cac755031cadc0e2748baee59f382ea3d0834e0130f2bb197330591
root@k8s-t4-node:~$ cat /sys/fs/cgroup/devices/kubepods/pod7f8f5d67-9bab-4e5d-a6bc-d9e4fa1a7bd1/cc8153dc4cac755031cadc0e2748baee59f382ea3d0834e0130f2bb197330591/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:1 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
c 195:2 rw
c 195:254 rw
c 195:255 rw
c 229:0 rw
c 229:1 rw
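For reference when reading devices.list: major 195 is the NVIDIA frontend (nvidiaN, nvidiactl, nvidia-modeset), while the nvidia-uvm major (229 on this node) is allocated dynamically and can change across reboots or module reloads. A quick way to map the whitelisted majors back to drivers (a diagnostic sketch, run on the node):
root@k8s-t4-node:~$ grep nvidia /proc/devices
root@k8s-t4-node:~$ ls -l /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-modeset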
- Reboot the server and recheck the GPU device status in the container
root@k8s-master-node:~$ kubectl get pods -l 'appid=k8s-sv0-test'
NAME                           READY   STATUS    RESTARTS   AGE
k8s-sv0-test-7c4bcdbc4-8l46s   1/1     Running   1          6m40s
k8s-sv0-test-7c4bcdbc4-96svs   1/1     Running   1          6m40s
k8s-sv0-test-7c4bcdbc4-fgls4   1/1     Running   1          6m40s
k8s-sv0-test-7c4bcdbc4-jzfgp   1/1     Running   1          6m40s
k8s-sv0-test-7c4bcdbc4-kz72h   1/1     Running   1          6m40s
k8s-sv0-test-7c4bcdbc4-tw9nj   1/1     Running   1          6m40s
k8s-sv0-test-7c4bcdbc4-vbg5g   1/1     Running   1          6m40s
k8s-sv0-test-7c4bcdbc4-xv595   1/1     Running   1          6m40s
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-8l46s -- lsmod | grep nvidia
nvidia_uvm            966698  0
nvidia_drm             48653  0
nvidia_modeset       1177123  1 nvidia_drm
nvidia              19683928  156 nvidia_modeset,nvidia_uvm
drm_kms_helper        179394  2 ast,nvidia_drm
drm                   429744  5 ast,ttm,drm_kms_helper,nvidia_drm
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-8l46s -- ls -t -l /dev | grep nvidia
crw-rw-rw- 1 root root 195, 254 Jan 14 11:18 nvidia-modeset
crw-rw-rw- 1 root root 229,   0 Jan 14 11:18 nvidia-uvm
crw-rw-rw- 1 root root 229,   1 Jan 14 11:18 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   4 Jan 14 11:18 nvidia4
crw-rw-rw- 1 root root 195, 255 Jan 14 11:18 nvidiactl
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-8l46s -- nvidia-smi topo -m
        GPU0    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      SYS     SYS     24-47,72-95     1
mlx5_0  SYS      X      PIX
mlx5_1  SYS     PIX      X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-8l46s -- nvidia-smi --query-gpu=gpu_name,gpu_uuid,gpu_bus_id,vbios_version --format=csv
name, uuid, pci.bus_id, vbios_version
Tesla T4, GPU-5678bde6-66f9-b29c-a344-b421c9896d0b, 00000000:B1:00.0, 90.04.38.00.03
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-8l46s -- cat /sys/bus/pci/devices/0000\:b1\:00.0/numa_node
1
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-8l46s -- /mnt/tools/samples/0_Simple/matrixMul/matrixMul
[Matrix Multiply Using CUDA] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:708 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
command terminated with exit code 1
root@k8s-t4-node:~$ docker ps | grep k8s-sv0-test-7c4bcdbc4-8l46s | grep -v pause
caebc32e21de   28875b98a13a   "/bin/sh -c /nfs/pro…"   3 hours ago   Up 3 hours   k8s_k8s-sv0-test-container-1_k8s-sv0-test-7c4bcdbc4-8l46s_default_f18597a0-4f05-4cf4-a454-e57370cbde6c_6
root@k8s-t4-node:~$ docker inspect caebc32e21de | jq '.[].HostConfig.Devices' | tr -d '\n| '
[{"PathOnHost":"/dev/nvidiactl","PathInContainer":"/dev/nvidiactl","CgroupPermissions":"rw"},{"PathOnHost":"/dev/nvidia4","PathInContainer":"/dev/nvidia4","CgroupPermissions":"rw"}]
root@k8s-t4-node:~$ docker exec -i caebc32e21de /bin/bash -c "cat /proc/self/cgroup | grep device"
2:devices:/kubepods/podf18597a0-4f05-4cf4-a454-e57370cbde6c/caebc32e21de39bdbdcbbc3b1d8b1417bbc13afb85618871d558083be5886081
root@k8s-t4-node:~$ cat /sys/fs/cgroup/devices/kubepods/podf18597a0-4f05-4cf4-a454-e57370cbde6c/caebc32e21de39bdbdcbbc3b1d8b1417bbc13afb85618871d558083be5886081/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:1 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
c 195:4 rw
c 195:255 rw
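Compared with the pre-reboot state, the post-reboot devices.list has lost the c 229:0, c 229:1, and c 195:254 entries, matching the shrunken HostConfig.Devices list above. On cgroups v1 the missing entries can be re-whitelisted by hand to unblock a running container while debugging. This is only an untested stopgap sketch (kubelet or the runtime may re-apply its own rules on the next container update), using the cgroup path shown above:
root@k8s-t4-node:~$ CG=/sys/fs/cgroup/devices/kubepods/podf18597a0-4f05-4cf4-a454-e57370cbde6c/caebc32e21de39bdbdcbbc3b1d8b1417bbc13afb85618871d558083be5886081
root@k8s-t4-node:~$ echo 'c 229:0 rw' > $CG/devices.allow    # nvidia-uvm
root@k8s-t4-node:~$ echo 'c 229:1 rw' > $CG/devices.allow    # nvidia-uvm-tools
root@k8s-t4-node:~$ echo 'c 195:254 rw' > $CG/devices.allow  # nvidia-modeset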
- After the server reboots, kubelet restarts the container, but the container has no permission to access the nvidia-uvm device, as the strace below shows
root@k8s-master-node:~$ kubectl exec -i k8s-sv0-test-7c4bcdbc4-8l46s -- strace -f /mnt/tools/samples/0_Simple/matrixMul/matrixMul execve("/mnt/tools/samples/0_Simple/matrixMul/matrixMul", ["/mnt/tools/samples/0_Simple/mat"...], 0x7fff52bd8228 /* 32 vars */) = 0 brk(NULL) = 0x1b5d000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dce000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/usr/local/cuda/extras/CUPTI/lib64/tls/x86_64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/cuda/extras/CUPTI/lib64/tls/x86_64", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/cuda/extras/CUPTI/lib64/tls/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/cuda/extras/CUPTI/lib64/tls", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/cuda/extras/CUPTI/lib64/x86_64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/cuda/extras/CUPTI/lib64/x86_64", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/cuda/extras/CUPTI/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/cuda/extras/CUPTI/lib64", {st_mode=S_IFDIR|0755, st_size=75, ...}) = 0 open("/usr/local/nvidia/lib/tls/x86_64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib/tls/x86_64", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/nvidia/lib/tls/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib/tls", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/nvidia/lib/x86_64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib/x86_64", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/nvidia/lib/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/nvidia/lib64/tls/x86_64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib64/tls/x86_64", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/nvidia/lib64/tls/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib64/tls", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/nvidia/lib64/x86_64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib64/x86_64", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/usr/local/nvidia/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/nvidia/lib64", 0x7ffc8533ef40) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=54538, ...}) = 0 mmap(NULL, 54538, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3487dc0000 close(3) = 0 open("/usr/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340!\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=43776, ...}) = 0 mmap(NULL, 2128920, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f34879a6000 mprotect(0x7f34879ad000, 2093056, PROT_NONE) = 0 mmap(0x7f3487bac000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) 
= 0x7f3487bac000 close(3) = 0 open("/usr/local/cuda/extras/CUPTI/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/usr/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260l\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=141968, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dbf000 mmap(NULL, 2208904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f348778a000 mprotect(0x7f34877a1000, 2093056, PROT_NONE) = 0 mmap(0x7f34879a0000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f34879a0000 mmap(0x7f34879a2000, 13448, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f34879a2000 close(3) = 0 open("/usr/local/cuda/extras/CUPTI/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/usr/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\r\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=19288, ...}) = 0 mmap(NULL, 2109712, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f3487586000 mprotect(0x7f3487588000, 2097152, PROT_NONE) = 0 mmap(0x7f3487788000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f3487788000 close(3) = 0 open("/usr/local/cuda/extras/CUPTI/lib64/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/usr/lib64/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\265\5\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=995840, ...}) = 0 mmap(NULL, 3175456, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f348727e000 mprotect(0x7f3487367000, 2097152, PROT_NONE) = 0 mmap(0x7f3487567000, 40960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xe9000) = 0x7f3487567000 mmap(0x7f3487571000, 82976, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3487571000 close(3) = 0 open("/usr/local/cuda/extras/CUPTI/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/usr/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20S\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1137024, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dbe000 mmap(NULL, 3150120, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f3486f7c000 mprotect(0x7f348707d000, 2093056, PROT_NONE) = 0 mmap(0x7f348727c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x100000) = 0x7f348727c000 close(3) = 0 open("/usr/local/cuda/extras/CUPTI/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/usr/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360*\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=88720, ...}) = 0 mmap(NULL, 2184192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f3486d66000 mprotect(0x7f3486d7b000, 2093056, PROT_NONE) = 0 mmap(0x7f3486f7a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f3486f7a000 close(3) = 0 open("/usr/local/cuda/extras/CUPTI/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/usr/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, 
"\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340$\2\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=2151672, ...}) = 0 mmap(NULL, 3981792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f3486999000 mprotect(0x7f3486b5b000, 2097152, PROT_NONE) = 0 mmap(0x7f3486d5b000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c2000) = 0x7f3486d5b000 mmap(0x7f3486d61000, 16864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3486d61000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dbd000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dbc000 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dba000 arch_prctl(ARCH_SET_FS, 0x7f3487dba740) = 0 mprotect(0x7f3486d5b000, 16384, PROT_READ) = 0 mprotect(0x7f3486f7a000, 4096, PROT_READ) = 0 mprotect(0x7f348727c000, 4096, PROT_READ) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487db9000 mprotect(0x7f3487567000, 32768, PROT_READ) = 0 mprotect(0x7f3487788000, 4096, PROT_READ) = 0 mprotect(0x7f34879a0000, 4096, PROT_READ) = 0 mprotect(0x7f3487bac000, 4096, PROT_READ) = 0 mprotect(0x68d000, 12288, PROT_READ) = 0 mprotect(0x7f3487dcf000, 4096, PROT_READ) = 0 munmap(0x7f3487dc0000, 54538) = 0 set_tid_address(0x7f3487dbaa10) = 252 set_robust_list(0x7f3487dbaa20, 24) = 0 rt_sigaction(SIGRTMIN, {sa_handler=0x7f3487790790, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f34877995d0}, NULL, 8) = 0 rt_sigaction(SIGRT_1, {sa_handler=0x7f3487790820, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f34877995d0}, NULL, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 getrlimit(RLIMIT_STACK, {rlim_cur=102400*1024, rlim_max=102400*1024}) = 0 brk(NULL) = 0x1b5d000 brk(0x1b7e000) = 0x1b7e000 brk(NULL) = 0x1b7e000 futex(0x690eb0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 futex(0x7f348758396c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 futex(0x7f3487583978, FUTEX_WAKE_PRIVATE, 2147483647) = 0 fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcd000 futex(0x7f34877890b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 open("/usr/local/cuda/extras/CUPTI/lib64/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=54538, ...}) = 0 mmap(NULL, 54538, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3487dab000 close(3) = 0 open("/usr/lib64/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220&\r\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=19408552, ...}) = 0 mmap(NULL, 22076808, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f348548b000 mprotect(0x7f34865f1000, 2093056, PROT_NONE) = 0 mmap(0x7f34867f0000, 1167360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1165000) = 0x7f34867f0000 mmap(0x7f348690d000, 572808, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f348690d000 close(3) = 0 stat("/etc/sysconfig/64bit_strstr_via_64bit_strstr_sse2_unaligned", 0x7ffc8533fa00) = -1 ENOENT (No such file or directory) sched_get_priority_max(SCHED_RR) = 99 sched_get_priority_min(SCHED_RR) = 1 munmap(0x7f3487dab000, 54538) = 0 open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3 read(3, "0-95\n", 8192) = 5 close(3) = 0 sched_getaffinity(252, 
16, 0x1b5d900) = -1 EINVAL (Invalid argument) sched_getaffinity(252, 131072, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 65536, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 32768, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 16384, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 8192, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 4096, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 2048, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 1024, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 512, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 512 sched_getaffinity(252, 256, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 256 sched_getaffinity(252, 128, [24, 25, 26, 27, 72, 73, 74, 75]) = 128 sched_getaffinity(252, 64, [24, 25, 26, 27, 72, 73, 74, 75]) = 64 sched_getaffinity(252, 32, [24, 25, 26, 27, 72, 73, 74, 75]) = 32 sched_getaffinity(252, 16, 0x1b5d900) = -1 EINVAL (Invalid argument) sched_getaffinity(252, 24, 0x1b5d900) = -1 EINVAL (Invalid argument) clock_gettime(CLOCK_MONOTONIC_RAW, {tv_sec=9951, tv_nsec=807665762}) = 0 open("/proc/sys/vm/mmap_min_addr", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "4096\n", 1024) = 5 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/cpuinfo", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "processor\t: 0\nvendor_id\t: Genuin"..., 1024) = 1024 read(3, "x512cd avx512bw avx512vl xsaveop"..., 1024) = 1024 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/self/maps", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "00400000-0048d000 r-xp 00000000 "..., 1024) = 1024 read(3, "r/lib64/libc-2.17.so\n7f3486d5b00"..., 1024) = 1024 read(3, "8727e000 rw-p 00101000 103:01 32"..., 1024) = 1024 read(3, "pthread-2.17.so\n7f34877a1000-7f3"..., 1024) = 1024 read(3, " /usr/lib64/ld-2.17.so\n7f"..., 1024) = 419 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 futex(0x7f348690e034, FUTEX_WAKE_PRIVATE, 2147483647) = 0 statfs("/dev/shm/", {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=49324841, f_bfree=49324831, f_bavail=49324831, f_files=49324841, f_ffree=49324830, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV}) = 0 futex(0x7f3487bad310, FUTEX_WAKE_PRIVATE, 2147483647) = 0 open("/dev/shm/cuda_injection_path_shm", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/root/.nv/nvidia-application-profile-globals-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/root/.nv/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/root/.nv/nvidia-application-profiles-rc.d", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/nvidia/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/nvidia/nvidia-application-profiles-rc.d/", O_RDONLY) = 3 fstat(3, {st_mode=S_IFDIR|0555, st_size=60, ...}) = 0 close(3) = 0 openat(AT_FDCWD, "/etc/nvidia/nvidia-application-profiles-rc.d/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 getdents(3, /* 3 entries */, 32768) = 88 getdents(3, /* 0 entries */, 32768) = 0 close(3) = 0 
open("/etc/nvidia/nvidia-application-profiles-rc.d//10-container.conf", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0555, st_size=139, ...}) = 0 fstat(3, {st_mode=S_IFREG|0555, st_size=139, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "{ \"profiles\": [{\"name\": \"_contai"..., 4096) = 139 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/usr/share/nvidia/nvidia-application-profiles-450.80.02-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/nvidia/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory) geteuid() = 0 socket(AF_UNIX, SOCK_SEQPACKET|SOCK_CLOEXEC, 0) = 3 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 connect(3, {sa_family=AF_UNIX, sun_path="/tmp/nvidia-mps/control"}, 26) = -1 ENOENT (No such file or directory) close(3) = 0 lstat("/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 lstat("/proc/self", {st_mode=S_IFLNK|0777, st_size=0, ...}) = 0 readlink("/proc/self", "252", 4095) = 3 lstat("/proc/252", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 lstat("/proc/252/exe", {st_mode=S_IFLNK|0777, st_size=0, ...}) = 0 readlink("/proc/252/exe", "/mnt/tools/samples/0_Simple/mat"..., 4095) = 48 lstat("/mnt", {st_mode=S_IFDIR|0755, st_size=20, ...}) = 0 lstat("/mnt/tools", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples/0_Simple", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples/0_Simple/matrixMul", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples/0_Simple/matrixMul/matrixMul", {st_mode=S_IFREG|0755, st_size=771874, ...}) = 0 mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487d98000 open("/proc/modules", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "iptable_raw 12678 1 - Live 0xfff"..., 1024) = 1024 read(3, "link 36354 0 - Live 0xffffffffc1"..., 1024) = 1024 read(3, "00 (OE)\nrdma_cm 60234 1 rdma_ucm"..., 1024) = 1024 read(3, "ffffffc42ef000\nkvm 586948 1 kvm_"..., 1024) = 1024 read(3, "1086 0 - Live 0xffffffffc0216000"..., 1024) = 1024 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/devices", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "Character devices:\n 1 mem\n 4 /"..., 1024) = 797 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/driver/nvidia/params", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=753, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "ResmanDebugLevel: 4294967295\nRmL"..., 4096) = 753 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/dev/nvidiactl", O_RDWR) = 3 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc8533f770) = 0 open("/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 4 read(4, "8000000\n", 99) = 8 close(4) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc8533f770) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xca, 0x4), 0x7f3486997a20) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0xa00), 0x7f3486997020) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc8533f890) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc8533f810) = 0 ioctl(3, 
_IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc8533f880) = 0 close(3) = 0 open("/proc/modules", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "iptable_raw 12678 1 - Live 0xfff"..., 1024) = 1024 read(3, "link 36354 0 - Live 0xffffffffc1"..., 1024) = 1024 read(3, "00 (OE)\nrdma_cm 60234 1 rdma_ucm"..., 1024) = 1024 read(3, "ffffffc42ef000\nkvm 586948 1 kvm_"..., 1024) = 1024 read(3, "1086 0 - Live 0xffffffffc0216000"..., 1024) = 1024 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/devices", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "Character devices:\n 1 mem\n 4 /"..., 1024) = 797 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/driver/nvidia/params", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=753, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "ResmanDebugLevel: 4294967295\nRmL"..., 4096) = 753 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/dev/nvidiactl", O_RDWR) = 3 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc8533fd10) = 0 open("/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 4 read(4, "8000000\n", 99) = 8 close(4) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc8533fd10) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xca, 0x4), 0x7f3486997a20) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0xa00), 0x7f3486997020) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc8533fe30) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc8533fda0) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc8533fda0) = 0 open("/proc/self/status", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(4, "Name:\tmatrixMul\nUmask:\t0022\nStat"..., 1024) = 1024 read(4, "0000,00000000,00000000,00000000,"..., 1024) = 238 close(4) = 0 munmap(0x7f3487dcc000, 4096) = 0 openat(AT_FDCWD, "/sys/devices/system/node", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4 getdents(4, /* 11 entries */, 32768) = 344 open("/sys/devices/system/node/node0/cpumap", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(5, "00000000,00000000,00000000,00000"..., 4096) = 63 close(5) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/sys/devices/system/node/node1/cpumap", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(5, "00000000,00000000,00000000,00000"..., 4096) = 63 close(5) = 0 munmap(0x7f3487dcc000, 4096) = 0 getdents(4, /* 0 entries */, 32768) = 0 close(4) = 0 futex(0x7f34869124b8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 get_mempolicy([MPOL_DEFAULT], [000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000], 1024, NULL, 0) = 0 open("/proc/modules", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0444, 
st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000
read(4, "iptable_raw 12678 1 - Live 0xfff"..., 1024) = 1024
close(4) = 0
munmap(0x7f3487dcc000, 4096) = 0
open("/proc/devices", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000
read(4, "Character devices:\n 1 mem\n 4 /"..., 1024) = 797
close(4) = 0
munmap(0x7f3487dcc000, 4096) = 0
stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(229, 0), ...}) = 0
stat("/dev/nvidia-uvm-tools", {st_mode=S_IFCHR|0666, st_rdev=makedev(229, 1), ...}) = 0
open("/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EPERM (Operation not permitted)
open("/dev/nvidia-uvm", O_RDWR) = -1 EPERM (Operation not permitted)
ioctl(-1, _IOC(0, 0, 0x2, 0x3000), 0) = -1 EBADF (Bad file descriptor)
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc8533fe70) = 0
close(3) = 0
munmap(0x7f3487d98000, 135168) = 0
munmap(0x7f348548b000, 22076808) = 0
futex(0x691790, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "CUDA error at ../../common/inc/h"..., 112CUDA error at ../../common/inc/helper_cuda.h:708 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
) = 112
write(1, "[Matrix Multiply Using CUDA] - S"..., 43[Matrix Multiply Using CUDA] - Starting...
) = 43
exit_group(1) = ?
+++ exited with 1 +++
command terminated with exit code 1
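The trace narrows the failure down: stat() on /dev/nvidia-uvm succeeds (the node exists with major 229), but open() returns EPERM, which is exactly what a devices-cgroup denial looks like. A minimal probe, independent of CUDA, that can be run via kubectl exec in the affected container (a sketch; the "blocked by the devices cgroup" message is my interpretation, not kernel output):
for dev in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-modeset; do
  # try to open each device node read-write in a subshell and report the result
  if ( : <> "$dev" ) 2>/dev/null; then
    echo "$dev: open OK"
  else
    echo "$dev: open FAILED (likely blocked by the devices cgroup)"
  fi
done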
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The output of nvidia-smi -a on your host
root@k8s-t4-node:~$ nvidia-smi
Thu Jan 14 14:16:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3D:00.0 Off | 0 |
| N/A 27C P8 10W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:3E:00.0 Off | 0 |
| N/A 29C P8 10W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:40:00.0 Off | 0 |
| N/A 27C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:41:00.0 Off | 0 |
| N/A 28C P8 11W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla T4 Off | 00000000:B1:00.0 Off | 0 |
| N/A 28C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla T4 Off | 00000000:B2:00.0 Off | 0 |
| N/A 28C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla T4 Off | 00000000:B4:00.0 Off | 0 |
| N/A 28C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla T4 Off | 00000000:B5:00.0 Off | 0 |
| N/A 27C P8 10W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
root@k8s-t4-node:~$ cat /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
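Since "default-runtime": "nvidia" routes every container through nvidia-container-runtime, it is worth confirming Docker actually re-read this file after the reboot (standard docker info template fields; output omitted as it is node-specific):
root@k8s-t4-node:~$ docker info --format 'default runtime: {{.DefaultRuntime}}'
root@k8s-t4-node:~$ docker info --format '{{json .Runtimes}}'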
- [ ] The k8s-device-plugin container logs
root@k8s-master-node$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-5dx7q
2021/01/14 03:38:56 Loading NVML
2021/01/14 03:38:56 Starting FS watcher.
2021/01/14 03:38:56 Starting OS watcher.
2021/01/14 03:38:56 Retreiving plugins.
2021/01/14 03:38:56 Starting GRPC server for 'nvidia.com/gpu'
2021/01/14 03:38:56 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/01/14 03:38:56 Registered device plugin for 'nvidia.com/gpu' with Kubelet
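The plugin registers cleanly after the reboot, so it may be more telling to look at kubelet's device-manager checkpoint, which records which device IDs were allocated to which pods and is what kubelet relies on when it restarts containers (default kubelet root assumed; jq used as elsewhere in this report):
root@k8s-t4-node:~$ ls -l /var/lib/kubelet/device-plugins/
root@k8s-t4-node:~$ jq . /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint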
- [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from docker version
root@k8s-t4-node:~$ docker version
Client: Docker Engine - Community
Version: 20.10.2
API version: 1.41
Go version: go1.13.15
Git commit: 2291f61
Built: Mon Dec 28 16:17:48 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.2
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 8891c58
Built: Mon Dec 28 16:16:13 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
nvidia:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- [ ] Docker command, image and tag used
- [ ] Kernel version from uname -a
root@k8s-t4-node:~$ uname -a
Linux k8s-t4-node 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@k8s-t4-node:~$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.6.1810 (Core)
Release: 7.6.1810
Codename: Core
- [ ] Any relevant kernel output lines from dmesg
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
root@k8s-t4-node:~$ rpm -qa | grep nvidia
libnvidia-container1-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
nvidia-container-toolkit-1.4.0-2.x86_64
libnvidia-container-tools-1.3.1-1.x86_64
root@k8s-t4-node:~$ rpm -qa | grep docker-ce
docker-ce-rootless-extras-20.10.2-3.el7.x86_64
docker-ce-cli-20.10.2-3.el7.x86_64
docker-ce-20.10.2-3.el7.x86_64
root@k8s-t4-node:~$ rpm -qa | grep container
containerd.io-1.4.3-3.1.el7.x86_64
libnvidia-container1-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
nvidia-container-toolkit-1.4.0-2.x86_64
container-selinux-2.119.2-1.911c772.el7_8.noarch
libnvidia-container-tools-1.3.1-1.x86_64
- [ ] NVIDIA container library version from nvidia-container-cli -V
root@k8s-t4-node:~$ nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [ ] NVIDIA container library logs (see troubleshooting)
16, 0x1b5d900) = -1 EINVAL (Invalid argument) sched_getaffinity(252, 131072, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 65536, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 32768, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 16384, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 8192, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 4096, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 2048, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 1024, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 640 sched_getaffinity(252, 512, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 512 sched_getaffinity(252, 256, [24, 25, 26, 27, 72, 73, 74, 75, ...]) = 256 sched_getaffinity(252, 128, [24, 25, 26, 27, 72, 73, 74, 75]) = 128 sched_getaffinity(252, 64, [24, 25, 26, 27, 72, 73, 74, 75]) = 64 sched_getaffinity(252, 32, [24, 25, 26, 27, 72, 73, 74, 75]) = 32 sched_getaffinity(252, 16, 0x1b5d900) = -1 EINVAL (Invalid argument) sched_getaffinity(252, 24, 0x1b5d900) = -1 EINVAL (Invalid argument) clock_gettime(CLOCK_MONOTONIC_RAW, {tv_sec=9951, tv_nsec=807665762}) = 0 open("/proc/sys/vm/mmap_min_addr", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "4096\n", 1024) = 5 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/cpuinfo", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "processor\t: 0\nvendor_id\t: Genuin"..., 1024) = 1024 read(3, "x512cd avx512bw avx512vl xsaveop"..., 1024) = 1024 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/self/maps", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "00400000-0048d000 r-xp 00000000 "..., 1024) = 1024 read(3, "r/lib64/libc-2.17.so\n7f3486d5b00"..., 1024) = 1024 read(3, "8727e000 rw-p 00101000 103:01 32"..., 1024) = 1024 read(3, "pthread-2.17.so\n7f34877a1000-7f3"..., 1024) = 1024 read(3, " /usr/lib64/ld-2.17.so\n7f"..., 1024) = 419 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 futex(0x7f348690e034, FUTEX_WAKE_PRIVATE, 2147483647) = 0 statfs("/dev/shm/", {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=49324841, f_bfree=49324831, f_bavail=49324831, f_files=49324841, f_ffree=49324830, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV}) = 0 futex(0x7f3487bad310, FUTEX_WAKE_PRIVATE, 2147483647) = 0 open("/dev/shm/cuda_injection_path_shm", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/root/.nv/nvidia-application-profile-globals-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/root/.nv/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/root/.nv/nvidia-application-profiles-rc.d", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/nvidia/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/nvidia/nvidia-application-profiles-rc.d/", O_RDONLY) = 3 fstat(3, {st_mode=S_IFDIR|0555, st_size=60, ...}) = 0 close(3) = 0 openat(AT_FDCWD, "/etc/nvidia/nvidia-application-profiles-rc.d/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 getdents(3, /* 3 entries */, 32768) = 88 getdents(3, /* 0 entries */, 32768) = 0 close(3) = 0 
open("/etc/nvidia/nvidia-application-profiles-rc.d//10-container.conf", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0555, st_size=139, ...}) = 0 fstat(3, {st_mode=S_IFREG|0555, st_size=139, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "{ \"profiles\": [{\"name\": \"_contai"..., 4096) = 139 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/usr/share/nvidia/nvidia-application-profiles-450.80.02-rc", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/nvidia/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory) geteuid() = 0 socket(AF_UNIX, SOCK_SEQPACKET|SOCK_CLOEXEC, 0) = 3 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 connect(3, {sa_family=AF_UNIX, sun_path="/tmp/nvidia-mps/control"}, 26) = -1 ENOENT (No such file or directory) close(3) = 0 lstat("/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 lstat("/proc/self", {st_mode=S_IFLNK|0777, st_size=0, ...}) = 0 readlink("/proc/self", "252", 4095) = 3 lstat("/proc/252", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 lstat("/proc/252/exe", {st_mode=S_IFLNK|0777, st_size=0, ...}) = 0 readlink("/proc/252/exe", "/mnt/tools/samples/0_Simple/mat"..., 4095) = 48 lstat("/mnt", {st_mode=S_IFDIR|0755, st_size=20, ...}) = 0 lstat("/mnt/tools", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples/0_Simple", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples/0_Simple/matrixMul", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/mnt/tools/samples/0_Simple/matrixMul/matrixMul", {st_mode=S_IFREG|0755, st_size=771874, ...}) = 0 mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487d98000 open("/proc/modules", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "iptable_raw 12678 1 - Live 0xfff"..., 1024) = 1024 read(3, "link 36354 0 - Live 0xffffffffc1"..., 1024) = 1024 read(3, "00 (OE)\nrdma_cm 60234 1 rdma_ucm"..., 1024) = 1024 read(3, "ffffffc42ef000\nkvm 586948 1 kvm_"..., 1024) = 1024 read(3, "1086 0 - Live 0xffffffffc0216000"..., 1024) = 1024 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/devices", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "Character devices:\n 1 mem\n 4 /"..., 1024) = 797 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/driver/nvidia/params", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=753, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "ResmanDebugLevel: 4294967295\nRmL"..., 4096) = 753 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/dev/nvidiactl", O_RDWR) = 3 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc8533f770) = 0 open("/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 4 read(4, "8000000\n", 99) = 8 close(4) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc8533f770) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xca, 0x4), 0x7f3486997a20) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0xa00), 0x7f3486997020) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc8533f890) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc8533f810) = 0 ioctl(3, 
_IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc8533f880) = 0 close(3) = 0 open("/proc/modules", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "iptable_raw 12678 1 - Live 0xfff"..., 1024) = 1024 read(3, "link 36354 0 - Live 0xffffffffc1"..., 1024) = 1024 read(3, "00 (OE)\nrdma_cm 60234 1 rdma_ucm"..., 1024) = 1024 read(3, "ffffffc42ef000\nkvm 586948 1 kvm_"..., 1024) = 1024 read(3, "1086 0 - Live 0xffffffffc0216000"..., 1024) = 1024 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/devices", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "Character devices:\n 1 mem\n 4 /"..., 1024) = 797 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/driver/nvidia/params", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=753, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(3, "ResmanDebugLevel: 4294967295\nRmL"..., 4096) = 753 close(3) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/dev/nvidiactl", O_RDWR) = 3 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc8533fd10) = 0 open("/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 4 read(4, "8000000\n", 99) = 8 close(4) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc8533fd10) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xca, 0x4), 0x7f3486997a20) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0xa00), 0x7f3486997020) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc8533fe30) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc8533fda0) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc8533fda0) = 0 open("/proc/self/status", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(4, "Name:\tmatrixMul\nUmask:\t0022\nStat"..., 1024) = 1024 read(4, "0000,00000000,00000000,00000000,"..., 1024) = 238 close(4) = 0 munmap(0x7f3487dcc000, 4096) = 0 openat(AT_FDCWD, "/sys/devices/system/node", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4 getdents(4, /* 11 entries */, 32768) = 344 open("/sys/devices/system/node/node0/cpumap", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(5, "00000000,00000000,00000000,00000"..., 4096) = 63 close(5) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/sys/devices/system/node/node1/cpumap", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(5, "00000000,00000000,00000000,00000"..., 4096) = 63 close(5) = 0 munmap(0x7f3487dcc000, 4096) = 0 getdents(4, /* 0 entries */, 32768) = 0 close(4) = 0 futex(0x7f34869124b8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 get_mempolicy([MPOL_DEFAULT], [000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000, 000000000000000000], 1024, NULL, 0) = 0 open("/proc/modules", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0444, 
st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(4, "iptable_raw 12678 1 - Live 0xfff"..., 1024) = 1024 close(4) = 0 munmap(0x7f3487dcc000, 4096) = 0 open("/proc/devices", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3487dcc000 read(4, "Character devices:\n 1 mem\n 4 /"..., 1024) = 797 close(4) = 0 munmap(0x7f3487dcc000, 4096) = 0 stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(229, 0), ...}) = 0 stat("/dev/nvidia-uvm-tools", {st_mode=S_IFCHR|0666, st_rdev=makedev(229, 1), ...}) = 0 open("/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EPERM (Operation not permitted) open("/dev/nvidia-uvm", O_RDWR) = -1 EPERM (Operation not permitted) ioctl(-1, _IOC(0, 0, 0x2, 0x3000), 0) = -1 EBADF (Bad file descriptor) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc8533fe70) = 0 close(3) = 0 munmap(0x7f3487d98000, 135168) = 0 munmap(0x7f348548b000, 22076808) = 0 futex(0x691790, FUTEX_WAKE_PRIVATE, 2147483647) = 0 write(2, "CUDA error at ../../common/inc/h"..., 112CUDA error at ../../common/inc/helper_cuda.h:708 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)" ) = 112 write(1, "[Matrix Multiply Using CUDA] - S"..., 43[Matrix Multiply Using CUDA] - Starting... ) = 43 exit_group(1) = ? +++ exited with 1 +++ command terminated with exit code 1
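The key lines in the trace are the stat()/open() pair on /dev/nvidia-uvm: the device node exists (stat succeeds), but open() returns EPERM, which is consistent with the container's device cgroup not whitelisting the 229:* character devices (this issue is tagged /kind cgroup). Below is a minimal Go sketch of the same probe — a hypothetical diagnostic, not part of the CUDA samples — that reproduces the check the CUDA runtime performs above:
package main

import (
	"fmt"
	"os"
)

// Probe the NVIDIA device nodes the way the CUDA runtime does in the strace
// above: stat() first, then open() for read/write. In the failing pods the
// stat succeeds but the open fails with "operation not permitted" (EPERM).
func main() {
	for _, path := range []string{"/dev/nvidia-uvm", "/dev/nvidia-uvm-tools", "/dev/nvidia-modeset"} {
		if _, err := os.Stat(path); err != nil {
			fmt.Printf("%s: stat failed: %v\n", path, err)
			continue
		}
		f, err := os.OpenFile(path, os.O_RDWR, 0)
		if err != nil {
			// This is the EPERM seen in the strace of the failing container.
			fmt.Printf("%s: open failed: %v\n", path, err)
			continue
		}
		fmt.Printf("%s: ok\n", path)
		f.Close()
	}
}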
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The output of
nvidia-smi -a
on your host
root@k8s-t4-node:~$ nvidia-smi
Thu Jan 14 14:16:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   27C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:3E:00.0 Off |                    0 |
| N/A   29C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:40:00.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:41:00.0 Off |                    0 |
| N/A   28C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla T4            Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla T4            Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla T4            Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla T4            Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   27C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- [ ] Your docker configuration file (e.g:
/etc/docker/daemon.json
)
root@k8s-t4-node:~$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
- [ ] The k8s-device-plugin container logs
root@k8s-master-node$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-5dx7q
2021/01/14 03:38:56 Loading NVML
2021/01/14 03:38:56 Starting FS watcher.
2021/01/14 03:38:56 Starting OS watcher.
2021/01/14 03:38:56 Retreiving plugins.
2021/01/14 03:38:56 Starting GRPC server for 'nvidia.com/gpu'
2021/01/14 03:38:56 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/01/14 03:38:56 Registered device plugin for 'nvidia.com/gpu' with Kubelet
- [ ] The kubelet logs on the node (e.g:
sudo journalctl -r -u kubelet
)
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from
docker version
root@k8s-t4-node:~$ docker version
Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:48 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:16:13 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 nvidia:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
- [ ] Docker command, image and tag used
- [ ] Kernel version from
uname -a
root@k8s-t4-node:~$ uname -a
Linux k8s-t4-node 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@k8s-t4-node:~$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.6.1810 (Core)
Release:        7.6.1810
Codename:       Core
- [ ] Any relevant kernel output lines from
dmesg
- [ ] NVIDIA packages version from
dpkg -l '*nvidia*'
or
rpm -qa '*nvidia*'
root@k8s-t4-node:~$ rpm -qa | grep nvidia
libnvidia-container1-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
nvidia-container-toolkit-1.4.0-2.x86_64
libnvidia-container-tools-1.3.1-1.x86_64
root@k8s-t4-node:~$ rpm -qa | grep docker-ce
docker-ce-rootless-extras-20.10.2-3.el7.x86_64
docker-ce-cli-20.10.2-3.el7.x86_64
docker-ce-20.10.2-3.el7.x86_64
root@k8s-t4-node:~$ rpm -qa | grep container
containerd.io-1.4.3-3.1.el7.x86_64
libnvidia-container1-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
nvidia-container-toolkit-1.4.0-2.x86_64
container-selinux-2.119.2-1.911c772.el7_8.noarch
libnvidia-container-tools-1.3.1-1.x86_64
- [ ] NVIDIA container library version from
nvidia-container-cli -V
root@k8s-t4-node:~$ nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [ ] NVIDIA container library logs (see troubleshooting)
@klueska @RenaudWasTaken @nvjmayo
This issue can also be reproduced on P40 GPUs.
After debugging, I found that the nvidia-uvm/nvidia-modeset/nvidia-uvm-tools device nodes are only created after Docker has started, so the nvidia-device-plugin container cannot pick up the nvidia-uvm/nvidia-modeset/nvidia-uvm-tools devices.
-
After the server reboots, the nvidia-device-plugin container cannot reload the nvidia-uvm/nvidia-modeset/nvidia-uvm-tools devices
root@k8s-t4-node:~$ ls -t -l /dev | grep nvidia
crw-rw-rw- 1 root root 195, 254 Jan 14 14:59 nvidia-modeset
crw-rw-rw- 1 root root 229,   0 Jan 14 14:59 nvidia-uvm
crw-rw-rw- 1 root root 229,   1 Jan 14 14:59 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   7 Jan 14 14:59 nvidia7
crw-rw-rw- 1 root root 195,   6 Jan 14 14:59 nvidia6
crw-rw-rw- 1 root root 195,   5 Jan 14 14:59 nvidia5
crw-rw-rw- 1 root root 195,   4 Jan 14 14:59 nvidia4
crw-rw-rw- 1 root root 195,   3 Jan 14 14:59 nvidia3
crw-rw-rw- 1 root root 195,   2 Jan 14 14:59 nvidia2
crw-rw-rw- 1 root root 195,   1 Jan 14 14:59 nvidia1
crw-rw-rw- 1 root root 195,   0 Jan 14 14:59 nvidia0
drwxr-xr-x 2 root root        80 Jan 14 14:59 nvidia-caps
crw-rw-rw- 1 root root 195, 255 Jan 14 14:59 nvidiactl
root@k8s-t4-node:~$ docker ps | grep nvidia-device-plugin | grep -v pause
a0dcf15817e5 b48f7a3a6afc "nvidia-device-plugi…" 9 minutes ago Up 9 minutes k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-5dx7q_kube-system_4d6c2170-987f-4c05-bafe-56ad17623b23_2
root@k8s-t4-node:~$ docker exec -i a0dcf15817e5 /bin/bash -c "ls -t -l /dev | grep nvidia"
drwxr-xr-x 2 root root        80 Jan 14 06:59 nvidia-caps
crw-rw-rw- 1 root root 195,   7 Jan 14 06:59 nvidia7
crw-rw-rw- 1 root root 195,   6 Jan 14 06:59 nvidia6
crw-rw-rw- 1 root root 195,   5 Jan 14 06:59 nvidia5
crw-rw-rw- 1 root root 195,   4 Jan 14 06:59 nvidia4
crw-rw-rw- 1 root root 195,   3 Jan 14 06:59 nvidia3
crw-rw-rw- 1 root root 195,   2 Jan 14 06:59 nvidia2
crw-rw-rw- 1 root root 195,   1 Jan 14 06:59 nvidia1
crw-rw-rw- 1 root root 195,   0 Jan 14 06:59 nvidia0
crw-rw-rw- 1 root root 195, 255 Jan 14 06:59 nvidiactl
-
Added the log
log.Printf("Allocate responses %+v", responses)
before Allocate returns, recompiled the k8s-device-plugin image, and captured the problem in action (a simplified sketch of this patched Allocate appears after the docker.service unit below):
root@k8s-master-node$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-5dx7q
2021/01/14 06:59:39 Loading NVML
2021/01/14 06:59:39 Starting FS watcher.
2021/01/14 06:59:39 Starting OS watcher.
2021/01/14 06:59:39 Retreiving plugins.
2021/01/14 06:59:39 Starting GRPC server for 'nvidia.com/gpu'
2021/01/14 06:59:39 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/01/14 06:59:39 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2021/01/14 06:59:51 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-7c89bc26-07ca-fd1c-a2c9-11b4033de5e3,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia0,HostPath:/dev/nvidia0,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
2021/01/14 06:59:51 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-aa2cef33-2b58-a1ce-fb6a-e8b9808663ae,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia1,HostPath:/dev/nvidia1,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
2021/01/14 06:59:51 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-1b0fda1c-d45e-c6cc-6ee2-5dfddaf28f7a,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia2,HostPath:/dev/nvidia2,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
2021/01/14 06:59:51 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-f46214f5-c14e-8420-48fc-a78a84339d2c,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia3,HostPath:/dev/nvidia3,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
2021/01/14 06:59:51 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-5678bde6-66f9-b29c-a344-b421c9896d0b,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia4,HostPath:/dev/nvidia4,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
2021/01/14 06:59:51 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-7e66223d-a4b8-a093-4b5c-bf359f825046,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia7,HostPath:/dev/nvidia7,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
2021/01/14 06:59:51 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-25f9a2c9-1819-8d46-52cc-c623eb770f66,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia5,HostPath:/dev/nvidia5,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
2021/01/14 07:00:36 Allocate responses {ContainerResponses:[&ContainerAllocateResponse{Envs:map[string]string{NVIDIA_VISIBLE_DEVICES: GPU-55aa3226-8a73-7e80-90e2-eb8394f17b8d,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/nvidiactl,HostPath:/dev/nvidiactl,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/nvidia6,HostPath:/dev/nvidia6,Permissions:rw,},},Annotations:map[string]string{},}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
-
The files (/dev/nvidia-modeset, /dev/nvidia-uvm, /dev/nvidia-uvm-tools) do not exist at that point, so these devices cannot be exposed to the container correctly when a pod requests a GPU.
- Adding the NVIDIA udev rules below and starting Docker after systemd-udev-trigger.service works around this issue
root@k8s-t4-node:~$ cat /etc/udev/rules.d/71-nvidia.rules
# Load and unload nvidia-modeset module
SUBSYSTEM=="module", ACTION=="add", DEVPATH=="/module/nvidia", RUN+="/usr/bin/nvidia-modprobe -m"
SUBSYSTEM=="module", ACTION=="remove", DEVPATH=="/module/nvidia", RUN+="/sbin/modprobe -r nvidia-modeset"
# Load and unload nvidia-drm module
SUBSYSTEM=="module", ACTION=="add", DEVPATH=="/module/nvidia", RUN+="/sbin/modprobe nvidia-drm"
SUBSYSTEM=="module", ACTION=="remove", DEVPATH=="/module/nvidia", RUN+="/sbin/modprobe -r nvidia-drm"
# Load and unload nvidia-uvm module
SUBSYSTEM=="module", ACTION=="add", DEVPATH=="/module/nvidia", RUN+="/usr/bin/nvidia-modprobe -u -c 1"
SUBSYSTEM=="module", ACTION=="remove", DEVPATH=="/module/nvidia", RUN+="/sbin/modprobe -r nvidia-uvm"
root@k8s-t4-node:~$ cat /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service systemd-udev-trigger.service
Wants=network-online.target
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd --storage-opt overlay2.override_kernel_check=1 --storage-driver=overlay2 --data-root=/data/docker_rt --live-restore --debug=false
MountFlags=slave
LimitNOFILE=1000000
LimitNPROC=100000
LimitCORE=102400000000
LimitSTACK=104857600
LimitSIGPENDING=600000
TimeoutStartSec=0
Restart=on-failure
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
[Install]
WantedBy=multi-user.target
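As promised above, here is a simplified Go sketch of where the debug log line sits. It assumes the v1beta1 device plugin API; the real plugin's Allocate is more involved (the type name and response construction here are illustrative only). It also makes the captured responses easier to read: only /dev/nvidiactl and the per-GPU /dev/nvidiaN nodes are ever listed as DeviceSpecs, while /dev/nvidia-uvm*, /dev/nvidia-modeset are left for libnvidia-container to inject.
package main

import (
	"context"
	"log"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// NvidiaDevicePlugin is a stand-in for the real plugin type; only the pieces
// needed to show where the debug line goes are reproduced here.
type NvidiaDevicePlugin struct{}

// Allocate builds one ContainerAllocateResponse per container request,
// carrying the NVIDIA_VISIBLE_DEVICES env var plus explicit DeviceSpecs when
// deployed in the compat-with-cpumanager (pass-device-specs) mode.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	responses := pluginapi.AllocateResponse{}
	for _, req := range reqs.ContainerRequests {
		response := pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				"NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
			},
			Devices: []*pluginapi.DeviceSpec{
				// Only /dev/nvidiactl and the per-GPU /dev/nvidiaN nodes appear here;
				// /dev/nvidia-uvm* and /dev/nvidia-modeset never do, matching the
				// captured responses above.
				{ContainerPath: "/dev/nvidiactl", HostPath: "/dev/nvidiactl", Permissions: "rw"},
			},
		}
		responses.ContainerResponses = append(responses.ContainerResponses, &response)
	}
	// The debug line added for this issue, emitted just before returning.
	log.Printf("Allocate responses %+v", responses)
	return &responses, nil
}

func main() {
	p := &NvidiaDevicePlugin{}
	_, _ = p.Allocate(context.Background(), &pluginapi.AllocateRequest{
		ContainerRequests: []*pluginapi.ContainerAllocateRequest{
			{DevicesIDs: []string{"GPU-aa2cef33-2b58-a1ce-fb6a-e8b9808663ae"}},
		},
	})
}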
Due to GPL reasons, the NVIDIA kernel driver is not allowed to create /dev
nodes itself; they must be created from user-space.
A separate binary called nvidia-modprobe
is shipped with the driver that can be invoked from user-space to create the necessary /dev
nodes on your behalf.
To create the /dev/nvidia-uvm*
ones, the relevant command is:
nvidia-modprobe -u -c=0
To create the /dev/nvidia-modeset
device, the relevant command is:
/usr/bin/nvidia-modprobe -m
To create the /dev/nvidia*
devices corresponding to actual GPUs, the relevant command is:
/usr/bin/nvidia-modprobe -c <gpu-minor>
The preferred method to invoke this program to create these /dev
nodes on your host is via nvidia-persistenced
. You can set this up as a service running on your system to ensure that all GPU devices are available at all times. Please see:
https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon
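To illustrate the commands above, a minimal boot-time helper might look like the following Go sketch. This is hypothetical and for illustration only — nvidia-persistenced, as linked above, is the recommended way to keep these nodes present — and it assumes 8 GPUs with minor numbers 0-7, as on the T4 node in this issue.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// run invokes nvidia-modprobe with the given arguments and logs any failure.
func run(args ...string) {
	out, err := exec.Command("/usr/bin/nvidia-modprobe", args...).CombinedOutput()
	if err != nil {
		log.Printf("nvidia-modprobe %v failed: %v (%s)", args, err, out)
	}
}

func main() {
	run("-u", "-c=0") // create /dev/nvidia-uvm and /dev/nvidia-uvm-tools
	run("-m")         // create /dev/nvidia-modeset
	// Create one /dev/nvidiaN node per GPU minor number.
	for minor := 0; minor < 8; minor++ {
		run("-c", fmt.Sprintf("%d", minor))
	}
}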
That said, I'm surprised that the device-plugin
doesn't see the /dev/nvidia-modeset
or /dev/nvidia-uvm*
devices because libnvidia-container
should make sure they are created and available inside any container that is launched by the nvidia-container-runtime
(which the device plugin definitely is).
Here is the relevant code: https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc.c#L205 https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc.c#L213
Can you show me your settings for libnvidia-container
under /etc/nvidia-container-runtime/config.toml
?
If you have load-kmods = false
, then this might explain things.
And if you have it set to that (and you want to keep it that way), then you will need to make sure the appropriate nvidia-modprobe
commands are called on your host system at bootup (either manually or via nvidia-persistenced
).
@klueska Thanks. This is the /etc/nvidia-container-runtime/config.toml
configuration file on my machine.
I now understand the cause of the problem; it can also be worked around with the /etc/udev/rules.d/71-nvidia.rules file shown above.
root@k8s-t4-node:~$ rpm -qa | grep nvidia
libnvidia-container1-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
nvidia-container-toolkit-1.4.0-2.x86_64
libnvidia-container-tools-1.3.1-1.x86_64
root@k8s-t4-node:~$ rpm -ql nvidia-container-toolkit-1.4.0-2.x86_64
/etc/nvidia-container-runtime/config.toml
/usr/bin/nvidia-container-toolkit
/usr/libexec/oci/hooks.d/oci-nvidia-hook
/usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
/usr/share/licenses/nvidia-container-toolkit-1.4.0
/usr/share/licenses/nvidia-container-toolkit-1.4.0/LICENSE
root@k8s-t4-node:~$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.