CUDA_MPS_ACTIVE_THREAD_PERCENTAGE has no effect on RTX 3090 and A100

Open MoFHeka opened this issue 2 years ago • 0 comments

I ran the following on the host machine:

export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
nvidia-cuda-mps-control -d

Then I created a container with:

docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0 -e CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 --gpus=0 --runtime=nvidia --ipc=host 25859ecc2950 /bin/bash

I then ran a simple but sufficiently long-running CUDA kernel. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE did not seem to make any difference: GPU-Util in nvidia-smi was still 100%. When I profiled the test kernel with nsys and ncu, the results did not change much either.

Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         851.14
    SM Frequency                                                             cycle/nsecond                           1.20
    Elapsed Cycles                                                                   cycle                     3141224135
    Memory [%]                                                                           %                          41.25
    SOL DRAM                                                                             %                          41.25
    Duration                                                                        second                           2.60
    SOL L1/TEX Cache                                                                     %                          21.88
    SOL L2 Cache                                                                         %                          15.29
    SM Active Cycles                                                                 cycle                  3130147158.79
    SM [%]                                                                               %                          53.52
    ---------------------------------------------------------------------- --------------- ------------------------------
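
To compare profiles taken at different CUDA_MPS_ACTIVE_THREAD_PERCENTAGE values, the table above can be parsed into numbers rather than read by eye. The following is a minimal Python sketch, not part of the original report: the names `parse_ncu_section` and `relative_change` are hypothetical, and the parser assumes the three-column text layout shown above (metric name, unit, value, separated by two or more spaces). The sample rows are copied from this report. As a sanity check it also estimates the duration from Elapsed Cycles and SM Frequency.

```python
import re

# Rows copied from the "GPU Speed Of Light" section in this report.
NCU_SECTION = """\
    DRAM Frequency                                                           cycle/usecond                         851.14
    SM Frequency                                                             cycle/nsecond                           1.20
    Elapsed Cycles                                                                   cycle                     3141224135
    Duration                                                                        second                           2.60
    SM Active Cycles                                                                 cycle                  3130147158.79
    SM [%]                                                                               %                          53.52
"""

# Assumed layout: metric name, 2+ spaces, unit, 2+ spaces, numeric value.
ROW = re.compile(
    r"^\s*(?P<name>\S.*?)\s{2,}(?P<unit>\S+)\s{2,}(?P<value>[-+]?\d[\d,]*\.?\d*)\s*$"
)


def parse_ncu_section(text):
    """Return {metric name: value as float} for every row that matches ROW."""
    metrics = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # separator lines and headings do not match and are skipped
            metrics[m.group("name")] = float(m.group("value").replace(",", ""))
    return metrics


def relative_change(before, after, name):
    """Fractional change of one metric between two parsed runs."""
    return (after[name] - before[name]) / before[name]


metrics = parse_ncu_section(NCU_SECTION)

# SM Frequency is in cycles per nanosecond, so cycles / frequency gives ns.
estimated_seconds = metrics["Elapsed Cycles"] / metrics["SM Frequency"] / 1e9

print(f"SM [%]: {metrics['SM [%]']}")
print(f"Duration: {metrics['Duration']} s (estimated from cycles: {estimated_seconds:.2f} s)")
```

Running the same parser on a profile captured without the thread-percentage limit and calling `relative_change(unlimited, limited, "Duration")` would quantify whether the limit changed the kernel's runtime at all; a value close to zero would match the behaviour described here.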

nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I1105 13:13:10.401861 21813 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I1105 13:13:10.401912 21813 nvc.c:346] using root /
I1105 13:13:10.401918 21813 nvc.c:347] using ldcache /etc/ld.so.cache
I1105 13:13:10.401925 21813 nvc.c:348] using unprivileged user 65534:65534
I1105 13:13:10.401943 21813 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1105 13:13:10.401986 21813 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I1105 13:13:10.406397 21814 nvc.c:274] loading kernel module nvidia
I1105 13:13:10.406620 21814 nvc.c:278] running mknod for /dev/nvidiactl
I1105 13:13:10.406655 21814 nvc.c:282] running mknod for /dev/nvidia0
I1105 13:13:10.406674 21814 nvc.c:282] running mknod for /dev/nvidia1
I1105 13:13:10.406693 21814 nvc.c:282] running mknod for /dev/nvidia2
I1105 13:13:10.406711 21814 nvc.c:282] running mknod for /dev/nvidia3
I1105 13:13:10.406730 21814 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I1105 13:13:10.412531 21814 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I1105 13:13:10.412656 21814 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I1105 13:13:10.415158 21814 nvc.c:292] loading kernel module nvidia_uvm
I1105 13:13:10.415189 21814 nvc.c:296] running mknod for /dev/nvidia-uvm
I1105 13:13:10.415267 21814 nvc.c:301] loading kernel module nvidia_modeset
I1105 13:13:10.415333 21814 nvc.c:305] running mknod for /dev/nvidia-modeset
I1105 13:13:10.415626 21815 driver.c:101] starting driver service
I1105 13:13:10.419781 21813 nvc_info.c:676] requesting driver information with ''
I1105 13:13:10.421494 21813 nvc_info.c:169] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.460.80
I1105 13:13:10.422016 21813 nvc_info.c:169] selecting /usr/lib64/libnvoptix.so.460.80
I1105 13:13:10.422399 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-tls.so.460.80
I1105 13:13:10.423001 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-rtcore.so.460.80
I1105 13:13:10.423586 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.460.80
I1105 13:13:10.423867 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-opticalflow.so.460.80
I1105 13:13:10.423973 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-opencl.so.460.80
I1105 13:13:10.424565 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-ngx.so.460.80
I1105 13:13:10.424589 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-ml.so.460.80
I1105 13:13:10.425298 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-ifr.so.460.80
I1105 13:13:10.425887 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-glvkspirv.so.460.80
I1105 13:13:10.426601 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-glsi.so.460.80
I1105 13:13:10.427363 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-glcore.so.460.80
I1105 13:13:10.428108 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-fbc.so.460.80
I1105 13:13:10.428492 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-encode.so.460.80
I1105 13:13:10.429032 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-eglcore.so.460.80
I1105 13:13:10.430351 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-compiler.so.460.80
I1105 13:13:10.431750 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-cfg.so.460.80
I1105 13:13:10.433088 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-cbl.so.460.80
I1105 13:13:10.433161 21813 nvc_info.c:169] selecting /usr/lib64/libnvidia-allocator.so.460.80
I1105 13:13:10.433760 21813 nvc_info.c:169] selecting /usr/lib64/libnvcuvid.so.460.80
I1105 13:13:10.433887 21813 nvc_info.c:169] selecting /usr/lib64/libcuda.so.460.80
I1105 13:13:10.435023 21813 nvc_info.c:169] selecting /usr/lib64/libGLX_nvidia.so.460.80
I1105 13:13:10.435109 21813 nvc_info.c:169] selecting /usr/lib64/libGLESv2_nvidia.so.460.80
I1105 13:13:10.435596 21813 nvc_info.c:169] selecting /usr/lib64/libGLESv1_CM_nvidia.so.460.80
I1105 13:13:10.436943 21813 nvc_info.c:169] selecting /usr/lib64/libEGL_nvidia.so.460.80
I1105 13:13:10.437539 21813 nvc_info.c:169] selecting /usr/lib/vdpau/libvdpau_nvidia.so.460.80
I1105 13:13:10.437784 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-tls.so.460.80
I1105 13:13:10.438539 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-ptxjitcompiler.so.460.80
I1105 13:13:10.438966 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-opticalflow.so.460.80
I1105 13:13:10.439555 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-opencl.so.460.80
I1105 13:13:10.440396 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-ml.so.460.80
I1105 13:13:10.441003 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-ifr.so.460.80
I1105 13:13:10.441472 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-glvkspirv.so.460.80
I1105 13:13:10.442260 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-glsi.so.460.80
I1105 13:13:10.443326 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-glcore.so.460.80
I1105 13:13:10.444131 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-fbc.so.460.80
I1105 13:13:10.444756 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-encode.so.460.80
I1105 13:13:10.445396 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-eglcore.so.460.80
I1105 13:13:10.445889 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-compiler.so.460.80
I1105 13:13:10.446684 21813 nvc_info.c:169] selecting /usr/lib/libnvidia-allocator.so.460.80
I1105 13:13:10.448093 21813 nvc_info.c:169] selecting /usr/lib/libnvcuvid.so.460.80
I1105 13:13:10.448683 21813 nvc_info.c:169] selecting /usr/lib/libcuda.so.460.80
I1105 13:13:10.450032 21813 nvc_info.c:169] selecting /usr/lib/libGLX_nvidia.so.460.80
I1105 13:13:10.451075 21813 nvc_info.c:169] selecting /usr/lib/libGLESv2_nvidia.so.460.80
I1105 13:13:10.451961 21813 nvc_info.c:169] selecting /usr/lib/libGLESv1_CM_nvidia.so.460.80
I1105 13:13:10.453183 21813 nvc_info.c:169] selecting /usr/lib/libEGL_nvidia.so.460.80
W1105 13:13:10.453198 21813 nvc_info.c:350] missing library libnvidia-nscq.so
W1105 13:13:10.453204 21813 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W1105 13:13:10.453212 21813 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W1105 13:13:10.453219 21813 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W1105 13:13:10.453224 21813 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W1105 13:13:10.453229 21813 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W1105 13:13:10.453235 21813 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W1105 13:13:10.453242 21813 nvc_info.c:354] missing compat32 library libnvoptix.so
W1105 13:13:10.453252 21813 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I1105 13:13:10.453421 21813 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I1105 13:13:10.453435 21813 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I1105 13:13:10.453448 21813 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I1105 13:13:10.453473 21813 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I1105 13:13:10.453487 21813 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
W1105 13:13:10.453509 21813 nvc_info.c:376] missing binary nv-fabricmanager
I1105 13:13:10.453528 21813 nvc_info.c:438] listing device /dev/nvidiactl
I1105 13:13:10.453533 21813 nvc_info.c:438] listing device /dev/nvidia-uvm
I1105 13:13:10.453543 21813 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I1105 13:13:10.453550 21813 nvc_info.c:438] listing device /dev/nvidia-modeset
W1105 13:13:10.453567 21813 nvc_info.c:321] missing ipc /var/run/nvidia-persistenced/socket
W1105 13:13:10.453581 21813 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
I1105 13:13:10.453594 21813 nvc_info.c:317] listing ipc /tmp/nvidia-mps
I1105 13:13:10.453599 21813 nvc_info.c:733] requesting device information with ''
I1105 13:13:10.459673 21813 nvc_info.c:623] listing device /dev/nvidia0 (GPU-171fb8c7-b97d-d9c8-c4e7-d1539afe88a7 at 00000000:3e:00.0)
I1105 13:13:10.465735 21813 nvc_info.c:623] listing device /dev/nvidia1 (GPU-13fdefa8-50e9-dfc1-8c93-5a3b7bc4fe01 at 00000000:40:00.0)
I1105 13:13:10.471867 21813 nvc_info.c:623] listing device /dev/nvidia2 (GPU-6f66fecd-a0a1-dff8-0383-8a3cf2b3a4a9 at 00000000:b1:00.0)
I1105 13:13:10.478100 21813 nvc_info.c:623] listing device /dev/nvidia3 (GPU-03fb9423-3d63-c9cd-77e5-7522f3509ee2 at 00000000:b5:00.0)
NVRM version:   460.80
CUDA version:   11.2

Device Index:   0
Device Minor:   0
Model:          GeForce RTX 3090
Brand:          GeForce
GPU UUID:       GPU-171fb8c7-b97d-d9c8-c4e7-d1539afe88a7
Bus Location:   00000000:3e:00.0
Architecture:   8.6

Device Index:   1
Device Minor:   1
Model:          GeForce RTX 3090
Brand:          GeForce
GPU UUID:       GPU-13fdefa8-50e9-dfc1-8c93-5a3b7bc4fe01
Bus Location:   00000000:40:00.0
Architecture:   8.6

Device Index:   2
Device Minor:   2
Model:          GeForce RTX 3090
Brand:          GeForce
GPU UUID:       GPU-6f66fecd-a0a1-dff8-0383-8a3cf2b3a4a9
Bus Location:   00000000:b1:00.0
Architecture:   8.6

Device Index:   3
Device Minor:   3
Model:          GeForce RTX 3090
Brand:          GeForce
GPU UUID:       GPU-03fb9423-3d63-c9cd-77e5-7522f3509ee2
Bus Location:   00000000:b5:00.0
Architecture:   8.6
I1105 13:13:10.478205 21813 nvc.c:423] shutting down library context
I1105 13:13:10.478912 21815 driver.c:163] terminating driver service
I1105 13:13:10.479252 21813 driver.c:203] driver service terminated successfully

uname -a

Linux dx-k8sarsenalgpu-42 4.19.159.el7.twl.x86_64+ #1 SMP Fri Dec 4 16:20:07 CST 2020 x86_64 x86_64 x86_64 GNU/Linux

docker version

Client: Docker Engine - Community
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 17:03:45 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:02:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.7
  GitCommit:        8fba4e9a7d01810a393d5d25a3621dc101981175
 nvidia:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

nvidia-container-cli -V

version: 1.4.0
build date: 2021-04-24T14:27+0000
build revision: 704a698b7a0ceec07a48e56c37365c741718c2df
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • [x] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • [x] Kernel version from uname -a
  • [x] Any relevant kernel output lines from dmesg
  • [x] Driver information from nvidia-smi -a
  • [x] Docker version from docker version
  • [x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [x] NVIDIA container library version from nvidia-container-cli -V
  • [x] NVIDIA container library logs (see troubleshooting)
  • [x] Docker command, image and tag used

MoFHeka, Nov 05 '21 13:11