software-layer
Allow Nvidia driver script to set LD_PRELOAD
Instance eessi-bot-mc-aws is configured to build for:
- architectures: `x86_64/generic`, `x86_64/intel/haswell`, `x86_64/intel/skylake_avx512`, `x86_64/amd/zen2`, `x86_64/amd/zen3`, `aarch64/generic`, `aarch64/neoverse_n1`, `aarch64/neoverse_v1`
- repositories: `eessi.io-2023.06-compat`, `eessi-hpc.org-2023.06-software`, `eessi-hpc.org-2023.06-compat`, `eessi.io-2023.06-software`

Instance boegel-bot-deucalion is configured to build for:
- architectures: `aarch64/a64fx`
- repositories: `eessi.io-2023.06-software`

Instance eessi-bot-mc-azure is configured to build for:
- architectures: `x86_64/amd/zen4`
- repositories: `eessi-hpc.org-2023.06-software`, `eessi-hpc.org-2023.06-compat`, `eessi.io-2023.06-software`, `eessi.io-2023.06-compat`
Example output:
[rocky@ip-172-31-27-81 software-layer]$ ./scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh --ld-preload --no-download
Found NVIDIA GPU driver version 545.23.08
Found host CUDA version 12.3
Using default list of libraries
Matched 48 CUDA Libraries
When attempting to use LD_PRELOAD we exclude anything related to graphics
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGL.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGL.so.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX_nvidia.so.0.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.0.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.
libwayland-server.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-egl-wayland.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.
libXNVCtrl.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-gtk3.so.545.23.08.
The recommended way to use LD_PRELOAD is to only use it when you need to:
export EESSI_GPU_LD_PRELOAD="/lib64/libcuda.so.1:/lib64/libcuda.so:/lib64/libcudadebugger.so.1:/lib64/libnvcuvid.so.1:/lib64/libnvcuvid.so:/lib64/libnvidia-cfg.so.1:/lib64/libnvidia-cfg.so:/lib64/libnvidia-eglcore.so.545.23.08:/lib64/libnvidia-encode.so.1:/lib64/libnvidia-encode.so:/lib64/libnvidia-glcore.so.545.23.08:/lib64/libnvidia-glsi.so.545.23.08:/lib64/libnvidia-glvkspirv.so.545.23.08:/lib64/libnvidia-gpucomp.so.545.23.08:/lib64/libnvidia-ml.so.1:/lib64/libnvidia-ml.so:/lib64/libnvidia-nvvm.so.4:/lib64/libnvidia-nvvm.so:/lib64/libnvidia-opencl.so.1:/lib64/libnvidia-opticalflow.so.1:/lib64/libnvidia-ptxjitcompiler.so.1:/lib64/libnvidia-ptxjitcompiler.so:/lib64/libnvidia-rtcore.so.545.23.08:/lib64/libnvidia-tls.so.545.23.08:/lib64/libnvoptix.so.1:/lib64/libOpenCL.so.1"
export EESSI_OVERRIDE_GPU_CHECK="1"
Then you can set LD_PRELOAD only when you want to run a GPU application, e.g.,
LD_PRELOAD="$EESSI_GPU_LD_PRELOAD" device_query
@ocaisa There are duplicate entries here: libcuda.so is a symlink to libcuda.so.1, so only one of them is needed.
This is resulting in about 400MB of preload:
{EESSI 2023.06} [rocky@ip-172-31-20-85 software-layer]$ IFS=':'; for path in $EESSI_GPU_LD_PRELOAD; do ls -lh $path; done; unset IFS
-rwxr-xr-x 1 root root 29M Nov 6 2023 /usr/lib64/libcuda.so.545.23.08
-rwxr-xr-x 1 root root 11M Nov 6 2023 /usr/lib64/libcudadebugger.so.545.23.08
-rwxr-xr-x 1 root root 9.6M Nov 6 2023 /usr/lib64/libnvcuvid.so.545.23.08
-rwxr-xr-x 1 root root 269K Nov 6 2023 /usr/lib64/libnvidia-cfg.so.545.23.08
-rwxr-xr-x 1 root root 566K Nov 6 2023 /usr/lib64/libnvidia-glsi.so.545.23.08
-rwxr-xr-x 1 root root 8.7M Nov 6 2023 /usr/lib64/libnvidia-glvkspirv.so.545.23.08
-rwxr-xr-x 1 root root 42M Nov 7 2023 /usr/lib64/libnvidia-gpucomp.so.545.23.08
-rwxr-xr-x 1 root root 1.9M Nov 6 2023 /usr/lib64/libnvidia-ml.so.545.23.08
-rwxr-xr-x 1 root root 83M Nov 7 2023 /usr/lib64/libnvidia-nvvm.so.545.23.08
-rwxr-xr-x 1 root root 24M Nov 6 2023 /usr/lib64/libnvidia-opencl.so.545.23.08
-rwxr-xr-x 1 root root 26M Nov 6 2023 /usr/lib64/libnvidia-ptxjitcompiler.so.545.23.08
-rwxr-xr-x 1 root root 103M Nov 7 2023 /usr/lib64/libnvidia-rtcore.so.545.23.08
-rwxr-xr-x 1 root root 19K Nov 6 2023 /usr/lib64/libnvidia-tls.so.545.23.08
-rwxr-xr-x 1 root root 58M Nov 7 2023 /usr/lib64/libnvoptix.so.545.23.08
-rwxr-xr-x 1 root root 131K Apr 12 2021 /usr/lib64/libOpenCL.so.1.0.0
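The two observations above (symlinked duplicates and the total size of the preload) can be checked with a short sketch like the one below. The helper names `dedupe_preload` and `preload_size_bytes` are hypothetical and are not part of `link_nvidia_host_libraries.sh`; the sketch also assumes GNU coreutils for `readlink -f` and `stat -c`.

```shell
# Hypothetical helpers for inspecting a colon-separated preload list.
# Not part of link_nvidia_host_libraries.sh; assumes GNU readlink and stat.

# Print the list with symlinks resolved and repeated files dropped.
dedupe_preload() (
    set -f                                   # no globbing while splitting
    IFS=':'
    out=""
    seen=":"
    for entry in $1; do
        [ -e "$entry" ] || continue          # skip empty or missing entries
        resolved=$(readlink -f -- "$entry") || continue
        case "$seen" in
            *":$resolved:"*) ;;              # same file already listed
            *) seen="$seen$resolved:"
               out="${out:+$out:}$resolved" ;;
        esac
    done
    printf '%s\n' "$out"
)

# Print the combined size in bytes of every entry in the list.
# Entries that resolve to the same file are counted each time they appear.
preload_size_bytes() (
    set -f
    IFS=':'
    total=0
    for entry in $1; do
        [ -e "$entry" ] || continue
        size=$(stat -L -c %s -- "$entry") || continue
        total=$((total + size))
    done
    printf '%s\n' "$total"
)

# Example use (commented out so that sourcing this file has no side effects):
# unique=$(dedupe_preload "$EESSI_GPU_LD_PRELOAD")
# echo "$(( $(preload_size_bytes "$unique") / 1024 / 1024 )) MiB"
```

Both functions use a subshell body, so the changes to `IFS` and the working variables do not leak into the calling shell.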
@boegel I've played with this a lot today and I'm happy with the functionality now:
{EESSI 2023.06} [rocky@ip-172-31-20-85 software-layer]$ ./scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh --no-download --ld-preload
Found host CUDA version 7.5
Found NVIDIA GPU driver version 545.23.08
Using default list of libraries
Matched 48 CUDA Libraries
When attempting to use LD_PRELOAD we exclude anything related to graphics
Match found for libcuda.so for CUDA compat libraries
Match found for libcudadebugger.so for CUDA compat libraries
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libEGL.so.1
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libEGL.so
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv1_CM.so.1
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv1_CM.so
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv2.so.2
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv2.so
libGLX.so.0 is NOT in the provided preload list, filtering /lib64/libGL.so.1
libGLX.so.0 is NOT in the provided preload list, filtering /lib64/libGL.so
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX_nvidia.so.0
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.0
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so
libwayland-server.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-egl-wayland.so.1
libnvcuvid.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-encode.so.1
libnvcuvid.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-encode.so
libGL.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.1
libGL.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so
libXNVCtrl.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-gtk3.so.545.23.08
Match found for libnvidia-nvvm.so for CUDA compat libraries
libnvcuvid.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-opticalflow.so.1
Match found for libnvidia-ptxjitcompiler.so for CUDA compat libraries
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libOpenGL.so.0
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libOpenGL.so
The recommended way to use LD_PRELOAD is to only use it when you need to.
A minimal preload which should work in most cases:
export EESSI_GPU_COMPAT_LD_PRELOAD="/usr/lib64/libcuda.so.545.23.08:/usr/lib64/libcudadebugger.so.545.23.08:/usr/lib64/libnvidia-nvvm.so.545.23.08:/usr/lib64/libnvidia-ptxjitcompiler.so.545.23.08"
A corner-case full preload (which is hard on memory) for exceptional use:
export EESSI_GPU_LD_PRELOAD="/usr/lib64/libcuda.so.545.23.08:/usr/lib64/libcudadebugger.so.545.23.08:/usr/lib64/libEGL_nvidia.so.545.23.08:/usr/lib64/libGLdispatch.so.0.0.0:/usr/lib64/libGLESv1_CM_nvidia.so.545.23.08:/usr/lib64/libGLESv2_nvidia.so.545.23.08:/usr/lib64/libnvcuvid.so.545.23.08:/usr/lib64/libnvidia-cfg.so.545.23.08:/usr/lib64/libnvidia-eglcore.so.545.23.08:/usr/lib64/libnvidia-glcore.so.545.23.08:/usr/lib64/libnvidia-glsi.so.545.23.08:/usr/lib64/libnvidia-glvkspirv.so.545.23.08:/usr/lib64/libnvidia-gpucomp.so.545.23.08:/usr/lib64/libnvidia-ml.so.545.23.08:/usr/lib64/libnvidia-nvvm.so.545.23.08:/usr/lib64/libnvidia-opencl.so.545.23.08:/usr/lib64/libnvidia-ptxjitcompiler.so.545.23.08:/usr/lib64/libnvidia-rtcore.so.545.23.08:/usr/lib64/libnvidia-tls.so.545.23.08:/usr/lib64/libnvoptix.so.545.23.08:/usr/lib64/libOpenCL.so.1.0.0"
export EESSI_OVERRIDE_GPU_CHECK="1"
Then you can set LD_PRELOAD only when you want to run a GPU application, e.g.,
LD_PRELOAD="$EESSI_GPU_COMPAT_LD_PRELOAD" device_query
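For repeated use, the per-command pattern recommended in the script output could be wrapped in a small shell function. This is only a sketch: `run_with_gpu_preload` is a hypothetical name, not something EESSI provides, and it assumes `EESSI_GPU_COMPAT_LD_PRELOAD` has already been exported as shown above.

```shell
# Hypothetical wrapper, not provided by EESSI: run a single command with the
# minimal preload list in its environment, leaving the calling shell untouched.
run_with_gpu_preload() {
    if [ -z "${EESSI_GPU_COMPAT_LD_PRELOAD:-}" ]; then
        echo "EESSI_GPU_COMPAT_LD_PRELOAD is not set" >&2
        return 1
    fi
    # env sets LD_PRELOAD only for the command it executes
    env LD_PRELOAD="$EESSI_GPU_COMPAT_LD_PRELOAD" "$@"
}
```

With this in place, `run_with_gpu_preload device_query` would behave like the one-line form above. Because the command is started through `env`, it has to be an executable rather than a shell function or alias.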
bot: build repo:eessi.io-2023.06-software arch:x86_64/generic
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/generic` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/generic`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/generic` resulted in:
  - submitted job `23806`, for details & status see https://github.com/EESSI/software-layer/pull/754#issuecomment-2419419685
Updates by the bot instance boegel-bot-deucalion
- account `ocaisa` has NO permission to send commands to the bot
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/generic` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/generic`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/generic` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.10/pr_754/23806
| date | job status | comment |
|---|---|---|
| Oct 17 12:33:27 UTC 2024 | submitted | job id 23806 awaits release by job manager |
| Oct 17 12:33:30 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Oct 17 12:34:35 UTC 2024 | running | job 23806 is running |
| Oct 17 12:40:51 UTC 2024 | finished | :grin: SUCCESS |
| Oct 17 12:40:51 UTC 2024 | test result | :grin: SUCCESS |
Also tested the script within eessi_container:
Found host CUDA version 9.0
Found NVIDIA GPU driver version 535.129.03
Using downloaded list of libraries
Matched 41 CUDA Libraries
The host GPU driver libraries (v535.129.03) have already been linked! (based on /cvmfs/software.eessi.io/host_injections/nvidia/aarch64/host/driver_version.txt)
Successfully created symlink between /cvmfs/software.eessi.io/host_injections/nvidia/aarch64/latest and lib in /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/aarch64
Host NVIDIA GPU drivers linked successfully for EESSI
@TopRichard This will need to be re-tested now to make sure the changes haven't had an unintended impact
bot: build repo:eessi.io-2023.06-software arch:x86_64/generic
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/generic` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/generic`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/generic` resulted in:
  - submitted job `40795`, for details & status see https://github.com/EESSI/software-layer/pull/754#issuecomment-2595764166
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/generic` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/generic`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/generic` resulted in:
  - no jobs were submitted
Updates by the bot instance eessi-bot-vsc-ugent
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/generic` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/generic`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/generic` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.01/pr_754/40795
| date | job status | comment |
|---|---|---|
| Jan 16 13:57:47 UTC 2025 | submitted | job id 40795 awaits release by job manager |
| Jan 16 13:57:55 UTC 2025 | released | job awaits launch by Slurm scheduler |
| Jan 16 14:02:58 UTC 2025 | running | job 40795 is running |
| Jan 16 14:10:06 UTC 2025 | finished | :grin: SUCCESS |
| Jan 16 14:10:06 UTC 2025 | test result | :grin: SUCCESS |
| Jan 17 10:59:44 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-generic-1737036231.tar.gz to S3 bucket succeeded |
> @TopRichard This will need to be re-tested now to make sure the changes haven't had an unintended impact
re-testing:
Apptainer> /cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
Found host CUDA version 9.0
Found NVIDIA GPU driver version 535.129.03
Using downloaded list of libraries
Matched 41 CUDA Libraries
Successfully created symlink between latest and host in /cvmfs/software.eessi.io/host_injections/nvidia/aarch64
Successfully created symlink between /cvmfs/software.eessi.io/host_injections/nvidia/aarch64/latest and lib in /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/aarch64
Host NVIDIA GPU drivers linked successfully for EESSI
@bedroge This was deployed, so PR should be merged too?
> @bedroge This was deployed, so PR should be merged too?
Yes, the tarball has been ingested.
PR merged! Moved ['/project/def-users/SHARED/jobs/2024.10/pr_754/23806', '/project/def-users/SHARED/jobs/2025.01/pr_754/40795'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.01.17
PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.01.17