ompi
Newer versions of OpenMPI are unable to locate CUDA support.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
The bug exists in 5.0.1; I don't know whether it also exists in earlier or later releases.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
The issue occurs with both the source tarball and a git clone; I've tested both.
Please describe the system on which you are running
Two node system
- Operating system/version: RHEL 9.2 (Plow)
- Computer hardware: x86_64
Details of the problem
I used to be able to get CUDA support in Open MPI by simply passing --with-cuda=/usr/local/cuda to configure. Now it seems I also need --with-cuda-libdir. Without this additional flag, configure reports no support for NVIDIA devices (CUDA support: no). I believe this will cause problems for users who rebuild OMPI at a newer version and suddenly find that CUDA support is missing.
As @hppritcha pointed out, this is indeed documented: https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html
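For anyone rebuilding, a quick way to check whether a finished build picked up CUDA is to look for the mpi_built_with_cuda_support parameter in the ompi_info output, as the Open MPI documentation describes. The sketch below wraps that check in a small helper; the function name has_cuda_support is made up for illustration, and the helper only inspects text on stdin.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not part of Open MPI):
# reads `ompi_info --parsable --all` output on stdin and succeeds only
# if the build reports that it was compiled with CUDA support.
has_cuda_support() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

# Usage (requires an installed Open MPI):
#   ompi_info --parsable --all | has_cuda_support && echo "CUDA-aware build"
```

If the helper fails on a build that was configured with only --with-cuda, that matches the behavior reported here.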
@tmh97 Per the Webex today, could you provide a little more info? E.g.:
- As you stated above, running ./configure --with-cuda=/usr/local/cuda ... fails to find CUDA support.
- Does running ./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 ... work?
  - I.e., what is the specific libdir that you provide to --with-cuda-libdir that makes this work?
@jsquyres --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/ worked well for OpenMPI 5.0.0/1
It seems /usr/local/cuda/lib64 is where the CUDA runtime API resides. I believe this is the path we wish to target.
Alternatively, /usr/lib64 also contains CUDA-related files, but I believe these are for the CUDA driver API, which is not what we want (I think).
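To see which of these directories actually holds the driver library (libcuda.so) and which holds the runtime library (libcudart.so), something like the following sketch may help. The helper name list_cuda_libs is hypothetical, and the directories passed to it are just the ones mentioned in this thread; they may differ on other systems.

```shell
#!/bin/sh
# Hypothetical helper: for each directory given, print the path of any
# CUDA driver (libcuda.so) or runtime (libcudart.so) library it contains.
list_cuda_libs() {
    for dir in "$@"; do
        for lib in libcuda.so libcudart.so; do
            if [ -e "$dir/$lib" ]; then
                echo "$dir/$lib"
            fi
        done
    done
}

# Usage (paths taken from this thread; adjust for your system):
#   list_cuda_libs /usr/local/cuda/lib64 /usr/local/cuda/lib64/stubs /usr/lib64
```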
Do we know that that is correct?
- You're saying --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 works.
- But the docs @hppritcha cited state that --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs is correct. I.e., this stubs folder at the end of the libdir is needed.
Given that the docs were specifically written that way, is it correct to assume that there is a reason stubs is the correct way, and not including stubs in the libdir is wrong for some reason?
Alternatively, @edgargabriel stated today on the call that configuring --with-lustre=/blah did not find the Lustre libraries in /blah/lib64.
@edgargabriel Can you confirm that this is correct / what is currently happening on main and v5.0.x?
On Ubuntu 20.04, I need:
# Open MPI 4.1:
./configure --with-cuda
# Open MPI main:
./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs
Note that I seem to need to specify paths for both cuda and cuda-libdir. Adding a path for libdir alone was not enough.
Yes, having to specify both --with-cuda and --with-cuda-libdir is expected. I'm asking if the stubs part is really necessary -- the docs were clearly written that way on purpose. And why does not specifying stubs work for @tmh97?
I went back to the cluster with the Lustre file system, and I can clearly see in bash_history that for a while I configured Open MPI with the --with-lustre=/opt/lustre/2.12.2 --with-lustre-libdir=/opt/lustre/2.12.2/lib64 arguments. Since I didn't do that in the past, it was probably because configure wasn't working without the libdir argument (which is also what I remembered).
However, as of right now, it looks like I don't need to set --with-lustre-libdir anymore; it configures correctly again without that argument.
Ok, so then this question really is just about --with-cuda -- not the general OAC --with-FOO handling.
- Is it incorrect to not specify the stubs folder in the --with-cuda-libdir? (the docs imply that stubs is necessary)
- Can config/opal_check_cuda.m4 be updated to automagically handle searching for stubs?
The stubs point to a libcuda.so that allows linking CUDA applications using the driver API (such as OMPI) on platforms without GPUs. This is different from what other libraries require, but there are valid reasons. I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4.
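As a rough illustration of what "automatically checking for the stubs" could mean, here is a minimal shell sketch that looks for libcuda.so under a CUDA install root and falls back to the stubs subdirectory. This is not the logic in config/opal_check_cuda.m4; the function name find_libcuda_dir, the list of subdirectories, and the search order are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch: print the first directory under a CUDA install
# root that contains libcuda.so, trying each regular library directory
# before its stubs subdirectory. Returns non-zero if none is found.
find_libcuda_dir() {
    root=$1
    for sub in lib64 lib64/stubs lib lib/stubs; do
        if [ -e "$root/$sub/libcuda.so" ]; then
            echo "$root/$sub"
            return 0
        fi
    done
    return 1
}

# Usage (assumes CUDA is installed under /usr/local/cuda):
#   ./configure --with-cuda=/usr/local/cuda \
#       --with-cuda-libdir="$(find_libcuda_dir /usr/local/cuda)"
```

On a machine without a GPU driver installed, a search like this would be expected to land on the stubs directory, which matches the path given in the docs.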
> I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4.
Cool. Can someone in NVIDIA look into this? Hint, hint. 😄
Fixed with #12382.