Newer versions of OpenMPI are unable to locate CUDA support.

Open tmh97 opened this issue 1 year ago • 10 comments

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

The bug exists in v5.0.1. I do not know whether it also exists in earlier or later releases.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

This issue occurs with both the source tarball and a git clone; I've tested both.

Please describe the system on which you are running

Two node system

  • Operating system/version: RHEL 9.2 (Plow)
  • Computer hardware: x86_64

Details of the problem

I used to be able to get CUDA support in Open MPI by simply passing the --with-cuda=/usr/local/cuda option to OMPI's configure. Now it seems I also need the --with-cuda-libdir option. Without this additional flag, configure behaves as if there is no support for NVIDIA devices and reports "CUDA support: no". I believe this will cause problems for users who rebuild OMPI at a newer version and suddenly find that their CUDA support is gone.
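One way to confirm whether a finished build actually has CUDA support is to query ompi_info for the mpi_built_with_cuda_support parameter. The snippet below is a minimal sketch of that check; the function name cuda_support_from_ompi_info is hypothetical and only parses text fed to it on stdin, so it can be tried without Open MPI installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): reads the output of
# `ompi_info --parsable --all` on stdin and prints "true" or "false".
# It looks for a line such as:
#   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
cuda_support_from_ompi_info() {
    # Keep only the text after "...:value:" on the first matching line.
    value=$(sed -n 's/.*mpi_built_with_cuda_support:value://p' | head -n 1)
    if [ "$value" = "true" ]; then
        echo "true"
    else
        # Covers both an explicit "false" and a missing parameter.
        echo "false"
    fi
}

# Usage on a machine with Open MPI installed:
#   ompi_info --parsable --all | cuda_support_from_ompi_info
```

Treating a missing parameter as "false" is a design choice of this sketch, so a build without the CUDA component is reported the same way as one that explicitly disables it.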

tmh97 avatar Jan 22 '24 23:01 tmh97

As @hppritcha pointed out, this is indeed documented https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html.

tmh97 avatar Jan 23 '24 16:01 tmh97

@tmh97 Per the Webex today, could you provide a little more info? E.g.:

  • As you stated above, running ./configure --with-cuda=/usr/local/cuda ... fails to find CUDA support.
  • Does running ./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 ... work?
    • I.e., what is the specific libdir that you provide to --with-cuda-libdir that makes this work?

jsquyres avatar Jan 23 '24 20:01 jsquyres

@jsquyres --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/ worked well for OpenMPI 5.0.0/1

It seems /usr/local/cuda/lib64 is where the CUDA runtime API resides. I believe this is the path we wish to target.

Alternatively, /usr/lib64 also contains CUDA-related files, but I believe these are for the CUDA driver API, which is not what we want (I think).
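To check which API a given directory actually provides, one can look at the library file names: libcuda.so* is the driver API library and libcudart.so* is the runtime API library. The following is a rough sketch of such a check; classify_cuda_libdir is a hypothetical helper name, and it only inspects file names, not file contents.

```shell
#!/bin/sh
# Hypothetical helper: report which CUDA API libraries a directory holds.
# Prints one of: driver, runtime, both, none.
classify_cuda_libdir() {
    dir=$1
    has_driver=no
    has_runtime=no
    # Matches libcuda.so, libcuda.so.1, ... (but not libcudart.so*).
    for f in "$dir"/libcuda.so*; do
        if [ -e "$f" ]; then has_driver=yes; fi
    done
    # Matches libcudart.so, libcudart.so.12, ...
    for f in "$dir"/libcudart.so*; do
        if [ -e "$f" ]; then has_runtime=yes; fi
    done
    case "$has_driver/$has_runtime" in
        yes/yes) echo "both" ;;
        yes/no)  echo "driver" ;;
        no/yes)  echo "runtime" ;;
        *)       echo "none" ;;
    esac
}

# Usage:
#   classify_cuda_libdir /usr/local/cuda/lib64
#   classify_cuda_libdir /usr/lib64
```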

tmh97 avatar Jan 23 '24 21:01 tmh97

Do we know that that is correct?

  • You're saying --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 works.
  • But the docs @hppritcha cited state that --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs is correct. I.e., this stubs folder at the end of the libdir is needed.

Given that the docs were specifically written that way, is it correct to assume that there is a reason stubs is the correct way, and not including stubs in the libdir is wrong for some reason?

Alternatively, @edgargabriel stated today on the call that configuring --with-lustre=/blah didn't work to find the Lustre libraries in /blah/lib64.

@edgargabriel Can you confirm that this is correct / what is currently happening on main and v5.0.x?

jsquyres avatar Jan 23 '24 21:01 jsquyres

On Ubuntu 20.04, I need:

# Open MPI 4.1:
./configure --with-cuda
# Open MPI main:
./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs

Note that I seem to need to specify paths for both cuda and cuda-libdir. Adding a path for libdir alone was not enough.

lrbison avatar Jan 23 '24 21:01 lrbison

Yes, having to specify both --with-cuda and --with-cuda-libdir is expected. I'm asking if the stubs part is really necessary -- the docs were clearly written that way on purpose. And why does not specifying stubs work for @tmh97?

jsquyres avatar Jan 23 '24 21:01 jsquyres

I went back to the cluster with the Lustre file system, and I can clearly see in bash_history that for a while I configured Open MPI with the --with-lustre=/opt/lustre/2.12.2 --with-lustre-libdir=/opt/lustre/2.12.2/lib64 arguments. Since I didn't use to do that in the past, it was probably because it wasn't working without the libdir argument (which is also what I remembered).

However, as of right now, it looks like I don't need to set --with-lustre-libdir anymore; it configures correctly again without that argument.

edgargabriel avatar Jan 23 '24 22:01 edgargabriel

Ok, so then this question really is just about --with-cuda -- not the general OAC --with-FOO handling.

  1. Is it incorrect to not specify the stubs folder in the --with-cuda-libdir? (the docs imply that stubs is necessary)
  2. Can config/opal_check_cuda.m4 be updated to automagically handle searching for stubs?

jsquyres avatar Jan 23 '24 22:01 jsquyres

The stubs point to a libcuda.so that allows linking CUDA applications using the driver API (such as OMPI) on platforms without GPUs. This is different from what other libraries require, but there are valid reasons. I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4.
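As an illustration of what an automatic search could look like, here is a small shell sketch that probes a CUDA root for a directory containing libcuda.so, trying the stubs locations first. The function name find_libcuda_dir and the search order are assumptions made for this example; this is not the logic in config/opal_check_cuda.m4.

```shell
#!/bin/sh
# Hypothetical helper: print the first subdirectory of a CUDA root that
# contains libcuda.so. The candidate list and its order are assumptions.
find_libcuda_dir() {
    root=$1
    for sub in lib64/stubs lib/stubs lib64 lib; do
        if [ -e "$root/$sub/libcuda.so" ]; then
            echo "$root/$sub"
            return 0
        fi
    done
    # Nothing found: print nothing and signal failure to the caller.
    return 1
}

# Usage:
#   libdir=$(find_libcuda_dir /usr/local/cuda) &&
#       ./configure --with-cuda=/usr/local/cuda --with-cuda-libdir="$libdir"
```

Checking the stubs directories first mirrors the point above: the stub libcuda.so is what lets the build link on machines that have the toolkit but no GPU driver installed.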

bosilca avatar Jan 23 '24 22:01 bosilca

I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4.

Cool. Can someone in NVIDIA look into this? Hint, hint. 😄

jsquyres avatar Jan 24 '24 14:01 jsquyres

fixed with #12382

janjust avatar Mar 06 '24 16:03 janjust