
Singleton init mode, MPI_Comm_spawn, and OPAL_PREFIX

dalcinl opened this issue

I'm working on producing a binary Open MPI Python wheel to allow for pip install openmpi. I had to add a few hacks here and there to overcome issues strictly related to how wheel files are stored, how they are installed in Python virtual environments, the lack of post-install hooks, and the expectation that things work without activating the environment. All of this, of course, requires relocating the Open MPI installation. At that point I found a minor issue that I don't know how to overcome.

First, a clarification: I'm using the internal PMIx and PRRTE. Supposedly, the OPAL_PREFIX env var is all that is needed for things to work out of the box when relocating an Open MPI install. However, I think I came across a corner case: if using singleton init mode, I believe OPAL_PREFIX is simply ignored, and if the tools cannot be located via $PATH, things do not work.

Looking at the spawn code, I see a function start_dvm in ompi/dpm/dpm.c that contains the following:

    /* find the prte binary using the install_dirs support - this also
     * checks to ensure that we can see this executable and it *is* executable by us
     */
    cmd = opal_find_absolute_path("prte");

However, I believe opal_find_absolute_path() does not consider the OPAL_PREFIX env var at all; it only ever searches PATH. The comment "find the prte binary using the install_dirs support" is simply not true.
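For context, here is a rough, self-contained sketch of the PATH-only behavior described above. This is not the actual OPAL implementation, and the helper name is made up; it just illustrates why setting OPAL_PREFIX alone cannot help the singleton spawn path:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical helper, NOT the real opal_find_absolute_path(): it walks
     * PATH and nothing else, which is effectively what the singleton spawn
     * path ends up doing today. */
    static char *find_in_path_only(const char *app)
    {
        const char *path = getenv("PATH");   /* OPAL_PREFIX is never consulted */
        if (NULL == path) {
            return NULL;
        }
        char *copy = strdup(path);
        char *save = NULL;
        for (char *dir = strtok_r(copy, ":", &save); NULL != dir;
             dir = strtok_r(NULL, ":", &save)) {
            char candidate[4096];
            snprintf(candidate, sizeof(candidate), "%s/%s", dir, app);
            if (0 == access(candidate, X_OK)) {   /* visible and executable by us */
                char *found = strdup(candidate);
                free(copy);
                return found;
            }
        }
        free(copy);
        return NULL;   /* prte not on PATH: a relocated install is never found */
    }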

@hppritcha Your input is much appreciated here. I do have a reproducer, but so far it is based on Python and intermediate binary assets I'm generating locally. I hope the description above is enough for you to recognize the issue.

dalcinl avatar Feb 19 '24 16:02 dalcinl

I'll take a look at adjusting the path search. Having enough problems with #12307 at the moment.

hppritcha avatar Feb 19 '24 17:02 hppritcha

> I'll take a look at adjusting the path search. Having enough problems with #12307 at the moment.

Sorry! :cry:

dalcinl avatar Feb 19 '24 17:02 dalcinl

@dalcinl Just to be clear, is this a problem on the main branch as well?

wenduwan avatar Feb 22 '24 17:02 wenduwan

@wenduwan I discovered this issue testing with the 5.0.2 tarball. I cannot assert anything about the main branch, but a quick glance at the code seems to indicate that the root issue is also there.

dalcinl avatar Feb 22 '24 17:02 dalcinl

Adding the main label so we don't ignore it for e.g. 5.1.x

wenduwan avatar Feb 22 '24 17:02 wenduwan

@jsquyres Would Open MPI accept platform-specific ways to derive OPAL_PREFIX automatically, relieving users of the need to set it explicitly?

dalcinl avatar Feb 22 '24 17:02 dalcinl

All that is needed is to translate OPAL_PREFIX to PRTE_PREFIX in mpirun - but you can only do that if you are using the internal PRRTE. Same is true for PMIX_PREFIX.

rhc54 avatar Feb 22 '24 18:02 rhc54

If this is about singleton comm-spawn, then you need to put the translation into the bottom of the dpm.c file before we fork/exec prte. Again, with external PRRTE/PMIx protected.
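If it helps, a minimal sketch of that translation might look like the following; the placement in dpm.c, the omitted build-time guard for external PRRTE/PMIx, and the helper name are all assumptions on my part, not the actual patch:

    #include <stdlib.h>

    /* Hypothetical helper: map OPAL_PREFIX onto the prefixes that the bundled
     * PRRTE/PMIx understand, just before fork/exec'ing prte.  This only makes
     * sense for the internal PRRTE/PMIx; an external build may live under a
     * different prefix and should be left alone (guard omitted here). */
    static void translate_opal_prefix(void)
    {
        const char *opal_prefix = getenv("OPAL_PREFIX");
        if (NULL == opal_prefix) {
            return;
        }
        /* The trailing 0 means "do not overwrite" an explicitly set value. */
        setenv("PRTE_PREFIX", opal_prefix, 0);
        setenv("PMIX_PREFIX", opal_prefix, 0);
    }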

rhc54 avatar Feb 22 '24 18:02 rhc54

@rhc54 Aren't you oversimplifying things a bit? opal_find_absolute_path("prte") will keep ignoring any XXX_PREFIX variable you may set, as it only looks in PATH.

I know almost nothing about the ompi codebase, but it looks to me that, after the environment variable translation you described, you would also have to somehow replicate the logic of find_prterun() in ompi/tools/mpirun/main.c [link]. I may be talking nonsense; in that case just ignore me.

dalcinl avatar Feb 22 '24 20:02 dalcinl

I wasn't oversimplifying things - just pointing out that this isn't a difficult problem. Sure, you need to update the dpm code to duplicate that in mpirun, but that's a copy/paste operation. The code in mpirun is actually doing the "right thing" - it's just the dpm that is out of date.

rhc54 avatar Feb 22 '24 21:02 rhc54

Try #12363 - took longer to compile than to do 🤷‍♂️

rhc54 avatar Feb 22 '24 23:02 rhc54

This issue is losing steam. @dalcinl I wonder if the issue is covered by mpi4py CI? The ompi mpi4py CI looks green now. Is this still a problem?

wenduwan avatar May 02 '24 15:05 wenduwan

@wenduwan I believe this issue is already fixed in main, at least in my local testing. I'm still getting some warnings, but these are unrelated:

[kw61149:2141245] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.kw61149.1000/jf.0/1892286464/shared_mem_cuda_pool.kw61149 could be created.
[kw61149:2141245] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728 

And no, the mpi4py CI tests do not specifically test for this issue. After all the work @jsquyres put into it, adding a new workflow should be quite easy if you really want to have a quick test for relocation via OPAL_PREFIX+LD_LIBRARY_PATH.

dalcinl avatar May 04 '24 08:05 dalcinl

The warnings are real. @edgargabriel and @lrbison both observed them in singleton mode.

I think adding a minimal relocation test is reasonable. IIUC the test will look like the following:

  1. Install Open MPI as normal
  2. Install mpi4py as normal
  3. Move Open MPI installation to another location, e.g. mv /original/install /new/install
  4. export OPAL_PREFIX=/new/install/...
  5. export LD_LIBRARY_PATH=/new/install/lib:$LD_LIBRARY_PATH
  6. Run tests with /new/install/bin/mpirun ... -x LD_LIBRARY_PATH -x OPAL_PREFIX

Please correct me if I'm wrong.

wenduwan avatar May 06 '24 14:05 wenduwan

Not sure I understand this report. Are you saying running mpirun generates a warning? That isn't a singleton issue, so what precisely are you saying is wrong?

rhc54 avatar May 06 '24 14:05 rhc54

No, it's the other way around: with mpirun there is no warning, only with singleton.

edgargabriel avatar May 06 '24 15:05 edgargabriel

Thanks - and the warning is still there on head of current main?

rhc54 avatar May 06 '24 15:05 rhc54

I think so; I think I saw it a few days ago. I can double check in a few days and look into it. It's coming from the btl/smcuda component, I think.

edgargabriel avatar May 06 '24 15:05 edgargabriel

You see this smcuda warning any time the MPI process doesn't have a PRRTE PMIx server to talk to. So singleton and direct launch with srun (and likely palsrun) also give the warning, one instance per node of the job as I recall. You don't see this with btl/sm because it has a fallback if there doesn't appear to be a session dir created by a PMIx server.

hppritcha avatar May 06 '24 15:05 hppritcha

Yeah, this has nothing to do with moving the installation. I had posted a patch to silence this warning, but it ran into a bunch of nit-picking and so I deleted it. It's a trivial fix to silence it, if someone else wants to try to get it committed.

rhc54 avatar May 06 '24 16:05 rhc54

@wenduwan #12526

dalcinl avatar May 06 '24 17:05 dalcinl

@dalcinl Thank you very much!

wenduwan avatar May 06 '24 20:05 wenduwan