
OMPI v4.x fails when built against PMIx v5

Open rhc54 opened this issue 3 years ago • 6 comments

The folks at SchedMD have been running into a problem when running OMPI v4.x applications under Slurm when both are built against PMIx v5. They also verified that the same problem exists when launching OMPI v4.x applications with OMPI's own mpirun command, if OMPI v4.x is built against PMIx v5.

So it looks like something in the OMPI v4.x PMIx integration has an issue with PMIx v5. The problem is reported as:

The ompi 4.1 error is this...

srun --mpi=pmix_v5 -n1 helloworld
srun: error: snowflake7: task 0: Exited with exit code 1
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIX stopped checking at the first component that it did not find.

Host:      snowflake
Framework: psec
Component: munge
--------------------------------------------------------------------------

--------------------------------------------------------------------------
It looks like pmix_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
PMIX developer):

  pmix_psec_base_open failed
  --> Returned value -46 instead of PMIX_SUCCESS
--------------------------------------------------------------------------

[snowflake:1309404] PMIX ERROR: NOT-FOUND in file ../../../../../../../../ompi/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client.c at line 562
[snowflake:1309404] OPAL ERROR: Not found in file ../../../../../../ompi/opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[snowflake:1309404] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Note that they thought they had built OMPI v4 against the external PMIx v5 installation - yet the error output is coming from the embedded PMIx v3 code. So it isn't clear if the problem is in the configure code or the pmix integration component.
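To make that observation concrete, here is a minimal sketch that scans captured error output for lines whose source path points into the embedded component directory (`opal/mca/pmix/pmix3x/`, the fragment visible in the two error lines above). The function name, the regular expression, and the choice to treat that path fragment as the marker for the embedded copy are assumptions made for illustration; they are not part of OMPI or PMIx.

```python
import re

# Path fragment of the embedded PMIx v3 component, taken from the error
# output quoted above. Treating it as "the embedded copy" is an assumption.
EMBEDDED_MARKER = "opal/mca/pmix/pmix3x/"

# Matches lines such as:
#   [host:pid] PMIX ERROR: NOT-FOUND in file <path> at line <n>
ERROR_LINE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"(?P<layer>PMIX|OPAL) ERROR: .* in file (?P<path>\S+) at line (?P<line>\d+)"
)


def embedded_pmix_hits(log_text):
    """Return (layer, path, line) for each error raised from the embedded pmix3x tree."""
    hits = []
    for raw in log_text.splitlines():
        m = ERROR_LINE.match(raw.strip())
        if m and EMBEDDED_MARKER in m.group("path"):
            hits.append((m.group("layer"), m.group("path"), int(m.group("line"))))
    return hits
```

A non-empty result for a job that was supposedly built against an external PMIx would suggest the embedded code path is still the one being exercised. It would not say whether configure or the component selection is responsible.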

Tests run using OMPI master and OMPI v5 release branch are all successful, so this seems to be something constrained to the OMPI v4 series.

rhc54 avatar Apr 21 '22 17:04 rhc54

@dannyauble @wickberg Looping you into this as you uncovered the problem and might be able to answer questions about it.

rhc54 avatar Apr 21 '22 22:04 rhc54

I am also seeing the munge-not-found error and other issues with OMPI 4.1.5 and 4.1.6rc2. The only fix was to build OMPI against an external PMIx v4.2.6. The externally built PMIx can see all the munge libs with no problem.

miesav avatar Sep 21 '23 03:09 miesav

OMPI v4.x uses an internal copy of PMIx v3.y, not PMIx v4.y. So it is entirely possible that the munge discovery logic has changed across PMIx major release families.

However, please note the original reported problem:

Note that they thought they had built OMPI v4 against the external PMIx v5 installation - yet the error output is coming from the embedded PMIx v3 code. So it isn't clear if the problem is in the configure code or the pmix integration component.

rhc54 avatar Sep 21 '23 22:09 rhc54

Hey @rhc54, we've solved the originally-reported issue, right (via 6e8e14f2c2)? I.e., should we close this?

jsquyres avatar Sep 24 '23 13:09 jsquyres

Not entirely resolved - the biggest question is this:

Note that they thought they had built OMPI v4 against the external PMIx v5 installation - yet the error output is coming from the embedded PMIx v3 code. So it isn't clear if the problem is in the configure code or the pmix integration component.

In other words, why is the internal component active when OMPI v4.1.x is configured against an external PMIx v5 installation?

rhc54 avatar Sep 24 '23 21:09 rhc54

I would suggest someone start by confirming that the internal component really is active in a build that was configured against an external PMIx.
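One possible starting point, sketched under assumptions: list which components of the `pmix` framework a given OMPI v4.x install was built with by reading the output of `ompi_info`. The sketch assumes `ompi_info` prints one line per component in the form `MCA pmix: <name> (...)`. The helper names are hypothetical. Reading `pmix3x` as the embedded copy is inferred from the file paths in the error output, not confirmed by anyone in this thread.

```python
import re
import shutil
import subprocess

# Assumed ompi_info line shape:
#   "MCA pmix: <component> (MCA v..., API v..., Component v...)"
PMIX_COMPONENT = re.compile(r"^\s*MCA\s+pmix:\s+(\S+)")


def pmix_components(ompi_info_text):
    """Return the names of the pmix framework components found in ompi_info output."""
    names = []
    for line in ompi_info_text.splitlines():
        m = PMIX_COMPONENT.match(line)
        if m:
            names.append(m.group(1))
    return names


def installed_pmix_components():
    """Run ompi_info if it is on PATH; return None when it is not available."""
    exe = shutil.which("ompi_info")
    if exe is None:
        return None
    result = subprocess.run([exe], capture_output=True, text=True, check=True)
    return pmix_components(result.stdout)
```

Calling `installed_pmix_components()` on the affected system would show whether `pmix3x` was built at all. This only reports what is installed, not which component is selected at run time, so it narrows the question rather than answering it.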

If you can resolve the external vs internal question, then the question devolves to whether or not the internal copy of PMIx v3.x is being built with munge support when munge is present on the system. It should build by default if munge is present. The error message implies that this is not happening.

Neither of those issues was impacted by the cited commit.

rhc54 avatar Sep 24 '23 21:09 rhc54