
OMPI v5.0.0RC6 with ORTE fails to compile when built against PMIx 3.2.3

robert-mijakovic opened this issue 3 years ago • 10 comments

Background information

What version of Open MPI are you using?

  • OpenMPI v5.0.0RC6 with GCC 10.3.0.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • The package was installed from the distribution tarball of v5.0.0RC6.

Please describe the system on which you are running

  • Operating system/version: Rocky Linux 8.5
  • Computer hardware: BullSequana XH2000, 2xAMD EPYC 7H12 64C 2.6GHz
  • Network type: Mellanox HDR200 InfiniBand/ParTec ParaStation ClusterSuite

Details of the problem

Dependencies:

  • Compiler: GCC 10.3.0
  • hwloc 2.7.1
  • libevent 2.1.12
  • UCX 1.12.1
  • libfabric 1.14.0
  • PMIx 3.2.3
  • HCOLL 4.7.3202
  • xpmem 2.6.5-36
  • knem 1.1.4

Open MPI 5.0.0RC6 with ORTE fails to compile when built against PMIx 3.2.3. We chose that PMIx version because our SLURM installation was built against it as well. Since PRRTE is only available with PMIx 4.x, I had to disable it and enable ORTE instead. However, as you can see below, the build fails because definitions from PMIx 4.x appear to be required.

The build is configured using the following options:

$ ./configure --prefix=/apps/USE/easybuild/staging/2021.1/software/OpenMPI/5.0.0-GCC-10.3.0  --build=x86_64-pc-linux-gnu  --host=x86_64-pc-linux-gnu --without-prrte --with-libevent-libdir=$EBROOTLIBEVENT/lib --with-pmix-libdir=$EBROOTPMIX/lib --with-hwloc-libdir=$EBROOTHWLOC/lib --with-ofi-libdir=$EBROOTOFI/lib --with-ucx-libdir=$EBROOTUCX/lib --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default  --enable-shared  --with-cuda=no 
  CC       instance/instance.lo
In file included from /apps/USE/easybuild/staging/2021.1/software/PMIx/3.2.3-GCCcore-10.3.0/include/pmix.h:52,
                 from ../opal/mca/pmix/pmix-internal.h:47,
                 from ../opal/mca/pmix/base/base.h:23,
                 from communicator/comm_cid.c:39:
communicator/comm_cid.c: In function ompi_comm_ext_cid_new_block:
communicator/comm_cid.c:339:28: error: PMIX_GROUP_ASSIGN_CONTEXT_ID undeclared (first use in this function)
  339 |     PMIX_INFO_LOAD(&pinfo, PMIX_GROUP_ASSIGN_CONTEXT_ID, NULL, PMIX_BOOL);
      |                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/apps/USE/easybuild/staging/2021.1/software/PMIx/3.2.3-GCCcore-10.3.0/include/pmix_common.h:1566:22: note: in definition of macro PMIX_INFO_LOAD
 1566 |         if (NULL != (k)) {                                  \
      |                      ^
communicator/comm_cid.c:339:28: note: each undeclared identifier is reported only once for each function it appears in
  339 |     PMIX_INFO_LOAD(&pinfo, PMIX_GROUP_ASSIGN_CONTEXT_ID, NULL, PMIX_BOOL);
      |                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/apps/USE/easybuild/staging/2021.1/software/PMIx/3.2.3-GCCcore-10.3.0/include/pmix_common.h:1566:22: note: in definition of macro PMIX_INFO_LOAD
 1566 |         if (NULL != (k)) {                                  \
      |                      ^
communicator/comm_cid.c:346:10: warning: implicit declaration of function PMIx_Group_construct [-Wimplicit-function-declaration]
  346 |     rc = PMIx_Group_construct(tag, procs, proc_count, &pinfo, 1, &results, &nresults);
      |          ^~~~~~~~~~~~~~~~~~~~
communicator/comm_cid.c:357:10: warning: implicit declaration of function PMIx_Group_destruct [-Wimplicit-function-declaration]
  357 |     rc = PMIx_Group_destruct (tag, NULL, 0);
      |          ^~~~~~~~~~~~~~~~~~~
make[2]: *** [Makefile:2595: communicator/comm_cid.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
instance/instance.c: In function ompi_mpi_instance_init_common:
instance/instance.c:424:55: error: PMIX_ERR_LOST_CONNECTION undeclared (first use in this function); did you mean PMIX_ERR_LOST_PEER_CONNECTION?
  424 |     pmix_status_t codes[2] = { PMIX_ERR_PROC_ABORTED, PMIX_ERR_LOST_CONNECTION };
      |                                                       ^~~~~~~~~~~~~~~~~~~~~~~~
      |                                                       PMIX_ERR_LOST_PEER_CONNECTION
instance/instance.c:424:55: note: each undeclared identifier is reported only once for each function it appears in
runtime/ompi_rte.c: In function ompi_rte_breakpoint:
runtime/ompi_rte.c:1071:20: error: PMIX_DEBUGGER_RELEASE undeclared (first use in this function); did you mean PMIX_ERR_DEBUGGER_RELEASE?
 1071 |     int rc, code = PMIX_DEBUGGER_RELEASE;
      |                    ^~~~~~~~~~~~~~~~~~~~~
      |                    PMIX_ERR_DEBUGGER_RELEASE
runtime/ompi_rte.c:1071:20: note: each undeclared identifier is reported only once for each function it appears in
runtime/ompi_rte.c:1094:10: warning: implicit declaration of function PMIX_CHECK_RANK; did you mean PMIX_PROC_RANK? [-Wimplicit-function-declaration]
 1094 |     if (!PMIX_CHECK_RANK(u32, opal_process_info.myprocid.rank)) {
      |          ^~~~~~~~~~~~~~~
      |          PMIX_PROC_RANK
instance/instance.c: In function ompi_instance_get_num_psets_complete:
instance/instance.c:949:37: error: PMIX_QUERY_NUM_PSETS undeclared (first use in this function); did you mean PMIX_QUERY_NAMESPACES?
  949 |         if (0 == strcmp(info[n].key,PMIX_QUERY_NUM_PSETS)) {
      |                                     ^~~~~~~~~~~~~~~~~~~~
      |                                     PMIX_QUERY_NAMESPACES
In file included from /apps/USE/easybuild/staging/2021.1/software/PMIx/3.2.3-GCCcore-10.3.0/include/pmix.h:52,
                 from ../opal/mca/pmix/pmix-internal.h:47,
                 from ../opal/util/proc.h:26,
                 from runtime/ompi_rte.c:47:
runtime/ompi_rte.c:1106:30: error: PMIX_BREAKPOINT undeclared (first use in this function)
 1106 |     PMIX_INFO_LOAD(&info[1], PMIX_BREAKPOINT, "mpi-init", PMIX_STRING);
      |                              ^~~~~~~~~~~~~~~
/apps/USE/easybuild/staging/2021.1/software/PMIx/3.2.3-GCCcore-10.3.0/include/pmix_common.h:1566:22: note: in definition of macro PMIX_INFO_LOAD
 1566 |         if (NULL != (k)) {                                  \
      |                      ^
instance/instance.c:964:46: error: PMIX_QUERY_PSET_NAMES undeclared (first use in this function); did you mean PMIX_QUERY_CREATE?
  964 |         } else if (0 == strcmp (info[n].key, PMIX_QUERY_PSET_NAMES)) {
      |                                              ^~~~~~~~~~~~~~~~~~~~~
      |                                              PMIX_QUERY_CREATE
runtime/ompi_rte.c:1107:23: error: PMIX_READY_FOR_DEBUG undeclared (first use in this function)
 1107 |     PMIx_Notify_event(PMIX_READY_FOR_DEBUG,
      |                       ^~~~~~~~~~~~~~~~~~~~
instance/instance.c: In function ompi_instance_get_num_psets:
instance/instance.c:1032:39: error: PMIX_QUERY_NUM_PSETS undeclared (first use in this function); did you mean PMIX_QUERY_NAMESPACES?
 1032 |     ompi_instance_refresh_pmix_psets (PMIX_QUERY_NUM_PSETS);
      |                                       ^~~~~~~~~~~~~~~~~~~~
      |                                       PMIX_QUERY_NAMESPACES
make[2]: *** [Makefile:2595: runtime/ompi_rte.lo] Error 1
instance/instance.c: In function ompi_instance_get_nth_pset:
instance/instance.c:1041:43: error: PMIX_QUERY_PSET_NAMES undeclared (first use in this function); did you mean PMIX_QUERY_CREATE?
 1041 |         ompi_instance_refresh_pmix_psets (PMIX_QUERY_PSET_NAMES);
      |                                           ^~~~~~~~~~~~~~~~~~~~~
      |                                           PMIX_QUERY_CREATE
instance/instance.c: In function ompi_instance_group_pmix_pset:
instance/instance.c:1196:27: error: PMIX_PSET_NAME undeclared (first use in this function); did you mean PMIX_RM_NAME?
 1196 |         rc = PMIx_Get(&p, PMIX_PSET_NAME, NULL, 0, &pval);
      |                           ^~~~~~~~~~~~~~
      |                           PMIX_RM_NAME
make[2]: *** [Makefile:2595: instance/instance.lo] Error 1
make[2]: Leaving directory '/dev/shm/OpenMPI/5.0.0/GCC-10.3.0/openmpi-5.0.0rc6/ompi'
make[1]: *** [Makefile:2702: all-recursive] Error 1
make[1]: Leaving directory '/dev/shm/OpenMPI/5.0.0/GCC-10.3.0/openmpi-5.0.0rc6/ompi'
make: *** [Makefile:1484: all-recursive] Error 1

robert-mijakovic avatar Apr 29 '22 12:04 robert-mijakovic

@robert-mijakovic Thanks for the report.

The main and v5.0.x branches do not support ORTE - it has been removed from the code base. The only options for building ompi v5.0.x are with PRRTE or without any RTE at all (the --without-prrte configure option). If you would like to run a stable version of OMPI with ORTE + PMIx v3.2.3, your only option is the v4/v4.1 series.

That being said, building with PMIx v3.2.3 on v5.0.x should probably work, and will need to be looked into.

awlauria avatar May 02 '22 22:05 awlauria

@awlauria Thanks for the explanation.

There are still some configure-time options in the Open MPI 5.0.x branch that suggest it is still possible to use ORTE, for example:

  1. --enable-orterun-prefix-by-default - it is still possible to enable the ORTE run prefix.
  2. --without-prrte - it is also possible to disable PRRTE, implying that another RTE should be able to take over.
  3. contrib/platform - several platform files still carry configuration options suggesting ORTE is available, for instance mellanox/optimized sets enable_orterun_prefix_by_default=yes.

Which RTEs are available for Open MPI aside from PRRTE and the retired ORTE?

robert-mijakovic avatar May 03 '22 09:05 robert-mijakovic

--enable-orterun-prefix-by-default is deprecated, and an alias for --enable-prte-prefix-by-default. We kept it around for backwards compatibility reasons.

You are correct that there are still references to orte in the contrib files that likely need to be updated to their prte equivalents (or just removed completely). Thanks for noticing this, it should get cleaned up.

Regarding alternative launchers to prrte, you should be able to use SLURM's srun, as documented here: https://docs.open-mpi.org/en/v5.0.x/running-apps/quickstart.html?highlight=srun#using-the-scheduler-to-direct-launch-without-mpirun
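
For concreteness, a direct-launch invocation typically looks like the command fragment below (not runnable outside a Slurm cluster; the binary name is a placeholder, and you can verify which PMIx plugin names your Slurm offers with `srun --mpi=list`):

```shell
# Direct launch under Slurm, bypassing mpirun/prte entirely.
# Assumes Slurm was built with its PMIx plugin; ./mpi_hello is a placeholder.
srun --mpi=pmix -N 2 -n 8 ./mpi_hello
```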

awlauria avatar May 03 '22 14:05 awlauria

@robert-mijakovic regarding support for v3.2.3 PMIx for OMPI v5 - it is a complex issue that will need to get sorted prior to release. There is some internal discussion about whether this should/can work and which versions of PMIx will be supported.

We'll update the ticket with more details as they come in.

This issue has brought to light that the required dependencies for ompi are not currently documented. I opened https://github.com/open-mpi/ompi/issues/10345 to track this gap.

awlauria avatar May 03 '22 15:05 awlauria

I'll note that with commit ff7fd20eb74003dc9524959a278a9420d763a7b1 it seems that 5.0.0rc7 now requires an as-yet-unreleased version of PMIx to define PMIX_LOG_AGG.

opoplawski avatar May 16 '22 01:05 opoplawski

Thanks @opoplawski, that is a bug. I'll submit a patch to fix that.

awlauria avatar May 16 '22 12:05 awlauria

The PMIX_LOG_AGG issue was fixed in main #10393 and v5.0.x #10394.

jsquyres avatar May 16 '22 19:05 jsquyres

will be fixed via #10371

hppritcha avatar May 17 '22 15:05 hppritcha

> The PMIX_LOG_AGG issue was fixed in main https://github.com/open-mpi/ompi/pull/10393 and v5.0.x https://github.com/open-mpi/ompi/pull/10394.

I don't think you want to do that - it means you have no aggregation, which is exactly what you said you didn't want to do. The correct solution is to update the submodule pointer and configure to require PMIx v4.1.3 or above.

rhc54 avatar May 28 '22 14:05 rhc54

> @robert-mijakovic regarding support for v3.2.3 PMIx for OMPI v5 - it is a complex issue that will need to get sorted prior to release. There is some internal discussion going on of whether this should/can work and what versions of PMIx will be supported.

There really is no way to fully support PMIx versions prior to v4.1.3. For one thing, PRRTE will never do so, which means you immediately lose mpirun - and with it several new OMPI features, including fault tolerance and MPI sessions.

The --without-prrte option means that you are only going to use OMPI in direct launch scenarios (e.g., using srun with Slurm), so use of that option acknowledges the disablement of mpirun. It might be possible for the MPI and OPAL layers to add enough #define protection to allow OMPI to compile with earlier PMIx versions - but it will get messy. Up to the OMPI folks to decide if that's something they want to do. Note that building OMPI against a PMIx v4.1.3+ has no impact on the version used for Slurm due to cross-version support.

Also note that Slurm has been updated to build/support PMIx v4+ versions - it was just a configure issue and they finally fixed it. Not sure how far back they ported that fix, so you might ask them.

rhc54 avatar May 28 '22 14:05 rhc54

https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/required-support-libraries.html#required-support-libraries

FYI: Open MPI 5 requires a minimum OpenPMIx version of 4.2.0.
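
A quick pre-flight check against that minimum can be sketched with `sort -V`. The hard-coded `have` value below stands in for whatever your PMIx install actually reports (for instance via `pkg-config --modversion pmix`, if your install ships the .pc file):

```shell
#!/bin/sh
# Compare an installed PMIx version against Open MPI 5's documented 4.2.0 minimum.
have="3.2.3"   # stand-in; on a real system query your PMIx install instead
need="4.2.0"
# sort -V orders version strings; if the older of the two is $need, $have passes
lowest=$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n1)
if [ "$lowest" = "$need" ]; then
    echo "PMIx $have satisfies the $need minimum"
else
    echo "PMIx $have is older than the required $need"
fi
```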

wenduwan avatar Mar 14 '24 16:03 wenduwan

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] avatar Mar 28 '24 17:03 github-actions[bot]

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions[bot] avatar Apr 11 '24 17:04 github-actions[bot]