
ULFM: new functions MPI_COMM_ACK/GET_FAILED (as per latest standard draft)

abouteiller opened this pull request 2 years ago

This PR adds the new functions introduced during the May 2022 MPI Forum plenary reading.

These functions provide a simplified interface to query the failed processes in a communicator and to re-enable the completion of any-source operations. They are designed so that their usage can be intermixed with the deprecated failure_ack/failure_get_acked interfaces.
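
A minimal usage sketch (assuming the MPIX_-prefixed draft signatures MPIX_Comm_ack_failed(comm, num_to_ack, &num_acked) and MPIX_Comm_get_failed(comm, &failed_group); error handling omitted):

#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

/* Acknowledge every failure currently known on comm (re-enabling any-source
 * completions) and report the group of failed processes. */
void report_failures(MPI_Comm comm)
{
    int size, num_acked, num_failed;
    MPI_Group failed_group;

    MPI_Comm_size(comm, &size);
    /* Acknowledge up to 'size' failures; num_acked returns how many were
     * actually acknowledged. */
    MPIX_Comm_ack_failed(comm, size, &num_acked);

    /* Query the group of processes currently known to have failed. */
    MPIX_Comm_get_failed(comm, &failed_group);
    MPI_Group_size(failed_group, &num_failed);
    printf("acknowledged %d failures, %d processes in the failed group\n",
           num_acked, num_failed);
    MPI_Group_free(&failed_group);
}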

abouteiller avatar Jun 23 '22 11:06 abouteiller

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e01844601a023c686fff31ed7b47f01c

ibm-ompi avatar Jun 23 '22 11:06 ibm-ompi

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/09144ef0bf4021cc724775a15cdca8ec

ibm-ompi avatar Jun 23 '22 11:06 ibm-ompi

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/47422123a3609342e90157d42d09b25b

ibm-ompi avatar Jun 23 '22 11:06 ibm-ompi

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/145d47a8987a71176be771975052cd64

ibm-ompi avatar Jun 27 '22 18:06 ibm-ompi

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/c568a6de2f8ebf6c83df07a96bc93b35

ibm-ompi avatar Jun 27 '22 18:06 ibm-ompi

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/6da0b241bcd7358d7c9355d9aeda82c5

ibm-ompi avatar Jun 27 '22 18:06 ibm-ompi

@abouteiller Do you need to update the docs?

Also, this PR should be rebased so that it picks up the latest GitHub Actions (which will allow it to pass the required CI).

jsquyres avatar Feb 15 '23 19:02 jsquyres

@bosilca @abouteiller There is some chance I messed something up, but the mpi4py tests are failing [logs]. Please note that the failure happens in singleton MPI init mode (i.e. no mpiexec).

NOTE: This build is using repo&branch mpi4py/mpi4py@feature/MPI-5.

testAckFailed (test_ulfm.TestULFMSelf) ... python: communicator/ft/comm_ft.c:195: ompi_comm_ack_failed_internal: Assertion `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (failed_group))->obj_magic_id' failed.
[fv-az436-728:129926] *** Process received signal ***
[fv-az436-728:129926] Signal: Aborted (6)
[fv-az436-728:129926] Signal code:  (-6)
[fv-az436-728:129926] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f42c79d4520]
[fv-az436-728:129926] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f42c7a28a7c]
[fv-az436-728:129926] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f42c79d4476]
[fv-az436-728:129926] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f42c79ba7f3]
[fv-az436-728:129926] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f42c79ba71b]
[fv-az436-728:129926] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f42c79cbe96]
[fv-az436-728:129926] [ 6] /usr/local/lib/libmpi.so.0(ompi_comm_ack_failed_internal+0x2fc)[0x7f42c64643f0]
[fv-az436-728:129926] [ 7] /usr/local/lib/libmpi.so.0(MPIX_Comm_ack_failed+0x14c)[0x7f42c670e98c]
...

dalcinl avatar Feb 16 '23 21:02 dalcinl

I can't get your branch to work; it fails with:

$ python3.11 test/test_ulfm.py
Traceback (most recent call last):
  File "test_ulfm.py", line 1, in <module>
    from mpi4py import MPI
ImportError: dlopen(Python/3.11/lib/python/site-packages/mpi4py/MPI.cpython-311-darwin.so, 0x0002): symbol not found in flat namespace '_MPIX_Comm_ack_failed'

I guess we need to build it upon this branch? I wonder why you did not add the discovery of the ULFM functions in setup.py (where you have MPI_Type_create_f90_integer)?

bosilca avatar Feb 16 '23 21:02 bosilca

I can't get your branch to work, it fails with

You get that error because your libmpi.dylib does not have MPIX_Comm_ack_failed.

I guess we need to build it upon this branch?

Do you mean Aurelien's branch, i.e. this PR? Definitely! My whole point of triggering the build was to try the new ack_failed/get_failed APIs being added in this PR.

I wonder why you did not added the discovery of the ULFM functions in the setup.py (where you have MPI_Type_create_f90_integer) ?

I usually reserve that discovery for workarounds related to broken libraries.
Moreover, if I had implemented such discovery, you would not have noticed you were using the wrong branch :wink:

dalcinl avatar Feb 16 '23 21:02 dalcinl

Right, I was using the 5.0 rc without this patch. It would be nice if, instead of that error, the Python wrapper raised a not-implemented exception (like you have for the MPI_Type_create_f90_integer function). This would allow people to use a non-compliant library as long as they don't call any of the missing functions. But I'm being picky ;)

bosilca avatar Feb 16 '23 22:02 bosilca

In my own testing I get an error that looks related to coll_han.

[b00:45954:0:45954] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x150)
/home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/communicator/communicator.h: [ ompi_comm_size() ]
      ...
      503  */
      504 static inline int ompi_comm_size (const ompi_communicator_t* comm)
      505 {
==>   506     return comm->c_local_group->grp_proc_count;
      507 }
 0 0x000000000020bfdf ompi_comm_size()  /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/communicator/communicator.h:506
 1 0x000000000020c6bb mca_coll_han_comm_create_new()  /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/mca/coll/han/coll_han_subcomms.c:127
 2 0x00000000001e7093 mca_coll_han_barrier_intra_simple()  /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/mca/coll/han/coll_han_barrier.c:36
 3 0x0000000000208d0e mca_coll_han_barrier_intra_dynamic()  /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/mca/coll/han/coll_han_dynamic.c:800
 4 0x00000000000cdfb0 PMPI_Barrier()  /home/bouteill/ompi/master.debug/ompi/mpi/c/../../../../master/ompi/mpi/c/barrier.c:76
 5 0x000000000040115c main()  /home/bouteill/ompi/ompi-tests-public/ulfm-testing/api/getack.c:80

abouteiller avatar Feb 16 '23 22:02 abouteiller

Let me take a look at this; han should never be involved in any collective on a flavor of MPI_COMM_SELF.

bosilca avatar Feb 16 '23 22:02 bosilca

I am running the non-singleton case for that test. The han_comm_create_new procedure has no error checking, which means that when the SPLIT_TYPE at line 121 fails (there are dead procs) it goes on to dereference the never-created low_comm. We'll need to do a scrub of HAN to check for error cases, but that's not related to the initial report.
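
Roughly, the missing check looks like this (a hypothetical sketch at the plain-MPI level, not the actual coll_han code):

/* Hypothetical sketch, not the actual coll_han code: check the outcome of
 * the split before using the resulting subcommunicator. */
MPI_Comm low_comm = MPI_COMM_NULL;
int rc = MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                             MPI_INFO_NULL, &low_comm);
if (MPI_SUCCESS != rc || MPI_COMM_NULL == low_comm) {
    /* With dead procs in comm the split can fail; bail out instead of
     * dereferencing a never-created communicator. */
    return rc;
}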

abouteiller avatar Feb 16 '23 22:02 abouteiller

@bosilca Sorry, but maybe I'm not on the same page as you. First, the branch I used to trigger the build is not definitive. Second, once this PR gets merged, things will work out of the box with the next v5.0 rc.

However, there is still one thing I'm not sure about regarding mpi4py support: once this PR is merged, Open MPI will expose the MPI-5.0 APIs (with the MPIX_ prefix). Is it OK to just rely on and support the new APIs, completely forget about the old ones, and have mpi4py raise NotImplementedError for them? Or should we at least try to emulate the new MPIX_Comm_ack_failed() with the old APIs (as currently done in my branch), but just forget about MPIX_Comm_get_failed() (again, NotImplementedError) because it cannot be implemented with the old APIs?
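
The emulation I have in mind is roughly along these lines (a sketch, not the actual code in my branch; note the legacy call acknowledges all currently known failures, so num_to_ack cannot be honored exactly):

/* Sketch of emulating the new call with the legacy ULFM API. */
static int emulated_ack_failed(MPI_Comm comm, int num_to_ack, int *num_acked)
{
    MPI_Group acked;
    int rc;
    (void)num_to_ack;  /* cannot be expressed with the legacy interface */
    rc = MPIX_Comm_failure_ack(comm);
    if (MPI_SUCCESS != rc) return rc;
    rc = MPIX_Comm_failure_get_acked(comm, &acked);
    if (MPI_SUCCESS != rc) return rc;
    MPI_Group_size(acked, num_acked);
    MPI_Group_free(&acked);
    return MPI_SUCCESS;
}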

dalcinl avatar Feb 16 '23 22:02 dalcinl

Both the old and the new APIs will remain available within Open MPI itself for the time being; no need to emulate them on your end. The old APIs will not be standardized (they were provisional extensions) and will eventually be phased out.

abouteiller avatar Feb 16 '23 22:02 abouteiller

My take: in the long term only one of these two APIs is blessed by the standard, and that one shall survive. The other will be deprecated once the MPI implementations stop supporting it (and will then raise NotImplementedError).

bosilca avatar Feb 16 '23 22:02 bosilca

I am running the non-singleton case for that test. The han_comm_create_new procedure has no error checking, which means that when the SPLIT_TYPE at line 121 fails (there are dead procs) it goes on to dereference the never-created low_comm. We'll need to do a scrub of HAN to check for error cases, but that's not related to the initial report.

Bummer, I hadn't thought about the case where the underlying communicator already has a faulty process and cannot perform collectives. Nice catch!

bosilca avatar Feb 16 '23 22:02 bosilca

Folks, would you please consider adding two #define macros to mpiext_ftmpi_c.h to advertise the availability of the new routines? Let's say:

#define OMPI_HAVE_MPIX_COMM_GET_FAILED 1
#define OMPI_HAVE_MPIX_COMM_ACK_FAILED 1

This would make things easier for mpi4py. Moreover, I guess other ULFM users may also find it very handy to just conditionally compile using the C preprocessor, without having to add extra configure tests.
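
For instance, user code could then guard the new calls with a plain preprocessor check (a hypothetical usage sketch assuming these macro names):

#include <mpi.h>
#include <mpi-ext.h>

/* Use the new call only when the extension advertises it; otherwise fall
 * back to the legacy acknowledge call (hypothetical sketch). */
static int ack_known_failures(MPI_Comm comm, int nprocs, int *num_acked)
{
#if defined(OMPI_HAVE_MPIX_COMM_ACK_FAILED)
    return MPIX_Comm_ack_failed(comm, nprocs, num_acked);
#else
    *num_acked = 0;
    return MPIX_Comm_failure_ack(comm);
#endif
}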

dalcinl avatar Feb 17 '23 08:02 dalcinl

@abouteiller and @dalcinl I am unable to replicate your issues with the current version (and I merged two pending PRs related to ULFM last Friday), neither on my laptop (OSX) nor on our cluster (x86 + UCX). Can you give me more info on how you configure, and maybe what MCA params you are using?

bosilca avatar Feb 20 '23 15:02 bosilca

@bosilca Here you go; this is what I use in GitHub Actions, click the link to confirm.

[configure]

./configure --prefix=/home/devel/mpi/openmpi/dev --without-ofi --without-ucx --without-psm2 --with-pmix=internal --with-prrte=internal --with-libevent=internal --with-hwloc=internal --enable-debug --enable-mem-debug --disable-man-pages --disable-sphinx

[MCA parameters]

# ~/.openmpi/mca-params.conf"
mpi_param_check = 1
mpi_show_handle_leaks = 1 

# ~/.prte/mca-params.conf
rmaps_default_mapping_policy = :oversubscribe

dalcinl avatar Feb 20 '23 15:02 dalcinl

I'm still not able to reproduce, but I think I know why. The Python ULFM tester skips 4 tests, all related to the new functionality. It does use the correct MPI (i.e. one that includes the 2 new functions MPI_Comm_get_failed and MPI_Comm_ack_failed), but they are not correctly picked up by the mpi4py PR.

bosilca avatar Feb 20 '23 16:02 bosilca

The Python ULFM tester skips 4 tests, all related to the new functionality.

Are you using the GH repo mpi4py/mpi4py at branch feature/MPI-5 (not your clone!)? Please fetch that repo and checkout/hard-reset to that branch, then make sure you build with Aurelien's branch (GH repo abouteiller/ompi-aurelien at branch ulfm/get_failed). Maybe you have a wrong $PATH and mpi4py is picking up the wrong mpicc?

I fired a new build here just to make sure I'm not messing up.

dalcinl avatar Feb 20 '23 19:02 dalcinl

@bosilca Oh, man, sorry, I messed up indeed. Please git fetch and git reset --hard the mpi4py repo at branch feature/MPI-5. Now src/lib-mpi/ulfm.h has the right guard on the major ompi version:

#if defined(OPEN_MPI)
#include <mpi-ext.h>
#ifdef OMPI_HAVE_MPI_EXT_FTMPI
#define PyMPI_HAVE_MPIX_ERR_REVOKED 1
#define PyMPI_HAVE_MPIX_ERR_PROC_FAILED 1
#define PyMPI_HAVE_MPIX_ERR_PROC_FAILED_PENDING 1
#define PyMPI_HAVE_MPIX_Comm_revoke 1
#define PyMPI_HAVE_MPIX_Comm_is_revoked 1
#define PyMPI_HAVE_MPIX_Comm_failure_ack 1
#define PyMPI_HAVE_MPIX_Comm_failure_get_acked 1
#define PyMPI_HAVE_MPIX_Comm_agree 1
#define PyMPI_HAVE_MPIX_Comm_iagree 1
#define PyMPI_HAVE_MPIX_Comm_shrink 1
#if OMPI_MAJOR_VERSION >= 5 && OMPI_MAJOR_VERSION < 10
#define PyMPI_HAVE_MPIX_Comm_get_failed 1
#define PyMPI_HAVE_MPIX_Comm_ack_failed 1
#endif
#endif
#endif

PS: Please look at my previous request of adding preprocessor macros to advertise the new APIs.

dalcinl avatar Feb 20 '23 19:02 dalcinl

@dalcinl I found a typo in your commit that prevented things from working properly on my side. I tried to push to your branch but I'm not allowed, so here is the patch.

commit 7838dcec36a5781ec7a46ded1b0cc56b0b94b8eb
Author: George Bosilca <[email protected]>
Date:   Mon Feb 20 19:35:38 2023 -0500

    Update the MPI_Comm_get_failed support.
    
    Signed-off-by: George Bosilca <[email protected]>

diff --git a/src/lib-mpi/mpiulfm.h b/src/lib-mpi/mpiulfm.h
index 8b99a0a1..abdb7531 100644
--- a/src/lib-mpi/mpiulfm.h
+++ b/src/lib-mpi/mpiulfm.h
@@ -85,6 +85,7 @@
 
 #ifndef PyMPI_HAVE_MPI_Comm_get_failed
 #ifdef PyMPI_HAVE_MPIX_Comm_get_failed
+#undef MPI_Comm_get_failed
 #define MPI_Comm_get_failed MPIX_Comm_get_failed
 #endif
 #endif

With this I was able to run the tests and pinpoint the issue in this PR. A fix is coming soon.

bosilca avatar Feb 21 '23 00:02 bosilca

I found a typo in your commit that prevented things from working properly on my side

Of course! I should have tried it on my side, I wouldn't have wasted your time, sorry. I applied your fix and force-pushed to my branch again.

Looks like your commit fixed things [full logs]

...
2023-02-21T05:08:05.4204141Z testAckFailed (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4206682Z testAgree (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4209144Z testGetFailed (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4211765Z testIAgree (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4214251Z testLegacyAck (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4216654Z testLegacyGetAcked (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4219319Z testRevoke (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4221997Z testShrink (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4224849Z testAckFailed (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4227299Z testAgree (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4229634Z testGetFailed (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4232103Z testIAgree (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4234512Z testLegacyAck (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4236915Z testLegacyGetAcked (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4239429Z testRevoke (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4242103Z testShrink (test_ulfm.TestULFMWorld) ... ok
...

dalcinl avatar Feb 21 '23 05:02 dalcinl

Hello! The Git Commit Checker CI bot found a few problems with this PR:

8bbeb4db: Add #defines for the new API.

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

github-actions[bot] avatar Feb 21 '23 20:02 github-actions[bot]

@dalcinl The requested defines are in. @abouteiller The test seems to pass for me; feel free to merge this PR once all the GitHub Actions have passed.

bosilca avatar Feb 21 '23 20:02 bosilca

Folks, please ping me once you submit a new PR to branch v5.0.x.

dalcinl avatar Feb 21 '23 21:02 dalcinl

bot:ibm:retest

jjhursey avatar Feb 21 '23 22:02 jjhursey

@abouteiller Do you need to update the docs?

The docs are in #10781; I'll actually mark it as ready when this is merged.

abouteiller avatar Feb 21 '23 23:02 abouteiller