ULFM: new functions MPI_COMM_ACK/GET_FAILED (as per latest standard draft)
This PR adds the new functions introduced during the May 2022 MPI Forum plenary reading.
These functions provide a simplified interface to query the failed processes in a communicator and to re-enable the completion of any-source operations. They are designed so that their usage can be intermixed with the deprecated failure_ack/failure_get_acked interfaces.
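For illustration, a minimal sketch of how the new calls are intended to be used together (my example, not part of the PR; error handling elided, and the num_to_ack convention is per the draft as I understand it):

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_-prefixed ULFM extensions */

/* Called after an operation on comm raised MPI_ERR_PROC_FAILED. */
static void acknowledge_and_query(MPI_Comm comm)
{
    int size, num_acked;
    MPI_Group failed;

    /* Acknowledge all failures known so far; this re-enables the
     * completion of pending MPI_ANY_SOURCE receives. */
    MPI_Comm_size(comm, &size);
    MPIX_Comm_ack_failed(comm, size, &num_acked);

    /* Obtain the group of processes known to have failed. */
    MPIX_Comm_get_failed(comm, &failed);
    /* ... compare against comm's group to identify the dead ranks ... */
    MPI_Group_free(&failed);
}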
The IBM CI (GNU/Scale) build failed! Please review the log, linked below.
Gist: https://gist.github.com/e01844601a023c686fff31ed7b47f01c
The IBM CI (XL) build failed! Please review the log, linked below.
Gist: https://gist.github.com/09144ef0bf4021cc724775a15cdca8ec
The IBM CI (PGI) build failed! Please review the log, linked below.
Gist: https://gist.github.com/47422123a3609342e90157d42d09b25b
The IBM CI (GNU/Scale) build failed! Please review the log, linked below.
Gist: https://gist.github.com/145d47a8987a71176be771975052cd64
The IBM CI (XL) build failed! Please review the log, linked below.
Gist: https://gist.github.com/c568a6de2f8ebf6c83df07a96bc93b35
The IBM CI (PGI) build failed! Please review the log, linked below.
Gist: https://gist.github.com/6da0b241bcd7358d7c9355d9aeda82c5
@abouteiller Do you need to update the docs?
Also, this PR should be rebased so that it picks up the latest github actions (which will allow it to pass the required CI).
@bosilca @abouteiller There is some chance I messed something up, but mpi4py tests are failing [logs]. Please note that the failure happens in singleton MPI init mode (i.e. no mpiexec).
NOTE: This build is using repo & branch mpi4py/mpi4py@feature/MPI-5.
testAckFailed (test_ulfm.TestULFMSelf) ... python: communicator/ft/comm_ft.c:195: ompi_comm_ack_failed_internal: Assertion `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (failed_group))->obj_magic_id' failed.
[fv-az436-728:129926] *** Process received signal ***
[fv-az436-728:129926] Signal: Aborted (6)
[fv-az436-728:129926] Signal code: (-6)
[fv-az436-728:129926] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f42c79d4520]
[fv-az436-728:129926] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f42c7a28a7c]
[fv-az436-728:129926] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f42c79d4476]
[fv-az436-728:129926] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f42c79ba7f3]
[fv-az436-728:129926] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f42c79ba71b]
[fv-az436-728:129926] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f42c79cbe96]
[fv-az436-728:129926] [ 6] /usr/local/lib/libmpi.so.0(ompi_comm_ack_failed_internal+0x2fc)[0x7f42c64643f0]
[fv-az436-728:129926] [ 7] /usr/local/lib/libmpi.so.0(MPIX_Comm_ack_failed+0x14c)[0x7f42c670e98c]
...
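For reference, my guess at the minimal kind of call that trips this assertion (reconstructed from the failing test name; not the actual mpi4py test code):

#include <mpi.h>
#include <mpi-ext.h>

int main(void)
{
    int num_acked;
    MPI_Init(NULL, NULL);   /* launched directly, i.e. singleton init, no mpiexec */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
    /* With no failures this should simply report num_acked == 0, but in
     * singleton mode it aborts on the obj_magic_id assertion instead. */
    MPIX_Comm_ack_failed(MPI_COMM_SELF, 1, &num_acked);
    MPI_Finalize();
    return 0;
}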
I can't get your branch to work, it fails with
$ python3.11 test/test_ulfm.py
Traceback (most recent call last):
File "test_ulfm.py", line 1, in <module>
from mpi4py import MPI
ImportError: dlopen(Python/3.11/lib/python/site-packages/mpi4py/MPI.cpython-311-darwin.so, 0x0002): symbol not found in flat namespace '_MPIX_Comm_ack_failed'
I guess we need to build it upon this branch? I wonder why you did not add the discovery of the ULFM functions in setup.py (where you have MPI_Type_create_f90_integer)?
I can't get your branch to work, it fails with
You get that error because your libmpi.dylib does not have MPIX_Comm_ack_failed.
I guess we need to build it upon this branch?
Do you mean Aurelien's branch, i.e. this PR? Definitely!
My whole point of triggering the build was to try the new ack_failed/get_failed APIs being added in this PR.
I wonder why you did not add the discovery of the ULFM functions in setup.py (where you have MPI_Type_create_f90_integer)?
I usually reserve that discovery for workarounds related to broken libraries.
Moreover, if I had implemented such discovery, you would not have noticed you were using the wrong branch :wink:
Right, I was using the 5.0-rc without this patch. It would be nice if, instead of that error, the Python wrapper raised a not-implemented exception (like you have for the MPI_Type_create_f90_integer function). This would allow people to use a non-compliant library for as long as they don't call any of the missing functions. But I'm being picky ;)
In my own testing I get an error that looks related to coll_han.
[b00:45954:0:45954] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x150)
/home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/communicator/communicator.h: [ ompi_comm_size() ]
...
503 */
504 static inline int ompi_comm_size (const ompi_communicator_t* comm)
505 {
==> 506 return comm->c_local_group->grp_proc_count;
507 }
0 0x000000000020bfdf ompi_comm_size() /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/communicator/communicator.h:506
1 0x000000000020c6bb mca_coll_han_comm_create_new() /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/mca/coll/han/coll_han_subcomms.c:127
2 0x00000000001e7093 mca_coll_han_barrier_intra_simple() /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/mca/coll/han/coll_han_barrier.c:36
3 0x0000000000208d0e mca_coll_han_barrier_intra_dynamic() /home/bouteill/ompi/master.debug/ompi/mca/coll/han/../../../../../master/ompi/mca/coll/han/coll_han_dynamic.c:800
4 0x00000000000cdfb0 PMPI_Barrier() /home/bouteill/ompi/master.debug/ompi/mpi/c/../../../../master/ompi/mpi/c/barrier.c:76
5 0x000000000040115c main() /home/bouteill/ompi/ompi-tests-public/ulfm-testing/api/getack.c:80
Let me take a look at this; han should never be involved in any collective related to a flavor of MPI_COMM_SELF.
I am running the non-singleton case for that test. The han_comm_create_new procedure has no error checking, which means that when the SPLIT_TYPE fails at line 121 (there are dead procs) it goes on to dereference the never-created low_comm. We'll need to do a scrub of han to check for error cases, but that's not related to the initial report.
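The missing defensive pattern, as a generic sketch (hypothetical names; not the actual han code):

#include <mpi.h>

/* Any communicator produced by a fallible collective must be checked
 * before use; with dead processes SPLIT_TYPE can legitimately fail. */
static int create_low_comm(MPI_Comm comm, MPI_Comm *low_comm)
{
    int rc = MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                                 MPI_INFO_NULL, low_comm);
    if (MPI_SUCCESS != rc) {
        *low_comm = MPI_COMM_NULL;  /* never dereference a comm that was not created */
        return rc;
    }
    return MPI_SUCCESS;
}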
@bosilca Sorry, but maybe I'm not on the same page as you. First, the branch I used to trigger the build is not definitive. Second, once this PR gets merged, things will work out of the box with the next v5.0 rc.
However, there is still one thing I'm not sure about regarding mpi4py support: once this PR is merged, Open MPI will expose the MPI-5.0 APIs (with the MPIX_ prefix). Is it OK to just rely on and support the new APIs and completely forget about the old ones (mpi4py would then raise NotImplementedError)? Or should we at least try to emulate the new MPIX_Comm_ack_failed() with the old APIs (as currently done in my branch), but just forget about MPIX_Comm_get_failed() (again, NotImplementedError) because it cannot be implemented with the old APIs?
Both old and new APIs will remain available within Open MPI itself for the time being; no need to emulate them on your end. The old APIs will not be standardized (they were provisional extensions) and will eventually be phased out.
My take: for the long term, only one of these two APIs is blessed by the standard, and that one shall survive. The other will be deprecated once the MPI implementations stop supporting it (and raise NotImplementedError).
I am running the non-singleton case for that test. The han_comm_create_new procedure has no error checking, which means that when the SPLIT_TYPE fails at line 121 (there are dead procs) it goes on to dereference the never-created low_comm. We'll need to do a scrub of han to check for error cases, but that's not related to the initial report.
Bummer, I hadn't thought about the case where the underlying communicator already has a faulty process and cannot perform collectives. Nice catch!
Folks, would you please consider adding two #define macros to mpiext_ftmpi_c.h to advertise the availability of the new routines? Let's say:
#define OMPI_HAVE_MPIX_COMM_GET_FAILED 1
#define OMPI_HAVE_MPIX_COMM_ACK_FAILED 1
This would make things easier for mpi4py. Moreover, I guess other ULFM users may also find it very handy to conditionally compile using the C preprocessor, without having to add additional configure tests; see the sketch below.
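For instance, a downstream user could guard their code like this (a sketch assuming the proposed macro names):

#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>
#endif

/* Returns the number of known-failed processes in comm, or -1 if the
 * new ULFM query API is not available in this MPI library. */
static int count_failed(MPI_Comm comm)
{
#if defined(OMPI_HAVE_MPIX_COMM_GET_FAILED)
    MPI_Group failed;
    int n;
    MPIX_Comm_get_failed(comm, &failed);
    MPI_Group_size(failed, &n);
    MPI_Group_free(&failed);
    return n;
#else
    (void)comm;
    return -1;
#endif
}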
@abouteiller and @dalcinl I am unable to replicate your issues with the current version (and I merged two pending PRs related to ULFM last Friday), neither on my laptop (OSX) nor on our cluster (x86 + UCX). Can you give me more info on how you configure, and maybe what MCA params you are using?
@bosilca Here you go; this is what I use in GitHub Actions, click the link to confirm.
[configure]
./configure --prefix=/home/devel/mpi/openmpi/dev --without-ofi --without-ucx --without-psm2 --with-pmix=internal --with-prrte=internal --with-libevent=internal --with-hwloc=internal --enable-debug --enable-mem-debug --disable-man-pages --disable-sphinx
[MCA parameters]
# ~/.openmpi/mca-params.conf
mpi_param_check = 1
mpi_show_handle_leaks = 1
# ~/.prte/mca-params.conf
rmaps_default_mapping_policy = :oversubscribe
I'm still not able to reproduce, but I think I know why. The Python ULFM tester skips 4 tests, all related to the new functionality. It does use the correct MPI (i.e. one that includes the 2 new functions MPI_Comm_get_failed and MPI_Comm_ack_failed), but they are not correctly picked up by the mpi4py PR.
The Python ULFM tester skips 4 tests, all related to the new functionality.
Are you using GH repo mpi4py/mpi4py at branch feature/MPI-5 (not your clone!)? Please fetch that repo and checkout/hard-reset to that branch, then make sure you build with Aurelien's branch (GH repo abouteiller/ompi-aurelien at branch ulfm/get_failed). Maybe you have a wrong $PATH and mpi4py is picking up the wrong mpicc?
I fired a new build here just to make sure I'm not messing up.
@bosilca Oh man, sorry, I messed up indeed. Please git fetch and git reset --hard the mpi4py repo at branch feature/MPI-5. Now src/lib-mpi/ulfm.h has the right guard on the major ompi version:
#if defined(OPEN_MPI)
#include <mpi-ext.h>
#ifdef OMPI_HAVE_MPI_EXT_FTMPI
#define PyMPI_HAVE_MPIX_ERR_REVOKED 1
#define PyMPI_HAVE_MPIX_ERR_PROC_FAILED 1
#define PyMPI_HAVE_MPIX_ERR_PROC_FAILED_PENDING 1
#define PyMPI_HAVE_MPIX_Comm_revoke 1
#define PyMPI_HAVE_MPIX_Comm_is_revoked 1
#define PyMPI_HAVE_MPIX_Comm_failure_ack 1
#define PyMPI_HAVE_MPIX_Comm_failure_get_acked 1
#define PyMPI_HAVE_MPIX_Comm_agree 1
#define PyMPI_HAVE_MPIX_Comm_iagree 1
#define PyMPI_HAVE_MPIX_Comm_shrink 1
#if OMPI_MAJOR_VERSION >= 5 && OMPI_MAJOR_VERSION < 10
#define PyMPI_HAVE_MPIX_Comm_get_failed 1
#define PyMPI_HAVE_MPIX_Comm_ack_failed 1
#endif
#endif
#endif
PS: Please look at my previous request to add preprocessor macros advertising the new APIs.
@dalcinl I found a typo in your commit that prevented things from working properly on my side. I tried to push to your branch but I'm not allowed, so here is the patch.
commit 7838dcec36a5781ec7a46ded1b0cc56b0b94b8eb
Author: George Bosilca <[email protected]>
Date: Mon Feb 20 19:35:38 2023 -0500
Update the MPI_Comm_get_failed support.
Signed-off-by: George Bosilca <[email protected]>
diff --git a/src/lib-mpi/mpiulfm.h b/src/lib-mpi/mpiulfm.h
index 8b99a0a1..abdb7531 100644
--- a/src/lib-mpi/mpiulfm.h
+++ b/src/lib-mpi/mpiulfm.h
@@ -85,6 +85,7 @@
#ifndef PyMPI_HAVE_MPI_Comm_get_failed
#ifdef PyMPI_HAVE_MPIX_Comm_get_failed
+#undef MPI_Comm_get_failed
#define MPI_Comm_get_failed MPIX_Comm_get_failed
#endif
#endif
With this I was able to run the tests and pinpoint the issue in this PR. Fix is coming soon.
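For the record, my reading of why the missing #undef mattered (an assumption about mpi4py's internals: its fallback headers map functions that were not detected to PyMPI_-prefixed stubs):

/* earlier, in the fallback headers (assumed): */
#define MPI_Comm_get_failed PyMPI_Comm_get_failed
/* later, in mpiulfm.h: redefining a macro with a different body is a
 * constraint violation in C, so the stub mapping must be removed first. */
#undef  MPI_Comm_get_failed
#define MPI_Comm_get_failed MPIX_Comm_get_failed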
I found a typo in your commit that prevented things from working properly on my side
Of course! I should have tried it on my side; I wouldn't have wasted your time, sorry. I applied your fix and force-pushed to my branch again.
Looks like your commit fixed things [full logs]
...
2023-02-21T05:08:05.4204141Z testAckFailed (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4206682Z testAgree (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4209144Z testGetFailed (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4211765Z testIAgree (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4214251Z testLegacyAck (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4216654Z testLegacyGetAcked (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4219319Z testRevoke (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4221997Z testShrink (test_ulfm.TestULFMSelf) ... ok
2023-02-21T05:08:05.4224849Z testAckFailed (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4227299Z testAgree (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4229634Z testGetFailed (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4232103Z testIAgree (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4234512Z testLegacyAck (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4236915Z testLegacyGetAcked (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4239429Z testRevoke (test_ulfm.TestULFMWorld) ... ok
2023-02-21T05:08:05.4242103Z testShrink (test_ulfm.TestULFMWorld) ... ok
...
Hello! The Git Commit Checker CI bot found a few problems with this PR:
8bbeb4db: Add #defines for the new API.
- check_signed_off: does not contain a valid Signed-off-by line
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
@dalcinl The requested defines are in. @abouteiller the test seems to pass for me, feel free to merge this PR once all the github actions passed.
Folks, please ping me once you submit a new PR to branch v5.0.x.
bot:ibm:retest
@abouteiller Do you need to update the docs?
The docs are in #10781, I'll actually mark it as ready when this is merged.