MTT ibm one-sided test failures for ompi v5.0.x
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.x branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
./autogen.pl
./configure --prefix=<prefix> CFLAGS=-pipe --enable-picky --enable-debug --enable-mpi1-compatibility
make -j install
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
[ec2-user@ip-172-31-8-95 ompi]$ git submodule status
d3445c8fb15cfc4a03cfee27593ca1fe1a6d67ab 3rd-party/openpmix (v4.1.2-50-gd3445c8f)
f3828e8307cf95d67a64eeaa4e36a362ac01e075 3rd-party/prrte (v2.0.2-71-gf3828e8307)
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2
Details of the problem
There are around 40 failures in the ibm test suite for ompi v5.0.x over the tcp path. The full test results can be found in this MTT report.
There seem to be multiple issues. The following PR fixed one:
https://github.com/open-mpi/ompi/pull/10462
With this PR, 1sided passes.
Another PR,
https://github.com/open-mpi/ompi/pull/10463
fixed the segfaults in pp_1sided and halo_1sided_put_alloc_mem.
The c_accumulate hang with EFA turned out to be a bug in the libfabric EFA installer. The fix is in https://github.com/ofiwg/libfabric/pull/7829. It will take a while for MTT to pick up the change.
Remaining issues are:
- c_put_dynamic_self/c_get_dynamic_self always hang, even with 2 ranks.
- When btl/tcp is used, c_get_accumulate_ddt1 and c_get_accumulate_ddt2 segfault.
- When btl/tcp is used, c_accumulate is quite slow; it is not clear whether this is normal.
The c_put_dynamic_self/c_get_dynamic_self hang will be fixed by PR https://github.com/open-mpi/ompi/pull/10473
Remaining issues:
With btl/ofi, mt_1sided segfaults.
With btl/tcp,
- multiple tests (1sided, c_accumulate, etc.) hang.
- c_get_accumulate_ddt1 and c_get_accumulate_ddt2 segfault.
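For anyone trying to reproduce the remaining failures outside MTT, a rough sketch of the command lines is below. It assumes the ibm one-sided test binaries are already built and sit in the current directory. The helper name `btl_cmd`, the rank count of 2, and the `tcp,self` / `ofi,self` BTL lists are illustrative choices, not taken from the MTT configuration.

```shell
# Sketch only: build mpirun command lines that pin the BTL for one test.
# btl_cmd is a hypothetical helper; binary locations are an assumption.
btl_cmd() {
    # $1 = BTL list (e.g. tcp,self), $2 = number of ranks, $3 = test binary
    printf 'mpirun -np %s --mca btl %s ./%s\n' "$2" "$1" "$3"
}

# Print the btl/tcp commands for the tests listed above as hanging or segfaulting.
for t in 1sided c_accumulate c_get_accumulate_ddt1 c_get_accumulate_ddt2; do
    btl_cmd tcp,self 2 "$t"
done

# Print the btl/ofi command for the mt_1sided segfault.
btl_cmd ofi,self 2 mt_1sided
```

To actually run one of the printed commands, wrapping it in `timeout` (for example `timeout 120 $(btl_cmd tcp,self 2 c_accumulate)`) keeps a hang from blocking the shell.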