MTT ibm one-sided test failures for ompi v5.0.x

Open shijin-aws opened this issue 3 years ago • 6 comments

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./autogen.pl
./configure --prefix=<prefix> CFLAGS=-pipe --enable-picky --enable-debug --enable-mpi1-compatibility
make -j install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

[ec2-user@ip-172-31-8-95 ompi]$ git submodule status
 d3445c8fb15cfc4a03cfee27593ca1fe1a6d67ab 3rd-party/openpmix (v4.1.2-50-gd3445c8f)
 f3828e8307cf95d67a64eeaa4e36a362ac01e075 3rd-party/prrte (v2.0.2-71-gf3828e8307)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2

Details of the problem

There are around 40 failures in the ibm test suite for ompi v5.0.x on the TCP path. The full test report can be found in this MTT report.

shijin-aws avatar Apr 07 '22 22:04 shijin-aws

There seem to be multiple issues. The following PR fixes one of them:

https://github.com/open-mpi/ompi/pull/10462

With this PR, 1sided passes.

wzamazon avatar Jun 09 '22 23:06 wzamazon

Another PR

https://github.com/open-mpi/ompi/pull/10463

This fixes the segfaults in pp_1sided and halo_1sided_put_alloc_mem.

wzamazon avatar Jun 10 '22 02:06 wzamazon

The c_accumulate hang with EFA turns out to be a bug in the libfabric EFA installer. The fix is in https://github.com/ofiwg/libfabric/pull/7829. It will take a while for MTT to pick up the change.

wzamazon avatar Jun 13 '22 12:06 wzamazon

Remaining issues are:

  1. c_put_dynamic_self/c_get_dynamic_self always hang, even with 2 ranks.
  2. When btl/tcp is used, c_get_accumulate_ddt1 and c_get_accumulate_ddt2 segfault.
  3. When btl/tcp is used, c_accumulate is quite slow; it is not clear whether that is normal.

wzamazon avatar Jun 13 '22 12:06 wzamazon

The c_put_dynamic_self/c_get_dynamic_self hang will be fixed by PR https://github.com/open-mpi/ompi/pull/10473

wzamazon avatar Jun 15 '22 21:06 wzamazon

Remaining issues:

With btl/ofi, mt_1sided segfaults.

With btl/tcp:

  1. Multiple tests (1sided, c_accumulate, etc.) hang.
  2. c_get_accumulate_ddt1 and c_get_accumulate_ddt2 segfault.

wzamazon avatar Jul 03 '22 03:07 wzamazon
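
The remaining failures above fall into two classes, hangs and segfaults. When re-running individual tests by hand, a small wrapper can tell the two apart. The following is a minimal sketch, not part of MTT or the ibm test suite: the `classify` function and the `TIMEOUT_SECS` variable are hypothetical names, it relies on GNU `timeout` from coreutils, and it assumes that a rank killed by SIGSEGV surfaces as exit status 139 (128 + 11), which may not hold for every launcher configuration.

```shell
#!/usr/bin/env bash
# classify CMD [ARGS...]
# Runs a command under a time limit and prints one of:
#   pass, hang, segfault, fail(<exit status>)
# TIMEOUT_SECS is a hypothetical knob for this sketch (default: 60 seconds).
classify() {
    local limit="${TIMEOUT_SECS:-60}"

    # GNU timeout kills the command once the limit expires.
    timeout "$limit" "$@" >/dev/null 2>&1
    local rc=$?

    case "$rc" in
        0)   echo "pass" ;;
        124) echo "hang" ;;      # GNU timeout exits with 124 on expiry
        139) echo "segfault" ;;  # 128 + SIGSEGV (11); assumed to propagate from the launcher
        *)   echo "fail($rc)" ;;
    esac
}
```

For example, assuming the test binary has been built in the current directory, `classify mpirun --mca btl tcp,self -np 2 ./c_get_accumulate_ddt1` would be expected to print `segfault` if the crash reproduces, while a test that never finishes would print `hang` after the time limit.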