ompi icon indicating copy to clipboard operation
ompi copied to clipboard

List of issues with OMPI master+ULT+PartCom

Open janciesko opened this issue 3 years ago • 21 comments

This tracks current issues for ULT- and PartCom support in OMPI.

ULT Support:

  • [x] Compilation fails on current master @ HEAD when configured to --with-threads={qthreads,argobots} with ../../opal/mca/threads/qthreads/threads_qthreads.h:29:10: fatal error: qthread.h: No such file or directory 29 | #include "qthread.h" Configure seems to be not setting -I/PATH_TO_ULT_LIB
  • [x] OMPI hangs when configured to --with-threads=argobots and argobots uses multiple streams and --map-by ppr:1:node thus using ucx and libevent. Requires https://github.com/shintaro-iwasaki/libevent/commits/2.0.22-abt in libevent. We need to merge that branch into libevent soon. Check why Qthreads works.
  • [x] MPI_Init_thread hangs on current master @ HEAD when configured to --with-threads={qthreads,argobots}. Maybe this is an issue in instance/instance.h after merging sessions-related code changes.
  • [x] UCX fails asserts on POW9 and ARM: Discussed this with Howard. An assert is failing in src/ucp/core/ucp_worker.c. I'll add more info once the latest OMPI version compiles.

Partitioned Communication:

  • [ ] Non equal numbers of send- and receive partitions leads to deadlock. Looking at a backtrace, it seems that the receiving rank waits for the last receive partition. MPI Partix is a reproducer when setting DEFAULT_RECV_SEND_PARTITION_RATIO=1 to another value such as 2.

General improvements:

  • [ ] Add ULT coverage to CI testing

Reproducers of all above:

git clone https://github.com/sandialabs/MPI-Partix.git
export PATH=$OMPI_INSTALL_PATH/bin
cmake ..  -DCMAKE_CXX_COMPILER=mpicxx -DQthreads_ROOT=$QTHREADS_INSTALL_PATH -DPartix_ENABLE_QTHREADS=ON
mpirun -np 2 --map-by ppr:1:node ./bench1

janciesko avatar Jun 08 '22 18:06 janciesko

@shintaro-iwasaki Can you have a look at this?

jsquyres avatar Jun 13 '22 19:06 jsquyres

As I left Argonne a year ago, I might not be the right person for this, but please let me know anything I can help. Jan should have known it.

(Currently, Yanfei Guo is the POC on the Argobots side.)

shintaro-iwasaki avatar Jun 14 '22 20:06 shintaro-iwasaki

@shintaro-iwasaki Gotcha. What's Yanfei Guo's github ID?

jsquyres avatar Jun 14 '22 21:06 jsquyres

@yfguo He is leading the MPICH project.

shintaro-iwasaki avatar Jun 16 '22 15:06 shintaro-iwasaki

@yfguo It looks like there are some errors with the qthreads implementation in Open MPI. @shintaro-iwasaki used to be the maintainer of this, but now -- perhaps by default? -- the ownership has apparently transferred to you...? We're having some compile issues as listed in this issue. Can you investigate?

jsquyres avatar Jun 16 '22 17:06 jsquyres

Sorry for interruption. I believe @janciesko owns the Qthreads/Open MPI part. For example, https://github.com/open-mpi/ompi/pull/8479 and https://github.com/open-mpi/ompi/pull/8500

I think Jan posted this issue to fix it later.

shintaro-iwasaki avatar Jun 16 '22 17:06 shintaro-iwasaki

Ah, ok. @janciesko Can you also update the owner.txt files so that we don't ping the wrong people in the future? 😄

jsquyres avatar Jun 16 '22 17:06 jsquyres

@janciesko is there any update on this? We may look to disable these for the v5.0.0 release if not completed in time, since they are new features it would not be a regression. That said, we would love to have them. We're targeting an end of summer release, so end of August/September timeframe. Thanks!

awlauria avatar Jul 26 '22 16:07 awlauria

@awlauria, thanks for clarifying the time scope for this. I'll look into it and report back.

janciesko avatar Jul 26 '22 16:07 janciesko

@bwbarrett, can you please take a look why ompi fails to compile when enabling the use of Qtheads or Argobots? The following commit introduced the error: https://github.com/open-mpi/ompi/commit/6767309af2c893ff2dfb22d901f4e2fb3b7a8f87.
Configure fails to set OPAL_QTHREADS_INCLUDE_PATH in ./opal/mca/threads/qthreads/threads_qthreads.h.in Same behavior applies to Argobots.

janciesko avatar Aug 05 '22 19:08 janciesko

I am working on a fix for the first item.

hppritcha avatar Aug 08 '22 20:08 hppritcha

This https://github.com/open-mpi/ompi/blob/96fadd9d6860bd1dc89f15e88a472f310ff13c89/ompi/instance/instance.c#L226 causes a deadlock as that lock is already taken by the executing thread. The lock is previously set here: https://github.com/open-mpi/ompi/blob/96fadd9d6860bd1dc89f15e88a472f310ff13c89/ompi/instance/instance.c#L800 https://github.com/open-mpi/ompi/pull/10720 ~~fixes item three~~.

janciesko avatar Aug 25 '22 21:08 janciesko

Item three needs support for recursive locks in Qthreads. This is in progress.

janciesko avatar Aug 30 '22 20:08 janciesko

~~We have added a draft PR for supporting recursive locks in Qthreads. https://github.com/Qthreads/qthreads/pull/111~~ With this feature, we resolve the deadlock reported in item three.

janciesko avatar Sep 08 '22 19:09 janciesko

Update: We have added a PR for supporting recursive locks in Qthreads. https://github.com/Qthreads/qthreads/pull/112

janciesko avatar Sep 16 '22 17:09 janciesko

Does Open MPI need to use some minimum version of Qthreads in order to ensure correctness? If so, it might be worthwhile to enforce that in the qthreads component's configure.m4.

jsquyres avatar Sep 16 '22 17:09 jsquyres

@jsquyres, yes and yes, we'd need a version check for qthreads to be >=1.18.

janciesko avatar Sep 21 '22 21:09 janciesko

#108032 fixes item three.

janciesko avatar Sep 21 '22 21:09 janciesko

I am marking issue 4 as complete as I cannot reproduce that issue anymore.

janciesko avatar Oct 10 '22 16:10 janciesko

~~I am marking issue 2 as complete as I cannot reproduce that issue anymore. In this case I am using the default downstream dep on libevent 2.1.12.~~ The issue persists and is specific to libevent. In lieu of the check box item I have created a new issue for libevent. I am marking checkbox 2 as complete.

janciesko avatar Oct 10 '22 18:10 janciesko

@jsquyres, addressing the last checkbox, what methodology would you propose to test OMPI and the threading MCA when configuring OMPI with a supported ULT lib? ASFAIK we do not test this. We could add a test similar to this one and test with a release version of both ULT libs?

janciesko avatar Oct 11 '22 16:10 janciesko