ompi icon indicating copy to clipboard operation
ompi copied to clipboard

Update MCA mutexes to use the qthreads ULT backend

Open janciesko opened this issue 2 years ago • 6 comments

  • This replaces the use of opal atomics with the use of qthread's atomics in the opal/thread MCA if ompi is configured with qthreads.
  • This resolves the deadlock reported originating from expecting recursive locks to work (https://github.com/open-mpi/ompi/issues/10459).
  • Requires Qthreads version >=1.18 as we added support for recursive locks in that version. I will issue a PR for the configury changes to version check.

Signed-off-by: Jan Ciesko [email protected]

janciesko avatar Sep 21 '22 21:09 janciesko

Can one of the admins verify this patch?

ompiteam-bot avatar Sep 21 '22 21:09 ompiteam-bot

ok to test

gpaulsen avatar Sep 21 '22 21:09 gpaulsen

@janciesko Would you be able to rebase your branch on main somewhere after 7dbfbeea - build: Use open-mpi/oac for oac submodule commit? We're having an issue with the IBM CI when it tries to test a Pull Request that doesn't include that commit.

gpaulsen avatar Sep 21 '22 21:09 gpaulsen

@janciesko And be sure to see https://www.mail-archive.com/[email protected]/msg21421.html

jsquyres avatar Sep 21 '22 21:09 jsquyres

bot:ibm:retest

gpaulsen avatar Sep 21 '22 21:09 gpaulsen

@gpaulsen, rebased.

janciesko avatar Sep 21 '22 21:09 janciesko

/azp run

jsquyres avatar Sep 23 '22 10:09 jsquyres

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Sep 23 '22 10:09 azure-pipelines[bot]

../../opal/mca/threads/pthreads/threads_pthreads_mutex.h: In function 'opal_thread_internal_mutex_lock':
../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:113:9: error: implicit declaration of function 'opal_show_help' [-Werror=implicit-function-declaration]
         opal_show_help("help-opal-threads.txt", "mutex lock failed", true);

You need to #include "opal/util/show_help.h".

jsquyres avatar Sep 23 '22 21:09 jsquyres

I think we do have that: https://github.com/open-mpi/ompi/blob/65e6591df99620bd52a77445e42338bbcfdacb79/opal/mca/threads/argobots/threads_argobots_mutex.h#L43 ~Is the from the Mellanox CI?~ Missing for pthreads. Updating.

janciesko avatar Sep 23 '22 22:09 janciesko

Added missing include.

janciesko avatar Sep 23 '22 22:09 janciesko

bot:aws:retest

jsquyres avatar Sep 27 '22 22:09 jsquyres

bot:aws:retest

bwbarrett avatar Sep 28 '22 01:09 bwbarrett

The failure from the Pull Request Build Checker is real, but caused by a bad commit to the main branch. If you push a rebase on top of main, it should go away.

bwbarrett avatar Sep 29 '22 01:09 bwbarrett

Rebased to today's main. Thanks @bwbarrett.

janciesko avatar Oct 03 '22 16:10 janciesko

is this good to go or waiting on something?

awlauria avatar Oct 12 '22 13:10 awlauria

LGTM

janciesko avatar Oct 13 '22 19:10 janciesko