
pmi: add thread support to PMI_Barrier_group

Open hzhou opened this issue 10 months ago • 1 comments

Pull Request Description

  • Add stringtag to PMI_Barrier_group function signature.
int PMI_Barrier_group(const int *group, int count, const char *stringtag);
  • PMI_Barrier() is the same as PMI_Barrier_group(PMI_GROUP_WORLD, 0, NULL)
  • Set the environment variable PMI_IS_THREADED to enable thread support in PMI. Call setenv before calling PMI_Init.
  • Only the following functions are allowed to be used in multiple threads concurrently:
    • PMI_KVS_Put
    • PMI_KVS_Get
    • PMI_Barrier_group
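As a rough illustration of the usage described above, the sketch below sets PMI_IS_THREADED before initialization and builds a distinct stringtag for each thread. The helper names enable_threaded_pmi and make_thread_tag are hypothetical, the value "1" is an assumption (the description only says the variable must be set), and the "thread-N" tag format is just one possible convention for making matching threads on different processes agree on a tag.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: request threaded PMI. Must run before PMI_Init.
 * Assumption: any value enables threading; "1" is used for illustration. */
int enable_threaded_pmi(void)
{
    return setenv("PMI_IS_THREADED", "1", 1);
}

/* Hypothetical helper: build a per-thread stringtag such as "thread-3".
 * Returns 0 on success, -1 if the buffer is too small. */
int make_thread_tag(char *buf, size_t len, int thread_id)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "thread-%d", thread_id);
    return (n > 0 && (size_t) n < len) ? 0 : -1;
}
```

Each thread would then pass its own tag to the barrier, for example PMI_Barrier_group(PMI_GROUP_WORLD, 0, tag), so that thread 3 on one process pairs with thread 3 on the others.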

[skip warnings]

CHANGES

  • Deprecate PMI v2
  • Remove PMI2 thread support
    • There are no users.
    • It does not work for multi-threaded fence (or barrier) because there is no mechanism for collective thread matching.
  • Remove MPIR_pmi_is_threaded. There is no good place to call this API: the underlying PMI either supports threads or it does not, and neither case needs a setting.

Implementation

Client (libpmi)

  • PMI_KVS_Put and PMI_KVS_Get are lock protected
  • PMI_Barrier_group is nonblocking internally: an atomic query followed by atomic tests in a while-loop.
  • PMI_cmd_read enqueues unexpected barrier responses.
  • PMIU_cmd_test_barrier peeks and "unreads" any handled PMI commands.
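The client-side pattern can be pictured with a small model. This is only a sketch under assumptions, not the libpmi code: the pending array, pmi_lock, and the functions enqueue_barrier_response, test_barrier, and wait_barrier are all hypothetical names, and a bounded poll count stands in for the real while-loop so the example always terminates.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PENDING 16
#define MAX_TAG 64

/* Hypothetical queue of barrier responses that were read from the wire
 * before the thread waiting on them asked for them. */
static char pending[MAX_PENDING][MAX_TAG];
static int num_pending = 0;
static pthread_mutex_t pmi_lock = PTHREAD_MUTEX_INITIALIZER;

/* Record an unexpected barrier response under the lock. */
bool enqueue_barrier_response(const char *tag)
{
    assert(tag != NULL);
    bool ok = false;
    pthread_mutex_lock(&pmi_lock);
    if (num_pending < MAX_PENDING && strlen(tag) < MAX_TAG) {
        strcpy(pending[num_pending++], tag);
        ok = true;
    }
    pthread_mutex_unlock(&pmi_lock);
    return ok;
}

/* One atomic test: claim the response for `tag` if it has arrived. */
bool test_barrier(const char *tag)
{
    assert(tag != NULL);
    bool done = false;
    pthread_mutex_lock(&pmi_lock);
    for (int i = 0; i < num_pending; i++) {
        if (strcmp(pending[i], tag) == 0) {
            /* Remove the entry by moving the last one into its slot. */
            num_pending--;
            if (i != num_pending)
                strcpy(pending[i], pending[num_pending]);
            done = true;
            break;
        }
    }
    pthread_mutex_unlock(&pmi_lock);
    return done;
}

/* Blocking wrapper: repeat the atomic test, giving up after max_polls. */
bool wait_barrier(const char *tag, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (test_barrier(tag))
            return true;
    }
    return false;
}
```

Because the lock is held only for each individual test and never across the whole wait, other threads can still issue their own puts, gets, and barrier tests while one thread is waiting.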

Server (hydra)

  • Combine the group string and stringtag to form the hash key for the barrier.
  • In case the group and stringtag do not separate the threads, use an epoch to avoid barrier deadlocks. The KVS synchronization may get mixed up, but at least we don't deadlock and can report errors to users.
  • Barrier handling is strictly serialized between proxy and server. The proxy holds back later epochs while the top epoch is in progress.
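A minimal sketch of how a group string and a stringtag might be combined into one lookup key is shown below. The function name make_barrier_key, the "/" separator, and the treatment of a NULL stringtag as an empty string are all assumptions made for illustration; the actual hydra key format is not described here, and the epoch handling is not modeled.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: join the group string and the stringtag into one
 * key, e.g. "WORLD/thread-3". A NULL stringtag is treated as empty.
 * Returns 0 on success, -1 if the buffer is too small. */
int make_barrier_key(char *buf, size_t len,
                     const char *group_str, const char *stringtag)
{
    assert(buf != NULL && group_str != NULL && len > 0);
    int n = snprintf(buf, len, "%s/%s", group_str,
                     stringtag != NULL ? stringtag : "");
    return (n >= 0 && (size_t) n < len) ? 0 : -1;
}
```

With a key of this kind, two threads that use different stringtags on the same group land in different barrier entries, which is the separation the first bullet relies on.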

Diagram

image

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

hzhou avatar Jun 10 '25 15:06 hzhou

test:mpich/ch4/most test:mpich/ch3/most

Note: the ch4-ofi-asan tests use the socket provider and suffer from collective hangs during initialization due to fi_inject send. I'll address this separately.

hzhou avatar Jun 16 '25 23:06 hzhou

test:mpich/authorship

hzhou avatar Jun 18 '25 15:06 hzhou

test:mpich/ch3/tcp

hzhou avatar Jun 18 '25 15:06 hzhou

test:mpich/authorship

hzhou avatar Jun 18 '25 15:06 hzhou

test:mpich/ch3/most

hzhou avatar Jun 18 '25 15:06 hzhou

test:mpich/ch4/most

hzhou avatar Jun 18 '25 15:06 hzhou