mpich
pmi: add thread support to PMI_Barrier_group
Pull Request Description
- Add stringtag to the PMI_Barrier_group function signature.
  int PMI_Barrier_group(const int *group, int count, const char *stringtag);
- PMI_Barrier() is the same as PMI_Barrier_group(PMI_GROUP_WORLD, 0, NULL)
- Set environment PMI_IS_THREADED to enable threaded support in PMI. Use setenv before calling PMI_Init.
- Only the following functions are allowed to be used in multiple threads concurrently:
  - PMI_KVS_Put
  - PMI_KVS_Get
  - PMI_Barrier_group
[skip warnings]
CHANGES
- Deprecate PMI v2
- Remove PMI2 thread support
  - There are no users.
  - It does not work for multi-threaded fence (or barrier) since there is no mechanism for collective thread matching.
- Remove MPIR_pmi_is_threaded. There is no good place to call this API. The underlying PMI either supports threads or it does not; neither case needs a setting.
Implementation
Client (libpmi)
- PMI_KVS_Put and PMI_KVS_Get are lock protected
- PMI_Barrier_group is internally nonblocking: an atomic query followed by atomic tests in a while-loop
- PMI_cmd_read enqueues unexpected barrier responses
- PMIU_cmd_test_barrier peeks and "unreads" any handled pmi commands.
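The client-side pattern described above (one atomic query, then atomic tests in a loop, with unexpected responses parked in a queue) can be sketched roughly as follows. This is a minimal illustration, not the libpmi code: struct client, client_enqueue_response, client_test_barrier, and client_barrier are hypothetical names, the "wire" is reduced to a counter, and matching responses by tag string is an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PENDING 8
#define TAG_LEN 32

/* Hypothetical client state; none of these names come from libpmi. */
struct client {
    pthread_mutex_t lock;               /* serializes access to the wire and the queue */
    char pending[MAX_PENDING][TAG_LEN]; /* barrier responses read but not yet claimed */
    int num_pending;
    int queries_sent;                   /* stand-in for writing a query to the server */
};

static void client_init(struct client *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->num_pending = 0;
    c->queries_sent = 0;
}

/* Reader side: park a barrier response that no thread has claimed yet. */
static void client_enqueue_response(struct client *c, const char *tag)
{
    pthread_mutex_lock(&c->lock);
    assert(c->num_pending < MAX_PENDING);
    assert(strlen(tag) < TAG_LEN);
    strcpy(c->pending[c->num_pending], tag);
    c->num_pending++;
    pthread_mutex_unlock(&c->lock);
}

/* Atomic test: claim the response for `tag` if it has already arrived. */
static bool client_test_barrier(struct client *c, const char *tag)
{
    bool found = false;
    pthread_mutex_lock(&c->lock);
    for (int i = 0; i < c->num_pending; i++) {
        if (strcmp(c->pending[i], tag) == 0) {
            /* Remove entry i by moving the last entry into its slot. */
            memmove(c->pending[i], c->pending[c->num_pending - 1], TAG_LEN);
            c->num_pending--;
            found = true;
            break;
        }
    }
    pthread_mutex_unlock(&c->lock);
    return found;
}

/* Barrier: one atomic query, then atomic tests in a while-loop. The lock is
 * released between tests, so other threads can progress their own barriers. */
static int client_barrier(struct client *c, const char *tag)
{
    pthread_mutex_lock(&c->lock);
    c->queries_sent++;                  /* stand-in for sending the barrier query */
    pthread_mutex_unlock(&c->lock);

    while (!client_test_barrier(c, tag))
        sched_yield();
    return 0;
}
```

The point of the design is that no thread holds the lock while it waits. A thread that reads a response meant for another thread's barrier leaves it in the queue instead of dropping it.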
Server (hydra)
- Combine the group string and stringtag to form the hash key for the barrier.
- In case group and stringtag aren't separating the threads, use epoch to avoid barrier deadlocks. The kvs sync may get mixed, but at least we don't deadlock and can give users errors.
- Communication between proxy and server is strictly serialized. The proxy will hold back epochs when the top epoch is in progress.
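As a rough illustration of the server-side bookkeeping, the sketch below builds a hash key from the group string and stringtag and tracks arrivals per epoch. It is not the hydra implementation. The "/" separator, the fixed-size arrays, and the names make_barrier_key, struct barrier_state, and barrier_arrive are all assumptions made for the example. So is the rule that the n-th arrival from a given process on the same key belongs to epoch n.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define KEY_LEN 128
#define MAX_PROCS 4
#define MAX_EPOCHS 4

/* Build the barrier hash key from the group string and the stringtag.
 * A NULL stringtag is treated as empty. Returns 0 on success, or -1 if
 * the key does not fit in the buffer. The "/" separator is an assumption. */
static int make_barrier_key(char *key, size_t len, const char *group, const char *stringtag)
{
    int n = snprintf(key, len, "%s/%s", group, stringtag ? stringtag : "");
    return (n >= 0 && (size_t) n < len) ? 0 : -1;
}

/* Hypothetical per-key state: one entry in the barrier hash table. */
struct barrier_state {
    int nprocs;               /* number of processes in the group */
    int arrivals[MAX_PROCS];  /* how many times each process has entered */
    int count[MAX_EPOCHS];    /* number of processes that have reached each epoch */
};

/* Record one barrier-in from `rank`. Two threads of the same process that
 * collide on the same key land in consecutive epochs instead of being
 * counted twice in one barrier. Returns the epoch that just completed,
 * or -1 if no epoch completed. */
static int barrier_arrive(struct barrier_state *b, int rank)
{
    assert(rank >= 0 && rank < b->nprocs);
    int epoch = b->arrivals[rank]++;
    assert(epoch < MAX_EPOCHS);
    b->count[epoch]++;
    return (b->count[epoch] == b->nprocs) ? epoch : -1;
}
```

With two processes, two arrivals from rank 0 followed by two from rank 1 complete epoch 0 and then epoch 1. Which thread is released in which epoch is not controlled, which matches the caveat above that the sync may get mixed while deadlock is avoided.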
Diagram
Author Checklist
- [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/ch4/most
test:mpich/ch3/most
Note: the ch4-ofi-asan test uses the socket provider and suffers from collective hangs during initialization due to fi_inject send. I'll address this separately.
test:mpich/authorship
test:mpich/ch3/tcp
test:mpich/authorship
test:mpich/ch3/most
test:mpich/ch4/most