Optimization of MPI layer
Original issue: https://charm.cs.illinois.edu/redmine/issues/23
Revisit MPI layer and possibly rewrite to improve performance.
Original date: 2013-02-06 06:29:03
I've heard through the grapevine (not to be confused with Harshitha's LB strategy) that Ralf has achieved near-parity with some native layer in his work at Argonne for Pavan. Marking it 'in-progress' accordingly.
Original date: 2013-02-11 22:24:31
Quoting Ralf's email on this:
As for MPI, we've found something that apparently makes it as fast as gemini_gni-crayxe, and the same seems to be true of ibverbs. We're testing this on Intrepid and Vesta as well. Pavan doesn't care about TCP (i.e., MPI over sockets) at all, which is where the performance difference is biggest (almost 4x).
Original date: 2017-01-17 03:29:53
It looks like none of the work mentioned above was ever merged...
Original date: 2018-04-29 22:45:56
Small optimization to use MPI-3's MPI_Mprobe and MPI_Mrecv where possible: ~~https://charm.cs.illinois.edu/gerrit/#/c/charm/+/2785/~~ https://github.com/UIUC-PPL/charm/commit/f7fbaaaae1088ee65988c1345203da3ed55abcc4
Edit: we ended up reverting this because support for MPI_Mprobe is spotty and not really detectable at configure time in a portable manner.
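For reference, the matched-probe pattern that commit switched to looks roughly like the following (a simplified sketch, not the actual machine-layer code): MPI_Mprobe atomically dequeues the matched message, so the subsequent MPI_Mrecv on the returned handle cannot race with other probes or receives, unlike the older MPI_Iprobe + MPI_Recv pairing.

```c
#include <mpi.h>
#include <stdlib.h>

/* Matched probe + receive of a message of unknown size (simplified sketch). */
static char *recv_any(MPI_Comm comm, int *out_len)
{
    MPI_Message msg;
    MPI_Status  status;
    int         count;

    /* Blocks until a message matches and atomically dequeues it, so no other
     * thread or receive call can steal it between the probe and the recv. */
    MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg, &status);
    MPI_Get_count(&status, MPI_BYTE, &count);

    char *buf = malloc(count);
    /* Receives exactly the message returned by the matched probe. */
    MPI_Mrecv(buf, count, MPI_BYTE, &msg, &status);

    *out_len = count;
    return buf;
}
```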
I reached out to Ralf in 2018 about these patches and they are presumed lost. However he did mention that "IIRC the code was mostly using MPI3 RMA operations in a fairly natural way."
Wow, it was not even saved on a branch?
Not if it was never pushed out of his local repository.
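For context, "MPI3 RMA operations in a fairly natural way" presumably refers to the one-sided MPI_Put/MPI_Get path with passive-target synchronization. A minimal illustration of those calls (purely hypothetical; in no way a reconstruction of the lost patches):

```c
#include <mpi.h>

/* One-sided delivery into a remote window (illustrative only; window
 * creation and destruction are collective over comm). */
static void rma_put_example(MPI_Comm comm, int target, const char *payload, int len)
{
    MPI_Win  win;
    char    *base;
    MPI_Aint win_size = 1 << 20;   /* arbitrary 1 MiB landing zone */

    /* Every rank exposes a buffer that peers can write into directly. */
    MPI_Win_allocate(win_size, 1, MPI_INFO_NULL, comm, &base, &win);

    /* Passive-target epoch: the target makes no matching call. */
    MPI_Win_lock_all(0, win);
    MPI_Put(payload, len, MPI_BYTE, target, 0 /* displacement */,
            len, MPI_BYTE, win);
    MPI_Win_flush(target, win);    /* remote completion of the Put */
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
}
```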
Here are some minor MPI-layer optimizations we've done in the recent-ish past, with notes on their status and motivations (a combined sketch of the first three appears after the list):

- Add a build-time option to use MPI_Alloc_mem inside CmiAlloc. This can potentially speed up messaging, but it is off by default pending more performance testing and may need more pooling of messages: https://github.com/UIUC-PPL/charm/commit/724b2a02084e288bbd6cbdd22b60891b504ec333
- Allow enabling preposted receives with a build-time flag. This is potentially an improvement for small/medium-sized messages but may require more tuning of parameters, so it is still off by default: https://github.com/UIUC-PPL/charm/commit/868e39ad0738dbb3a32cca651919819cd1064a10
- Add an MPI_Info assertion on the Charm communicator stating that Charm does not require MPI message ordering. An MPI implementation that optimizes for this case could exploit it, but no known implementations do so yet: https://github.com/UIUC-PPL/charm/commit/cb4d760c4eaf478d80e2873e4dfd243933042744
- Use the MPI-3 matching MPI_Mprobe/MPI_Mrecv functions (see the sketch above). This potentially decreases MPI queue access times, but it was found to be implemented incorrectly in multiple MPI implementations, so we had to revert it for correctness. Also, since we almost always do wildcard receives, our MPI queue access times are minimal anyway because we just take whatever is at the head of the queue: https://github.com/UIUC-PPL/charm/commit/f7fbaaaae1088ee65988c1345203da3ed55abcc4
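Here is a condensed, hypothetical sketch of what the first three options above amount to in plain MPI: message buffers from MPI_Alloc_mem, a ring of preposted wildcard receives, and the allow-overtaking info assertion on the Charm communicator. This is not the actual machine layer; the buffer sizes, prepost count, and tag are made-up parameters, and the info key shown is the MPI-4 standardized assertion name, which may differ from what the commit used.

```c
#include <mpi.h>

#define NUM_PREPOSTED  8        /* hypothetical tuning parameter */
#define PREPOST_MAX_SZ 65536    /* hypothetical small/medium message cutoff */
#define CHARM_TAG      1        /* placeholder tag */

static MPI_Comm    charm_comm;
static MPI_Request prepost_req[NUM_PREPOSTED];
static char       *prepost_buf[NUM_PREPOSTED];

/* CmiAlloc-style allocation backed by MPI_Alloc_mem so the MPI
 * implementation can return registered (pinned) memory; pair with
 * MPI_Free_mem() on release. */
static void *alloc_msg(MPI_Aint size)
{
    void *p;
    MPI_Alloc_mem(size, MPI_INFO_NULL, &p);
    return p;
}

static void layer_init(void)
{
    MPI_Comm_dup(MPI_COMM_WORLD, &charm_comm);

    /* Assert that Charm++ does not depend on MPI message ordering.
     * Implementations that don't optimize for this simply ignore it. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");
    MPI_Comm_set_info(charm_comm, info);
    MPI_Info_free(&info);

    /* Prepost a ring of wildcard receives so small/medium messages land
     * directly in user buffers instead of the unexpected-message queue. */
    for (int i = 0; i < NUM_PREPOSTED; i++) {
        prepost_buf[i] = alloc_msg(PREPOST_MAX_SZ);
        MPI_Irecv(prepost_buf[i], PREPOST_MAX_SZ, MPI_BYTE, MPI_ANY_SOURCE,
                  CHARM_TAG, charm_comm, &prepost_req[i]);
    }
}

/* Progress-loop fragment: service a completed preposted receive, copy the
 * message out, then repost the same buffer. */
static void poll_preposted(void)
{
    int idx, flag;
    MPI_Status st;
    MPI_Testany(NUM_PREPOSTED, prepost_req, &idx, &flag, &st);
    if (flag && idx != MPI_UNDEFINED) {
        /* ...copy prepost_buf[idx] (length via MPI_Get_count) into a message
         * from alloc_msg() and hand it to the scheduler, then repost: */
        MPI_Irecv(prepost_buf[idx], PREPOST_MAX_SZ, MPI_BYTE, MPI_ANY_SOURCE,
                  CHARM_TAG, charm_comm, &prepost_req[idx]);
    }
}
```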