
ch4/am: Caching buffer attribute in request and use typerep fast path for H2H

Open yfguo opened this issue 1 year ago • 3 comments

Pull Request Description

This PR adds three things:

  1. A fast path in typerep copy/pack/unpack for the H2H (host-to-host) case. It bypasses the pointer attribute query and the related branches, and is enabled with the new flag MPIR_TYPEREP_FLAG_H2H. The SHM pipeline benefits by skipping a few memory accesses and branches.
  2. A cache for the buffer attribute in the request. AM recv now checks the buffer attribute and saves it in the recv request. The recv data copy uses this info to choose the typerep H2H path when possible.
  3. The POSIX SHM iqueue also checks the buffer attribute and uses the typerep H2H path.
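
For readers unfamiliar with the pattern, here is a minimal, standalone sketch of the idea behind items 1 and 2. It is not the code in this PR: only the flag name MPIR_TYPEREP_FLAG_H2H comes from the description above, and every identifier in the sketch (`typerep_copy`, `query_buf_attr`, `recv_req_t`, and so on) is a hypothetical stand-in. The attribute query is mocked and always reports host memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only. TYPEREP_FLAG_H2H plays the
 * role of MPIR_TYPEREP_FLAG_H2H; nothing here is the real MPICH API. */
#define TYPEREP_FLAG_NONE 0x0u
#define TYPEREP_FLAG_H2H  0x1u

typedef enum { BUF_ATTR_UNKNOWN = 0, BUF_ATTR_HOST, BUF_ATTR_DEVICE } buf_attr_t;

/* Counts how often the (mock) attribute query runs. */
static int attr_query_count = 0;

/* Mock pointer attribute query; this sketch assumes all memory is host memory. */
static buf_attr_t query_buf_attr(const void *ptr)
{
    (void) ptr;
    attr_query_count++;
    return BUF_ATTR_HOST;
}

/* Copy with an optional fast path: when the caller asserts host-to-host,
 * skip both attribute queries and the branches that depend on them. */
static void typerep_copy(void *dst, const void *src, size_t n, unsigned flags)
{
    assert(dst != NULL && src != NULL);
    if (flags & TYPEREP_FLAG_H2H) {
        memcpy(dst, src, n);
        return;
    }
    buf_attr_t s = query_buf_attr(src);
    buf_attr_t d = query_buf_attr(dst);
    if (s == BUF_ATTR_HOST && d == BUF_ATTR_HOST) {
        memcpy(dst, src, n);
    }
    /* Device paths are omitted in this sketch. */
}

/* Request that caches the attribute of the user's receive buffer. */
typedef struct {
    void *buf;
    buf_attr_t attr;
} recv_req_t;

/* Query the attribute once, when the receive is set up. */
static void recv_req_init(recv_req_t *req, void *buf)
{
    req->buf = buf;
    req->attr = query_buf_attr(buf);
}

/* Copy one incoming segment. The source is assumed to be host memory
 * (for example a shared-memory cell), so only the cached attribute of
 * the destination decides whether the fast path applies. */
static void recv_copy_segment(recv_req_t *req, size_t offset,
                              const void *src, size_t n)
{
    unsigned flags = (req->attr == BUF_ATTR_HOST) ? TYPEREP_FLAG_H2H
                                                  : TYPEREP_FLAG_NONE;
    typerep_copy((char *) req->buf + offset, src, n, flags);
}
```

In this sketch, a receive that arrives in several segments pays for one attribute query at setup and none per segment, which is the saving the description is after.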

Request for comments (@hzhou @raffenet ):

  1. It might be beneficial to generalize this and add H2D and D2H paths, so that we save at least one buffer attribute check in typerep. This should help the GPU pipeline, since different segments of the same buffer should always have the same attribute.
  2. Passing buffer info from POSIX to the eager module is missing. I am adding a new POSIX_AM_TYPE for now, which is not the best solution IMO.
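
One possible shape for the generalization in item 1, as a sketch only: classify the copy direction once per buffer pair and reuse it for every pipeline segment. All names here (`mem_kind_t`, `copy_dir_t`, `classify_copy`, `pipeline_t`) are hypothetical and do not exist in MPICH; how the direction would map onto typerep flags is left open.

```c
#include <stddef.h>

/* Hypothetical memory kinds and copy directions, for illustration only. */
typedef enum { MEM_HOST, MEM_DEVICE } mem_kind_t;
typedef enum { COPY_H2H, COPY_H2D, COPY_D2H, COPY_D2D } copy_dir_t;

/* Derive the copy direction from the source and destination kinds. */
static copy_dir_t classify_copy(mem_kind_t src, mem_kind_t dst)
{
    if (src == MEM_HOST) {
        return (dst == MEM_HOST) ? COPY_H2H : COPY_H2D;
    }
    return (dst == MEM_HOST) ? COPY_D2H : COPY_D2D;
}

/* Pipeline state: the direction is computed once and then reused, on the
 * assumption that all segments of one buffer share the same attribute. */
typedef struct {
    copy_dir_t dir;
    size_t total_bytes;
    size_t segment_bytes;
} pipeline_t;

static void pipeline_init(pipeline_t *p, mem_kind_t src, mem_kind_t dst,
                          size_t total_bytes, size_t segment_bytes)
{
    p->dir = classify_copy(src, dst);
    p->total_bytes = total_bytes;
    p->segment_bytes = segment_bytes;
}

/* Number of segments, i.e. the number of per-segment attribute checks that
 * the cached direction makes unnecessary (rounded up). */
static size_t pipeline_num_segments(const pipeline_t *p)
{
    return (p->total_bytes + p->segment_bytes - 1) / p->segment_bytes;
}
```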

Author Checklist

  • [ ] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [ ] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description. Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [ ] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

yfguo, Aug 01 '24 19:08

test:mpich/ch4/ofi

yfguo, Aug 01 '24 19:08

Here is the context of this PR.

I was testing the performance of fbox and found that building with yaksa vs. dataloop gives a 0.02-0.03 us difference in latency for a 1-byte message. Since no derived datatype is involved, the data copy should mostly go through the memcpy path in typerep_copy/pack. The difference between typerep_yaksa and typerep_dataloop is the check on buffer pointer attributes.

I have not checked the GPU build yet, but I think it would be good to cache this info instead of checking it every time.

yfguo, Aug 01 '24 19:08

test:mpich/ch4/ofi

yfguo, Aug 01 '24 19:08