
ch4/ipc: refactor IPC and add CMA module

Open hzhou opened this issue 2 years ago • 4 comments

Pull Request Description

This is a temporary PR for reference. It will be rebased and possibly split into separate PRs.

[skip warnings]

TODO:

  • [ ] Should we enable CMA by default? Many distributions default the ptrace scope (kernel.yama.ptrace_scope) to "1", which results in EPERM from process_vm_readv. Enabling CMA by default would therefore raise many support issues.

  • [ ] --with-ch4-shmmods=posix,xpmem,cma,gpudirect is unintuitive. I think it is better to use individual options, e.g. --with-cma. We already use --with-xpmem and --with-cuda (and the --without- variants to disable).

    EDIT: address these in #7040
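For reference on the first TODO item, here is a minimal sketch of how a user or launcher script might check the ptrace scope before relying on CMA. It is not part of this PR. The helper names `read_ptrace_scope` and `cma_between_siblings_expected` are hypothetical, and the heuristic ignores capabilities such as CAP_SYS_PTRACE and per-process opt-ins via prctl(PR_SET_PTRACER), so treat the result as a hint only.

```python
from pathlib import Path

# Standard location of the Yama setting on Linux; the file is absent
# when the Yama security module is not enabled in the kernel.
YAMA_PTRACE_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")


def read_ptrace_scope(path=YAMA_PTRACE_SCOPE):
    """Return the Yama ptrace scope as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def cma_between_siblings_expected(scope):
    """Hypothetical heuristic: can sibling processes likely use process_vm_readv?

    None (no Yama) and 0 leave only the classic same-user ptrace checks.
    A scope of 1 or higher restricts attach-style access to descendants or
    privileged processes, so reads between sibling ranks are expected to
    fail with EPERM unless the target opts in or the caller is privileged.
    """
    return scope is None or scope == 0


if __name__ == "__main__":
    scope = read_ptrace_scope()
    print("ptrace_scope:", scope)
    print("CMA between sibling ranks expected to work:",
          cma_between_siblings_expected(scope))
```

A check like this only reports the system-wide setting. Whether MPICH should probe it at runtime, fall back silently, or warn is exactly the open question in the TODO above.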

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

hzhou avatar Jul 03 '23 14:07 hzhou

@hzhou I am rebasing the CMA code for testing. Are the IPC cleanup commits still relevant?

yfguo avatar Jun 05 '24 22:06 yfguo

@hzhou I am rebasing the CMA code for testing. Are the IPC cleanup commits still relevant?

The IPC cleanup is the main purpose of this PR. You can try cherry-picking the CMA commit for your testing.

hzhou avatar Jun 06 '24 02:06 hzhou

I pushed a few changes to fix the GPU and non-GPU builds. I also made the CMA configure option work the same way as the rest of the shmmods.

Should I rebase first? I will add back the whitespace changes from your last update.

yfguo avatar Jun 19 '24 14:06 yfguo

test:mpich/ch4/ofi test:mpich/ch4/gpu/ofi test:mpich/ch4/xpmem All ✔️

hzhou avatar Jun 28 '24 12:06 hzhou

test:mpich/ch4/ofi ✔️ test:mpich/ch4/gpu/ofi ✔️ test:mpich/ch4/xpmem - ipc src_dt_ptr was unset

test:mpich/custom netmod: ch4:ofi config: cma

hzhou avatar Aug 06 '24 22:08 hzhou

tag @raffenet for review

hzhou avatar Aug 06 '24 22:08 hzhou

test:mpich/ch4/xpmem - 2 failures

  • TIMEOUT - vci - pt2pt/sendflood 8
  • avltree leak - debug - coll/alltoallw_zeros 8

test:mpich/custom ✔️ netmod: ch4:ofi config: cma

hzhou avatar Aug 08 '24 04:08 hzhou

test:mpich/ch4/xpmem

2 failures:

  • TIMEOUT - asan - coll/nonblocking3 5
  • TIMEOUT - vci - pt2pt/sendflood 8

These are likely performance issues not related to this PR.

hzhou avatar Aug 08 '24 13:08 hzhou

@raffenet This PR is ready to go.

hzhou avatar Aug 08 '24 18:08 hzhou