ch4/config: refine --with-ch4-shmmods options

Open hzhou opened this issue 1 year ago • 1 comments

Pull Request Description

It is a bit confusing how to configure shmmods, since posix, xpmem, and gpudirect are not equal and are currently configured inconsistently. Instead, we'll simplify --with-ch4-shmmods to accept only "none" and "auto". "auto" is the default and means we will probe and include all shm/ipc features. --without-ch4-shmmods or `--with-ch4-shmmods=none` turns on MPIDI_CH4_DIRECT_NETMOD.

We'll detect each individual shmmod separately. Use --with-{xpmem,cuda,hip,ze}= to set library paths. Use --without-{xpmem,...} to force a feature out.
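
As an illustration of the options described above, the following sketch assembles a few configure command lines. It only prints the commands rather than running them. The install prefixes (`/opt/xpmem`, `/usr/local/cuda`) are placeholders chosen for the example, and `--with-device=ch4:ofi` is just one possible device selection, not something this PR requires.

```sh
#!/bin/sh
# Sketch only: prints example configure invocations instead of running them.
# The two paths below are placeholders, not locations MPICH expects.
XPMEM_DIR=/opt/xpmem
CUDA_DIR=/usr/local/cuda

# Print a command line; swap this for a real invocation in a build script.
show() { printf '%s\n' "$*"; }

# Default ("auto"): probe and include all shm/ipc features that are found.
show ./configure --with-device=ch4:ofi

# Point the probes at specific installations.
show ./configure --with-device=ch4:ofi --with-xpmem="$XPMEM_DIR" --with-cuda="$CUDA_DIR"

# Keep shared memory support but force xpmem out.
show ./configure --with-device=ch4:ofi --without-xpmem

# No shmmods at all; this turns on MPIDI_CH4_DIRECT_NETMOD.
show ./configure --with-device=ch4:ofi --with-ch4-shmmods=none
```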

GPU support is checked via MPL, reflected as $GPU_SUPPORT in mpich configure.

Both posix and (future) cma are supported by standard Linux, so --with-{posix,cma} (without a path value) turns the feature on. Posix will default to on; CMA will default to off because of the default ptrace_scope permission policy. Use --without-{posix,cma} to explicitly turn a feature off. --without-posix is currently a no-op since posix cannot be turned off, but IMO it is worth fixing for consistency. We can simply make it equivalent to MPIDI_CH4_DIRECT_NETMOD.
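
The on/off policy described above can be sketched as a small shell function. This is a hypothetical illustration, not MPICH's configure code: the function name `feature_enabled` and the `probe` result are made up for the example, and it models the proposed behavior in which --without-posix actually takes effect. The second argument mirrors what autoconf stores in `$with_<feature>`: empty when the option is absent, `no` for `--without-<feature>`, `yes` for `--with-<feature>`, or a path.

```sh
#!/bin/sh
# Hypothetical sketch of the default policy; not taken from MPICH's configure.
# $1 = feature name
# $2 = value autoconf would place in $with_<feature>
feature_enabled() {
    case "$2" in
        no)  echo no ;;                  # --without-<feature>
        yes) echo yes ;;                 # --with-<feature>
        "")                              # option not given: apply the default
            case "$1" in
                posix) echo yes ;;       # posix defaults to on
                cma)   echo no ;;        # cma defaults to off (ptrace_scope policy)
                *)     echo probe ;;     # xpmem, cuda, hip, ze: decided by detection
            esac ;;
        *)   echo probe ;;               # a path was given: detect under that path
    esac
}

feature_enabled posix ""
feature_enabled cma ""
feature_enabled cma yes
feature_enabled xpmem /opt/xpmem
```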

[skip warnings]

Impact

It is backward compatible except for the CSV list options, e.g. --with-ch4-shmmods=posix,xpmem,gpudirect.
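
Build scripts that still pass a CSV list could be checked with something like the sketch below. The helper `classify_shmmods` is hypothetical and not part of MPICH; it only sorts a value intended for --with-ch4-shmmods into the two accepted keywords or a legacy form, so a wrapper script can warn before calling configure. How configure itself reacts to a legacy value is not stated here, so the sketch does not assume it.

```sh
#!/bin/sh
# Hypothetical pre-flight check for build scripts; not part of MPICH.
# Classifies a value intended for --with-ch4-shmmods=<value>.
classify_shmmods() {
    case "$1" in
        ""|auto)  echo auto ;;     # option omitted, or explicit "auto"
        none|no)  echo none ;;     # "no" is what --without-ch4-shmmods yields
        *)        echo legacy ;;   # e.g. posix,xpmem,gpudirect
    esac
}

value="posix,xpmem,gpudirect"
if [ "$(classify_shmmods "$value")" = "legacy" ]; then
    echo "warning: '$value' is a legacy CSV list; use auto or none plus --with/--without-<feature>" >&2
fi
```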

Author Checklist

  • [x] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form `module: short description`. Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

hzhou, Jun 26 '24 15:06

test:mpich/ch4/xpmem test:mpich/ch4/gpu/ofi test:mpich/ch4/ucx

xpmem failures: The timeouts are a sockets provider performance issue, unrelated to xpmem. The threadcomm failure will be addressed here: https://github.com/pmodels/mpich/pull/6579/commits/4455a5a7c21fca965f67a1bd3494fd093b49c360

hzhou, Jun 26 '24 16:06

@raffenet The xpmem test failures are unrelated and will be addressed in #6579. Could you review this one first?

hzhou, Jul 09 '24 20:07