ch4/ipc: Implement IPC support for ze

Open abrooks98 opened this issue 3 years ago • 11 comments

Pull Request Description

This PR implements IPC support for the ze backend in MPL.

Level Zero uses file descriptors for IPC handles, so the handles cannot be transmitted directly between processes in a generic way. Instead, we use pidfd_open and pidfd_getfd to duplicate the fd underlying each handle, which allows generic transmission. The protocol therefore does not have to rely on UNIX sockets to share IPC handles with other processes.
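As a rough illustration of the fd duplication described above, here is a minimal sketch in C. The helper name `dup_remote_fd` is hypothetical and is not taken from this PR; it assumes the sender has already transmitted its pid and the fd number as plain integers. The fallback syscall numbers are only there for older libc headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback syscall numbers for libc headers that predate these calls. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Hypothetical helper (not from the PR): duplicate `remote_fd`, which is
 * open in process `remote_pid`, into the calling process.
 * Returns the new local fd, or -1 with errno set. */
int dup_remote_fd(pid_t remote_pid, int remote_fd)
{
    /* Obtain a pidfd that refers to the remote process. */
    int pidfd = (int) syscall(SYS_pidfd_open, remote_pid, 0);
    if (pidfd < 0)
        return -1;

    /* Duplicate the remote fd into this process. */
    int local_fd = (int) syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0);

    /* Preserve errno from pidfd_getfd across close(). */
    int saved_errno = errno;
    close(pidfd);
    errno = saved_errno;
    return local_fd;
}
```

Note that pidfd_getfd is subject to ptrace-style access checks on the target process, so the call can fail with EPERM even on a new enough kernel.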

This approach requires Linux kernel version 5.6.0 or newer (pidfd_getfd was not implemented before this version). Support for this method is checked at runtime, and IPC is silently disabled if it is unavailable.
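One way such a runtime check could look is sketched below. This is an assumption about the general shape of the probe, not the code in this PR, and `pidfd_ipc_supported` is a hypothetical name. The probe exercises both syscalls against the calling process itself and reports failure for any error, including ENOSYS on older kernels.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback syscall numbers for libc headers that predate these calls. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Hypothetical probe (not from the PR): returns true only if both
 * pidfd_open and pidfd_getfd succeed for the calling process. */
bool pidfd_ipc_supported(void)
{
    int pidfd = (int) syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0)
        return false;   /* syscall missing or blocked */

    /* Use the pidfd itself as a known-valid fd to duplicate. */
    int dupfd = (int) syscall(SYS_pidfd_getfd, pidfd, pidfd, 0);
    if (dupfd >= 0)
        close(dupfd);
    close(pidfd);
    return dupfd >= 0;
}
```

A caller would typically run the probe once during initialization and fall back to the non-IPC path when it returns false.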

Additionally, in cases where zeMemFree cannot be intercepted by the MPI library (e.g., due to link ordering), the cache can produce false hits: a later allocation may reuse the virtual address of an earlier one while being a distinct allocation. In this scenario, the cached IPC handle is invalid and will cause failures. To prevent this, the gavl cache now also checks the unique ID of the allocation.
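The idea behind the ID check can be sketched with a toy cache. This is only an illustration: a small fixed-size table stands in for the gavl tree, an `int` stands in for the IPC handle, and all names (`cache_insert`, `cache_lookup`, `cache_entry_t`) are hypothetical rather than taken from MPL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8   /* tiny table standing in for the gavl tree */

typedef struct {
    bool valid;
    uintptr_t addr;     /* base virtual address of the allocation */
    uint64_t alloc_id;  /* unique ID reported for the allocation */
    int handle;         /* stand-in for the cached IPC handle */
} cache_entry_t;

static cache_entry_t cache[CACHE_SLOTS];

/* Insert or overwrite the entry for `addr`. */
void cache_insert(uintptr_t addr, uint64_t alloc_id, int handle)
{
    size_t slot = 0;    /* overwrite slot 0 if the table is full */
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].addr == addr) {
            slot = i;   /* reuse the existing entry for this address */
            break;
        }
        if (!cache[i].valid) {
            slot = i;   /* remember a free slot, keep scanning for a match */
        }
    }
    cache[slot] = (cache_entry_t) { true, addr, alloc_id, handle };
}

/* Returns true and sets *handle only if both the address and the
 * allocation ID match. An entry with a matching address but a different
 * ID belongs to a freed allocation, so it is invalidated. */
bool cache_lookup(uintptr_t addr, uint64_t alloc_id, int *handle)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].valid || cache[i].addr != addr)
            continue;
        if (cache[i].alloc_id == alloc_id) {
            *handle = cache[i].handle;
            return true;
        }
        cache[i].valid = false;     /* stale entry: address was reused */
        return false;
    }
    return false;
}
```

With an address-only key, the second lookup in a free/reallocate sequence would return the stale handle; comparing the ID turns that case into a miss, and the caller creates a fresh handle.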

Author Checklist

  • [ ] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [ ] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description. Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

abrooks98 avatar Jun 29 '22 22:06 abrooks98

test:mpich/ch4/gpu/ofi

abrooks98 avatar Jun 29 '22 23:06 abrooks98

test:mpich/ch4/gpu/ofi

abrooks98 avatar Jun 30 '22 13:06 abrooks98

test:mpich/ch4/gpu/ofi

abrooks98 avatar Jun 30 '22 14:06 abrooks98

test:mpich/ch4/most

abrooks98 avatar Jun 30 '22 19:06 abrooks98

test:mpich/ch4/gpu/ofi test:mpich/ch4/most

abrooks98 avatar Jun 30 '22 22:06 abrooks98

test:mpich/ch4/most

abrooks98 avatar Jul 08 '22 18:07 abrooks98

I'm not quite sure why the ch4/most tests are failing. I am not seeing the failures when running them locally with the same configs.

abrooks98 avatar Jul 08 '22 19:07 abrooks98

test:mpich/ch4/gpu/ofi

abrooks98 avatar Jul 08 '22 21:07 abrooks98

test:mpich/ch4/gpu/ofi

abrooks98 avatar Aug 15 '22 22:08 abrooks98

test:mpich/ch4/gpu/ofi

abrooks98 avatar Aug 16 '22 15:08 abrooks98

test:mpich/ch4/gpu/ofi

abrooks98 avatar Sep 07 '22 17:09 abrooks98

No longer needed

abrooks98 avatar Jun 26 '23 21:06 abrooks98