ch4/ipc: Implement IPC support for ze
Pull Request Description
This PR implements IPC support for the ze backend in MPL.
Level Zero uses file descriptors for IPC handles, so a handle cannot be sent to another process as plain bytes in a generic way: an fd number is only meaningful inside the process that owns it. To support generic transmission, we use pidfd_open and pidfd_getfd to duplicate the fd underlying each handle. As a result, the protocol does not have to rely on UNIX sockets to share IPC handles with other processes.
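For reviewers unfamiliar with the pidfd calls, the duplication step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name dup_remote_fd is hypothetical, and the fallback syscall numbers are an assumption that holds on x86_64 and most other architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback syscall numbers for older libc headers (assumed values; they
 * match x86_64 and most other architectures). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Hypothetical helper: duplicate fd number `remote_fd`, owned by process
 * `owner`, into the calling process. Returns the new local fd, or -1 with
 * errno set by the failing syscall. */
int dup_remote_fd(pid_t owner, int remote_fd)
{
    /* Obtain a pidfd that refers to the owning process. */
    int pidfd = (int) syscall(SYS_pidfd_open, owner, 0);
    if (pidfd < 0)
        return -1;

    /* Ask the kernel for a local duplicate of the owner's fd. */
    int local_fd = (int) syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0);

    /* Preserve errno from pidfd_getfd across close(). */
    int saved_errno = errno;
    close(pidfd);
    errno = saved_errno;
    return local_fd;
}
```

With a helper like this, only the owner's pid and the integer fd value need to travel over whatever transport is available. Note that pidfd_getfd also requires ptrace-attach permission over the owning process, so it can fail with EPERM under restrictive security settings.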
This approach requires Linux kernel 5.6 or newer, because pidfd_getfd is not available in earlier kernels. Support for it is checked at runtime, and IPC is silently disabled if it is not available.
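One way such a runtime check could look is sketched below. It is an assumption about the general shape of the probe, not the check implemented in this PR: the function name ipc_pidfd_supported is hypothetical, and the fallback syscall numbers are assumed values valid on x86_64 and most other architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback syscall numbers for older libc headers (assumed values). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Hypothetical probe: returns 1 if both pidfd_open and pidfd_getfd work in
 * this process, 0 otherwise. The result is cached after the first call
 * (this sketch is not thread-safe). */
int ipc_pidfd_supported(void)
{
    static int cached = -1;
    if (cached >= 0)
        return cached;

    cached = 0;
    int pidfd = (int) syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0)
        return cached;          /* e.g. ENOSYS on kernels older than 5.3 */

    /* Try to duplicate the pidfd itself; any valid fd would do. On kernels
     * older than 5.6 this fails with ENOSYS. */
    int probe = (int) syscall(SYS_pidfd_getfd, pidfd, pidfd, 0);
    if (probe >= 0) {
        close(probe);
        cached = 1;
    }
    close(pidfd);
    return cached;
}
```

A caller would skip the IPC path whenever the probe returns 0, which matches the silent fallback described above.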
Additionally, in cases where zeMemFree cannot be intercepted by the MPI library (e.g. due to link ordering), the cache can produce false hits: a later allocation may reuse the same virtual address as a freed one while being a distinct allocation. In this scenario, the cached IPC handle is invalid and will cause failures. To prevent this, the gavl cache now also checks the unique ID of the allocation before reporting a hit.
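The idea behind the extra check can be illustrated with a toy cache. This sketch uses a flat array rather than the gavl tree, and every name in it (cache_entry_t, cache_insert, cache_lookup) is hypothetical. In Level Zero the ID could plausibly come from the id field that zeMemGetAllocProperties reports, but which ID the PR actually uses is an assumption here.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64

/* Toy cache entry: the IPC handle is represented by a plain int. */
typedef struct {
    int valid;
    uintptr_t addr;      /* base virtual address of the allocation */
    uint64_t alloc_id;   /* unique ID of the allocation when cached */
    int handle;          /* stand-in for the cached IPC handle */
} cache_entry_t;

static cache_entry_t cache[CACHE_SLOTS];

static cache_entry_t *find_by_addr(uintptr_t addr)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr)
            return &cache[i];
    return NULL;
}

/* Insert or overwrite the entry for `ptr`. */
void cache_insert(const void *ptr, uint64_t alloc_id, int handle)
{
    uintptr_t addr = (uintptr_t) ptr;
    cache_entry_t *e = find_by_addr(addr);
    for (size_t i = 0; e == NULL && i < CACHE_SLOTS; i++)
        if (!cache[i].valid)
            e = &cache[i];
    if (e == NULL)
        e = &cache[0];   /* table full: overwrite an arbitrary slot */
    e->valid = 1;
    e->addr = addr;
    e->alloc_id = alloc_id;
    e->handle = handle;
}

/* Returns 1 and sets *handle on a true hit. An address match with a
 * different allocation ID is a stale entry: it is dropped and reported
 * as a miss. */
int cache_lookup(const void *ptr, uint64_t alloc_id, int *handle)
{
    cache_entry_t *e = find_by_addr((uintptr_t) ptr);
    if (e == NULL)
        return 0;
    if (e->alloc_id != alloc_id) {
        e->valid = 0;
        return 0;
    }
    *handle = e->handle;
    return 1;
}
```

Because a stale entry is evicted on the first mismatched lookup, the caller simply creates and inserts a fresh handle, and no free hook is needed for correctness.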
Author Checklist
- [ ] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [ ] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/ch4/gpu/ofi
test:mpich/ch4/gpu/ofi
test:mpich/ch4/gpu/ofi
test:mpich/ch4/most
test:mpich/ch4/gpu/ofi test:mpich/ch4/most
test:mpich/ch4/most
I'm not quite sure why the ch4/most tests are failing. I am not seeing the failures when running them locally with the same configs.
test:mpich/ch4/gpu/ofi
test:mpich/ch4/gpu/ofi
test:mpich/ch4/gpu/ofi
test:mpich/ch4/gpu/ofi
No longer needed