ch4/ucx: using the ucx nbx am interface
Pull Request Description
The new UCX AM interface supports an internal rendezvous protocol, which makes it possible to take advantage of direct GPU IPC, among other benefits. This PR implements the UCX AM mode using this new AM interface.
TODO
- [ ] Update the version check to ensure that GPU buffers are supported by the new `ucp_am_send_nbx` API
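A minimal sketch of what such a version gate could look like, assuming the check reduces to comparing a (major, minor) pair against a minimum. The type `am_version_t` and the helper `version_at_least` are hypothetical names used only for illustration and are not part of MPICH or UCX. The minimum version that actually supports GPU buffers in `ucp_am_send_nbx` is not stated here, so it is left to the caller.

```c
#include <stdbool.h>

/* Hypothetical (major, minor) pair; not a UCX or MPICH type. */
typedef struct {
    unsigned major;
    unsigned minor;
} am_version_t;

/* Return true when `have` is the same as or newer than `need`.
 * The major number decides first; the minor number breaks ties. */
static bool version_at_least(am_version_t have, am_version_t need)
{
    if (have.major != need.major)
        return have.major > need.major;
    return have.minor >= need.minor;
}
```

In a real build, the `have` value would presumably come from the UCX headers at compile time or from `ucp_get_version` at run time, and the result would decide whether GPU buffers take the new AM path or a fallback.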
Author Checklist
- [x] Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: `module: short description`
Commit message explains what's in the commit.
- [ ] Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement
For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/ch4/most
test:mpich/ch4/ucx
test:mpich/ch4/ucx test:mpich/ch4/gpu/ucx
test:mpich/ch4/most
1 timeout in ch4-ofi-direct-nm: ./rma/pscw_ordering 4 -- Never seen this before, will check
1 failure in ch4-ucx-am-only:
not ok 2706 - ./errors/pt2pt/truncmsg2 2
---
Directory: ./errors/pt2pt
File: truncmsg2
Num-procs: 2
Timeout: 180
Date: "Thu Aug 17 12:53:34 2023"
...
## Test output (expected 'No Errors'):
## [pmrs-centos64-240-04:116679:0:116679] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe8)
## ==== backtrace (tid: 116679) ====
## 0 0x00000000000363b0 killpg() ???:0
## 1 0x00000000000498db MPIDI_NM_mpi_irecv.isra.79() c_binding.c:0
## 2 0x000000000010eed7 MPI_Recv() ???:0
## 3 0x0000000000401d24 main() /var/lib/jenkins-slave/workspace/mpich-review-ch4-ucx/jenkins_configure/am-only/label/centos64_review/test/mpi/errors/pt2pt/truncmsg2.c:114
## 4 0x0000000000022505 __libc_start_main() ???:0
## 5 0x0000000000401e86 _start() ???:0
## =================================
I can't reproduce.
test:mpich/ch4/ucx ✔️
test:mpich/ch4/ofi
Now the same test fails in ch4-ofi-am-only: ./errors/pt2pt/truncmsg2 2
Can't reproduce. So let me test it one more time - test:mpich/ch4/ofi ✔️
am-only truncmsg2 failures tracked in #6641