
ch4/ucx: using the ucx nbx am interface

Open hzhou opened this issue 4 years ago • 2 comments

Pull Request Description

The new ucx am interface supports an internal rendezvous protocol, which makes it possible to take advantage of direct GPU IPC, among other benefits. This PR implements the ucx am mode using this new am interface.

TODO

  • [ ] update the version check to ensure the GPU is supported in the new ucp_am_send_nbx api
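One way the version check in this TODO could look is sketched below. This is only an illustration: `version_at_least` is a hypothetical helper, not code from this PR, and the required version is left as a parameter because the PR does not say which UCX release supports GPU buffers in `ucp_am_send_nbx`. In a real build the installed version would presumably come from UCX itself, for example `ucp_get_version()` at run time or the `UCP_API_MAJOR` / `UCP_API_MINOR` macros at compile time.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of the PR): returns true when the
 * version (major, minor) is at least (req_major, req_minor).
 * The required version is an input because the PR leaves the
 * actual threshold for GPU support as a TODO. */
static bool version_at_least(unsigned major, unsigned minor,
                             unsigned req_major, unsigned req_minor)
{
    /* A differing major version decides the comparison on its own. */
    if (major != req_major)
        return major > req_major;
    /* Same major version: compare the minor version. */
    return minor >= req_minor;
}
```

A caller would pass the detected UCX version together with whatever threshold is eventually chosen, and fall back to the non-GPU path when the function returns false.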

Author Checklist

  • [x] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form "module: short description". Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

hzhou avatar Jun 27 '21 23:06 hzhou

test:mpich/ch4/most

hzhou avatar Jul 03 '21 20:07 hzhou

test:mpich/ch4/ucx

hzhou avatar Jul 10 '21 03:07 hzhou

test:mpich/ch4/ucx
test:mpich/ch4/gpu/ucx

hzhou avatar Mar 31 '23 14:03 hzhou

test:mpich/ch4/most

1 timeout in ch4-ofi-direct-nm: ./rma/pscw_ordering 4. I have never seen this before; will check.

1 failure in ch4-ucx-am-only:

not ok 2706 - ./errors/pt2pt/truncmsg2 2
  ---
  Directory: ./errors/pt2pt
  File: truncmsg2
  Num-procs: 2
  Timeout: 180
  Date: "Thu Aug 17 12:53:34 2023"
  ...
## Test output (expected 'No Errors'):
## [pmrs-centos64-240-04:116679:0:116679] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe8)
## ==== backtrace (tid: 116679) ====
##  0 0x00000000000363b0 killpg()  ???:0
##  1 0x00000000000498db MPIDI_NM_mpi_irecv.isra.79()  c_binding.c:0
##  2 0x000000000010eed7 MPI_Recv()  ???:0
##  3 0x0000000000401d24 main()  /var/lib/jenkins-slave/workspace/mpich-review-ch4-ucx/jenkins_configure/am-only/label/centos64_review/test/mpi/errors/pt2pt/truncmsg2.c:114
##  4 0x0000000000022505 __libc_start_main()  ???:0
##  5 0x0000000000401e86 _start()  ???:0
## =================================

I can't reproduce.

hzhou avatar Aug 17 '23 16:08 hzhou

test:mpich/ch4/ucx ✔️

hzhou avatar Aug 17 '23 19:08 hzhou

test:mpich/ch4/ofi

Now this one fails in ch4-ofi-am-only: ./errors/pt2pt/truncmsg2 2

hzhou avatar Aug 17 '23 19:08 hzhou

Can't reproduce, so let me test it one more time:

test:mpich/ch4/ofi ✔️

hzhou avatar Aug 17 '23 22:08 hzhou

The am-only truncmsg2 failures are tracked in #6641.

hzhou avatar Aug 18 '23 14:08 hzhou