
jenkins: ucx+gpu fails coll/reduce and coll/allred2

Open hzhou opened this issue 4 years ago • 2 comments

Both trigger a UCX internal error due to an invalid stream handle:

not ok  - ./coll/allred2 4
  ...
## Test output (expected 'No Errors'):
## [1635853399.519362] [pmrs-gpu-240-02:154094:0]    cuda_copy_ep.c:83   UCX  ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
## [pmrs-gpu-240-02:154094:0:154094]        rndv.c:873  Assertion `rreq->recv.remaining >= freq->send.length' failed: rreq->recv.remaining 0, freq->send.length 16384
## 
## /var/lib/jenkins-slave/workspace/mpich-main-ch4-gpu/compiler/intel/jenkins_configure/debug/label/gpu/netmod/ucx/mpich-main/modules/ucx/src/ucp/rndv/rndv.c: [ ucp_rndv_recv_frag_put_completion() ]
##       ...
##       868         ucs_trace_req("freq:%p: recv_frag_put done, rreq:%p ", freq, rreq);
##       869     }
##       870 
## ==>   871     ucs_assertv(rreq->recv.remaining >= freq->send.length,
##       872                 "rreq->recv.remaining %zu, freq->send.length %zu",
##       873                 rreq->recv.remaining, freq->send.length);
##       874     rreq->recv.remaining -= freq->send.length;
## 
## ==== backtrace (tid: 154094) ====
...
not ok  - ./coll/reduce 7
## Test output (expected 'No Errors'):
## [1635853412.359215] [pmrs-gpu-240-02:154306:0]    cuda_copy_ep.c:83   UCX  ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
## [1635853412.364972] [pmrs-gpu-240-02:154306:0]    cuda_copy_ep.c:83   UCX  ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
...
##  No Errors

hzhou avatar Nov 02 '21 21:11 hzhou

A narrowed-down reproducer:

[0]  ./allred2 -evenmemtype=device -oddmemtype=device
[0] TEST MPI_COMM_WORLD
[0]  count = 2000
[1] TEST MPI_COMM_WORLD
[1]  count = 2000
[2] TEST MPI_COMM_WORLD
[2]  count = 2000
[3] TEST MPI_COMM_WORLD
[3]  count = 2000
[0]  count = 4000
[1]  count = 4000
[2]  count = 4000
[3]  count = 4000
[0]  count = 8000
[1]  count = 8000
[2]  count = 8000
[3]  count = 8000
[0] TEST Dup of MPI_COMM_WORLD
[0]  count = 2000
[1] TEST Dup of MPI_COMM_WORLD
[1]  count = 2000
[2] TEST Dup of MPI_COMM_WORLD
[2]  count = 2000
[3] TEST Dup of MPI_COMM_WORLD
[3]  count = 2000
[0]  count = 4000
[1]  count = 4000
[3]  count = 4000
[2]  count = 4000
[1]  count = 8000
[0]  count = 8000
[3]  count = 8000
[2]  count = 8000
[0] TEST Rank reverse of MPI_COMM_WORLD
[0]  count = 2000
[1] TEST Rank reverse of MPI_COMM_WORLD
[1]  count = 2000
[2] TEST Rank reverse of MPI_COMM_WORLD
[2]  count = 2000
[3] TEST Rank reverse of MPI_COMM_WORLD
[3]  count = 2000
[2]  count = 4000
[3]  count = 4000
[1]  count = 4000
[0]  count = 4000
[3]  count = 8000
[2]  count = 8000
[1]  count = 8000
[0]  count = 8000
[2] [1636651855.793960] [pmrs-gpu-240-02:190038:0]   cuda_copy_ep.c:83   UCX  ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
[2] [pmrs-gpu-240-02:190038:0:190038]        rndv.c:762  Assertion `req->recv.remaining >= freq->send.length' failed: req->recv.remaining 0, freq->send.length 16000
[3] [1636651855.794659] [pmrs-gpu-240-02:190039:0]   cuda_copy_ep.c:83   UCX  ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
[3] [pmrs-gpu-240-02:190039:0:190039]        rndv.c:762  Assertion `req->recv.remaining >= freq->send.length' failed: req->recv.remaining 0, freq->send.length 16000

If we omit the smaller counts, the bug does not reproduce.

hzhou avatar Nov 11 '21 17:11 hzhou

I just tried the latest UCX master (commit 68fa8ee661deafc826716a72be88d629e5f41f38) and the test passed. Looking through the commit log, it is not clear which patch fixed it, but there were a few patches touching the AM rndv path.

hzhou avatar Nov 11 '21 18:11 hzhou