jenkins: ucx+gpu fails coll/reduce and coll/allred2
Both tests trigger a UCX internal error due to an invalid stream handle:
not ok - ./coll/allred2 4
...
## Test output (expected 'No Errors'):
## [1635853399.519362] [pmrs-gpu-240-02:154094:0] cuda_copy_ep.c:83 UCX ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
## [pmrs-gpu-240-02:154094:0:154094] rndv.c:873 Assertion `rreq->recv.remaining >= freq->send.length' failed: rreq->recv.remaining 0, freq->send.length 16384
##
## /var/lib/jenkins-slave/workspace/mpich-main-ch4-gpu/compiler/intel/jenkins_configure/debug/label/gpu/netmod/ucx/mpich-main/modules/ucx/src/ucp/rndv/rndv.c: [ ucp_rndv_recv_frag_put_completion() ]
## ...
## 868 ucs_trace_req("freq:%p: recv_frag_put done, rreq:%p ", freq, rreq);
## 869 }
## 870
## ==> 871 ucs_assertv(rreq->recv.remaining >= freq->send.length,
## 872 "rreq->recv.remaining %zu, freq->send.length %zu",
## 873 rreq->recv.remaining, freq->send.length);
## 874 rreq->recv.remaining -= freq->send.length;
##
## ==== backtrace (tid: 154094) ====
...
not ok - ./coll/reduce 7
## Test output (expected 'No Errors'):
## [1635853412.359215] [pmrs-gpu-240-02:154306:0] cuda_copy_ep.c:83 UCX ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
## [1635853412.364972] [pmrs-gpu-240-02:154306:0] cuda_copy_ep.c:83 UCX ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
...
## No Errors
A narrowed-down reproducer:
[0] ./allred2 -evenmemtype=device -oddmemtype=device
[0] TEST MPI_COMM_WORLD
[0] count = 2000
[1] TEST MPI_COMM_WORLD
[1] count = 2000
[2] TEST MPI_COMM_WORLD
[2] count = 2000
[3] TEST MPI_COMM_WORLD
[3] count = 2000
[0] count = 4000
[1] count = 4000
[2] count = 4000
[3] count = 4000
[0] count = 8000
[1] count = 8000
[2] count = 8000
[3] count = 8000
[0] TEST Dup of MPI_COMM_WORLD
[0] count = 2000
[1] TEST Dup of MPI_COMM_WORLD
[1] count = 2000
[2] TEST Dup of MPI_COMM_WORLD
[2] count = 2000
[3] TEST Dup of MPI_COMM_WORLD
[3] count = 2000
[0] count = 4000
[1] count = 4000
[3] count = 4000
[2] count = 4000
[1] count = 8000
[0] count = 8000
[3] count = 8000
[2] count = 8000
[0] TEST Rank reverse of MPI_COMM_WORLD
[0] count = 2000
[1] TEST Rank reverse of MPI_COMM_WORLD
[1] count = 2000
[2] TEST Rank reverse of MPI_COMM_WORLD
[2] count = 2000
[3] TEST Rank reverse of MPI_COMM_WORLD
[3] count = 2000
[2] count = 4000
[3] count = 4000
[1] count = 4000
[0] count = 4000
[3] count = 8000
[2] count = 8000
[1] count = 8000
[0] count = 8000
[2] [1636651855.793960] [pmrs-gpu-240-02:190038:0] cuda_copy_ep.c:83 UCX ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
[2] [pmrs-gpu-240-02:190038:0:190038] rndv.c:762 Assertion `req->recv.remaining >= freq->send.length' failed: req->recv.remaining 0, freq->send.length 16000
[3] [1636651855.794659] [pmrs-gpu-240-02:190039:0] cuda_copy_ep.c:83 UCX ERROR cudaEventRecord(cuda_event->event, iface->stream[id])() failed: invalid resource handle
[3] [pmrs-gpu-240-02:190039:0:190039] rndv.c:762 Assertion `req->recv.remaining >= freq->send.length' failed: req->recv.remaining 0, freq->send.length 16000
If we omit the smaller counts, the bug doesn't reproduce.
Just tried the latest ucx master (commit 68fa8ee661deafc826716a72be88d629e5f41f38) and the test passed. Looking through the commit log, it is not clear which patch fixed it, but a few patches touched the am rndv path.