rndv.c:1719 Assertion `sreq->send.rndv.lanes_count > 0' failed
Describe the bug
When running osu_bibw or osu_bw Host-to-Device with 2 processes on the same node, I hit the assert: rndv.c:1719 Assertion `sreq->send.rndv.lanes_count > 0' failed. The other process then dies with a "bad address" error, or with "process not found". The device in this case is an AMD GPU.
Steps to Reproduce
- mpirun --verbose --report-bindings --mca btl ^tcp,ofi,vader --mca pml ucx --bind-to core --npernode 2 --mca mtl_base_verbose 10 --mca btl_base_verbose 10 --mca ob1_base_verbose 10 -np 2 --hostfile hostfile_osu osu_bibw -i 1 -d rocm -m 4194304:4194304 D H
- Both UCX 1.11.2 and 1.12 exhibit the same issue.
Device-to-Device and Host-to-Host runs of the test work. The assert is only hit with the "D H" or "H D" pairings in the osu_bibw runs, and with "H D" in the osu_bw run.
The assert backtrace is as follows:
rndv.c:1719 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid: 66070) ====
0 /lib/libucs.so.0(ucs_fatal_error_message+0xb6) [0x7f4ad9194846]
1 /lib/libucs.so.0(+0x43912) [0x7f4ad9194912]
2 /lib/libucp.so.0(+0x9a64b) [0x7f4ad93af64b]
3 /lib/libucp.so.0(ucp_rndv_atp_handler+0x275) [0x7f4ad93b0965]
4 /lib/libuct.so.0(+0x257ce) [0x7f4ad91e27ce]
5 /lib/libucp.so.0(ucp_worker_progress+0x86) [0x7f4ad937cc36]
6 /lib/libopen-pal.so.40(opal_progress+0x2d) [0x7f4aec77453d]
7 /lib/libmpi.so.40(ompi_request_default_wait_all+0x115) [0x7f4af7bd6ac5]
8 /lib/libmpi.so.40(PMPI_Waitall+0x8b) [0x7f4af7c18e3b]
9 /pt2pt/osu_bibw() [0x207a45]
10 /lib64/libc.so.6(__libc_start_main+0xed) [0x7f4af456d34d]
11 /pt2pt/osu_bibw() [0x20661a]
I also printed the backtrace at the point of the assert:
rndv.c:1701 UCX REQ req 0x9a5e80: /lib/libucp.so.0(ucp_rndv_progress_rma_put_zcopy+0x46) [0x7f4ad93af436]
rndv.c:1701 UCX REQ req 0x9a5e80: /lib/libucp.so.0(ucp_rndv_atp_handler+0x275) [0x7f4ad93b0965]
rndv.c:1701 UCX REQ req 0x9a5e80: /lib/libuct.so.0(+0x257ce) [0x7f4ad91e27ce]
rndv.c:1701 UCX REQ req 0x9a5e80: /lib/libucp.so.0(ucp_worker_progress+0x86) [0x7f4ad937cc36]
rndv.c:1701 UCX REQ req 0x9a5e80: /lib/libopen-pal.so.40(opal_progress+0x2d) [0x7f4aec77453d]
rndv.c:1701 UCX REQ req 0x9a5e80: /lib/libmpi.so.40(ompi_request_default_wait_all+0x115) [0x7f4af7bd6ac5]
rndv.c:1701 UCX REQ req 0x9a5e80: /lib/libmpi.so.40(PMPI_Waitall+0x8b) [0x7f4af7c18e3b]
rndv.c:1701 UCX REQ req 0x9a5e80: /pt2pt/osu_bibw() [0x207a45]
rndv.c:1701 UCX REQ req 0x9a5e80: /lib64/libc.so.6(__libc_start_main+0xed) [0x7f4af456d34d]
rndv.c:1701 UCX REQ req 0x9a5e80: /pt2pt/osu_bibw() [0x20661a]
Setup and versions
- Kernel: 5.3.18-59.16_11.0.39-cray_shasta_c #1 SMP Mon Nov 1 22:07:03 UTC 2021 (5921938) x86_64 x86_64 x86_64 GNU/Linux
Additional information (depending on the issue)
- OpenMPI version: ompi-4.1.2
- Available UCX transports and devices on the node:
# Memory domain: posix
# Component: posix
# Transport: posix
# Device: memory
# System device: <unknown>
# Memory domain: sysv
# Component: sysv
# Transport: sysv
# Device: memory
# System device: <unknown>
# Memory domain: self
# Component: self
# Transport: self
# Device: memory0
# System device: <unknown>
# Memory domain: tcp
# Component: tcp
# Transport: tcp
# Device: hsn2
# System device: <unknown>
# Transport: tcp
# Device: bond0
# System device: <unknown>
# Transport: tcp
# Device: hsn0
# System device: <unknown>
# Transport: tcp
# Device: hsn3
# System device: <unknown>
# Transport: tcp
# Device: lo
# System device: <unknown>
# Transport: tcp
# Device: hsn1
# System device: <unknown>
# Memory domain: rocm_cpy
# Component: rocm_cpy
# Transport: rocm_copy
# Device: rocm_cpy
# System device: <unknown>
# Memory domain: rocm_ipc
# Component: rocm_ipc
# Transport: rocm_ipc
# Device: rocm_ipc
# System device: <unknown>
# Memory domain: cma
# Component: cma
# Transport: cma
# Device: memory
# System device: <unknown>
# Memory domain: xpmem
# Component: xpmem
# Transport: xpmem
# Device: memory
# System device: <unknown>
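To make the listing above easier to scan, here is a throwaway awk sketch (not part of any UCX tool; the sample heredoc is just a subset of the listing above) that condenses the "# Memory domain / Transport / Device" entries into one line per device:

```shell
# Throwaway sketch: summarize a ucx_info-style transport listing.
# The sample below is a subset of the listing in this report.
cat > /tmp/ucx_listing.txt <<'EOF'
# Memory domain: tcp
#     Transport: tcp
#        Device: hsn2
#     Transport: tcp
#        Device: bond0
# Memory domain: rocm_ipc
#     Transport: rocm_ipc
#        Device: rocm_ipc
EOF
# Remember the current domain and transport, print one line per device.
awk -F': ' '
/Memory domain:/ { domain = $2 }
/Transport:/     { transport = $2 }
/ Device:/       { print domain ": " transport " on " $2 }
' /tmp/ucx_listing.txt
```

On the sample input this prints one "domain: transport on device" line per device (e.g. "tcp: tcp on hsn2").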
@amirpavlo Thank you for reporting. I can confirm that I can reproduce the problem. A quick workaround is to add the UCX_RNDV_SCHEME=put_zcopy environment variable to your command line, e.g.
mpirun -x UCX_RNDV_SCHEME=put_zcopy -np 2 ./mytest
I will post a more detailed analysis of the problem shortly.
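For completeness, the same workaround can be set through the environment instead of the mpirun -x flag; UCX_RNDV_SCHEME=put_zcopy forces the put-based rendezvous scheme instead of the default automatic selection:

```shell
# Equivalent to passing "-x UCX_RNDV_SCHEME=put_zcopy" to mpirun:
# export the variable before launching so UCX picks it up in every local rank.
export UCX_RNDV_SCHEME=put_zcopy
echo "UCX_RNDV_SCHEME=$UCX_RNDV_SCHEME"
```

Note that this pins one rendezvous scheme globally for the run, so it is a workaround rather than a fix.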
@edgargabriel I'm running into the same issue, was it ever resolved?
I can confirm the same issue still happens with UCX v1.13.1 and ompi v4.1.4.
I think it should be resolved in the upcoming 1.14.0 release; otherwise, as I mentioned above, I recommend passing the -x UCX_RNDV_SCHEME=put_zcopy argument to mpirun.
@koomie in fact, if you have time, could you test your code with the ucx-1.14.0-rc2 release to see whether the issue is resolved there? That would be valuable input/feedback! (See https://github.com/openucx/ucx/releases for downloading the tar-ball)
Sure. I can hopefully do that later today and get back to you.
Similar issue with the 1.14.0 RC2 tarball, I'm afraid. Setting UCX_RNDV_SCHEME=put_zcopy is still required to avoid the assert error for me.
Just for documentation purposes: we are communicating and working on this off-list.
Quick update in my case: things are working for me now with the ucx 1.14.0 rc2 release. The issue in my environment was trying to use UCX + ROCm with the upstream OFED drivers in RHEL 9.1. Switching to the MLNX OFED drivers resolved my particular issue, and I no longer need to set UCX_RNDV_SCHEME=put_zcopy.
Thanks to @edgargabriel for the help sleuthing.