recv side worker fails with ucp_put_nbx()
The receive-side worker fails with the error below when I call ucp_put_nbx().
I've error-checked ucp_mem_map, ucp_rkey_pack, ucp_ep_rkey_unpack, etc.
Any suggestions on why this error could occur? (This is my own UCX application.)
[orchid05:4081700:0:4081700] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:4081700) ====
0 0x0000000000054db0 __GI___sigaction() :0
1 0x00000000000caa0c __memcpy_evex_unaligned_erms() :0
2 0x0000000000061925 ucp_put_handler() /project/21abs10/ucx-1.14.1/build/src/ucp/../../../src/ucp/rma/rma_sw.c:163
3 0x000000000003e1e0 uct_iface_invoke_am() /project/21abs10/ucx-1.14.1/build/../src/uct/base/uct_iface.h:904
4 0x000000000003e1e0 uct_rc_mlx5_iface_common_am_handler() /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5.inl:410
5 0x000000000003e1e0 uct_rc_mlx5_iface_common_poll_rx() /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5.inl:1439
6 0x000000000003e1e0 uct_rc_mlx5_iface_progress() /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:185
7 0x000000000003e1e0 uct_rc_mlx5_iface_progress_cyclic() /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:195
8 0x0000000000055908 ucs_callbackq_slow_proxy() /project/21abs10/ucx-1.14.1/build/src/ucs/../../../src/ucs/datastruct/callbackq.c:404
9 0x0000000000046a3a ucs_callbackq_dispatch() /project/21abs10/ucx-1.14.1/build/../src/ucs/datastruct/callbackq.h:211
10 0x0000000000046a3a uct_worker_progress() /project/21abs10/ucx-1.14.1/build/../src/uct/api/uct.h:2768
11 0x0000000000046a3a ucp_worker_progress() /project/21abs10/ucx-1.14.1/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2814
It seems the target memory address is invalid. Can you share the test application?
I extracted the relevant code; it can be seen here: https://gist.github.com/Yiltan/648d19e8f6874b6c56222f1e07d47132
The worker progress call that crashes is on line 291. This happens inconsistently; when it doesn't crash, we hang because the atomic doesn't update our target buffer.
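Following up on the suggestion that the target address is invalid: the backtrace shows memcpy faulting on (nil) inside ucp_put_handler, which suggests the address that reached the target was 0. One way to narrow this down is to sanity-check the remote address on the initiator before calling ucp_put_nbx(). The sketch below is plain C and does not call UCX. `remote_region_t` and `remote_range_is_valid` are hypothetical names, and it assumes the target's base address and registered length were exchanged out of band alongside the packed rkey.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for the target buffer, as received out of band.
 * Not a UCX type; the field names are assumptions for illustration. */
typedef struct {
    uint64_t base;   /* address of the registered buffer on the target */
    size_t   length; /* registered length in bytes */
} remote_region_t;

/* Returns true only if [remote_addr, remote_addr + count) lies entirely
 * inside the region and neither address is 0. */
static bool remote_range_is_valid(const remote_region_t *r,
                                  uint64_t remote_addr, size_t count)
{
    assert(r != NULL);

    if (r->base == 0 || remote_addr == 0) {
        return false; /* an unset address would fault at (nil) on the target */
    }
    if (remote_addr < r->base) {
        return false; /* starts before the registered region */
    }

    uint64_t offset = remote_addr - r->base;
    if (offset > r->length) {
        return false; /* starts past the end of the region */
    }

    /* Compare against the remaining space to avoid overflow in offset + count */
    return count <= r->length - offset;
}
```

If this check fails right before the put, the address exchange is the likely culprit (for example, the address being read before the exchange completes) rather than the rkey handling.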