recv side worker fails with ucp_put_nbx()

Open Yiltan opened this issue 1 year ago • 2 comments

The recv-side worker fails with the error below when I call ucp_put_nbx().

I've error-checked ucp_mem_map, ucp_rkey_pack, ucp_ep_rkey_unpack, etc.

Any suggestions on why this error could occur? (This is my own UCX application.)

[orchid05:4081700:0:4081700] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:4081700) ====
 0 0x0000000000054db0 __GI___sigaction()  :0
 1 0x00000000000caa0c __memcpy_evex_unaligned_erms()  :0
 2 0x0000000000061925 ucp_put_handler()  /project/21abs10/ucx-1.14.1/build/src/ucp/../../../src/ucp/rma/rma_sw.c:163
 3 0x000000000003e1e0 uct_iface_invoke_am()  /project/21abs10/ucx-1.14.1/build/../src/uct/base/uct_iface.h:904
 4 0x000000000003e1e0 uct_rc_mlx5_iface_common_am_handler()  /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5.inl:410
 5 0x000000000003e1e0 uct_rc_mlx5_iface_common_poll_rx()  /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5.inl:1439
 6 0x000000000003e1e0 uct_rc_mlx5_iface_progress()  /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:185
 7 0x000000000003e1e0 uct_rc_mlx5_iface_progress_cyclic()  /project/21abs10/ucx-1.14.1/build/src/uct/ib/../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:195
 8 0x0000000000055908 ucs_callbackq_slow_proxy()  /project/21abs10/ucx-1.14.1/build/src/ucs/../../../src/ucs/datastruct/callbackq.c:404
 9 0x0000000000046a3a ucs_callbackq_dispatch()  /project/21abs10/ucx-1.14.1/build/../src/ucs/datastruct/callbackq.h:211
10 0x0000000000046a3a uct_worker_progress()  /project/21abs10/ucx-1.14.1/build/../src/uct/api/uct.h:2768
11 0x0000000000046a3a ucp_worker_progress()  /project/21abs10/ucx-1.14.1/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2814

Yiltan avatar Jan 22 '24 22:01 Yiltan

It seems like the target memory address is invalid. Can you share the test application?
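
For context, the faulting frame is a memcpy inside ucp_put_handler() at address (nil), which would be consistent with the remote_addr argument of ucp_put_nbx() being 0 or otherwise outside the mapped region when the put is handled in software on the target. A minimal sketch of a sanity check the initiator could run before issuing the put is shown here. The type remote_region_t and the function remote_range_ok() are hypothetical names for illustration, not part of UCX; the sketch assumes the target sends the base address and length of its ucp_mem_map()'d buffer out of band, alongside the packed rkey.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor (not a UCX type): what the target is assumed to
 * send to the initiator out of band, next to the packed rkey. */
typedef struct {
    uint64_t base;   /* target-side virtual address of the mapped buffer */
    uint64_t length; /* size of the mapped buffer in bytes */
} remote_region_t;

/* Hypothetical helper: return 1 if [remote_addr, remote_addr + len) lies
 * entirely inside the advertised region, 0 otherwise. A zero base or a
 * zero remote address is rejected outright, since it would make the
 * target-side copy write to NULL. */
static int remote_range_ok(const remote_region_t *r,
                           uint64_t remote_addr, uint64_t len)
{
    if (r == NULL || r->base == 0 || remote_addr == 0) {
        return 0;
    }
    if (remote_addr < r->base) {
        return 0;
    }
    /* Compare via offsets so that remote_addr + len cannot overflow. */
    uint64_t offset = remote_addr - r->base;
    if (offset > r->length) {
        return 0;
    }
    return len <= r->length - offset;
}
```

Calling remote_range_ok() with the same remote_addr and length that are about to be passed to ucp_put_nbx() would catch the case where the address was never received, or was received into the wrong variable, before it reaches the target.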

yosefe avatar Jan 23 '24 09:01 yosefe

I extracted the relevant code; it can be seen here: https://gist.github.com/Yiltan/648d19e8f6874b6c56222f1e07d47132

The worker progress call that crashes is on line 291. This happens inconsistently. If it doesn't crash, we hang, because the atomic doesn't update our target buffer.

Yiltan avatar Jan 23 '24 15:01 Yiltan