ucx
ucx copied to clipboard
Failed assertion using active messaging in multi-threaded environment
Describe the bug
Using active messaging (ucp_am_send_nbx
) on the same ~(TCP)~ endpoint from withing two different threads results in a failed assertion, although the library is compiled with multithreading support and all corresponding flags are set.
proto_am.inl:257 Assertion `max_iov > 1' failed
==== backtrace (tid: 33396) ====
0 /usr/lib/libucs.so.0(ucs_handle_error+0x2c4) [0x7ffff786a404]
1 /usr/lib/libucs.so.0(ucs_fatal_error_message+0xc9) [0x7ffff7866f09]
2 /usr/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7ffff7866ff1]
3 /usr/lib/libucp.so.0(+0x20f18) [0x7ffff7e4df18]
4 /usr/lib/libucp.so.0(ucp_wireup_replay_pending_requests+0x49) [0x7ffff7f21839]
5 /usr/lib/libucp.so.0(ucp_wireup_init_lanes+0x1822) [0x7ffff7f232c2]
6 /usr/lib/libucp.so.0(+0xf7ba5) [0x7ffff7f24ba5]
7 /usr/lib/libucp.so.0(+0xf817e) [0x7ffff7f2517e]
8 /usr/lib/libuct.so.0(+0x2752f) [0x7ffff78d552f]
9 /usr/lib/libuct.so.0(+0x2ce66) [0x7ffff78dae66]
10 /usr/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7ffff7878eb1]
11 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x8f) [0x7ffff78d9aaf]
12 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7ffff7e7f31a]
Context Parameters
ucp_params_t context_params{};
context_params.features = UCP_FEATURE_AM;
context_params.mt_workers_shared = 1;
context_params.field_mask = UCP_PARAM_FIELD_FEATURES | UCP_PARAM_FIELD_MT_WORKERS_SHARED;
Worker Parameters
ucp_worker_params_t worker_params{};
worker_params.thread_mode = UCS_THREAD_MODE_MULTI;
worker_params.field_mask = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
Steps to Reproduce
- UCX version 1.12.1
- Configured with
contrib/configure-devel
(which contains--enable-mt
)
Setup and versions
- TCP Connection (no InfiniBand hardware)
-
Linux 5.17.1-3-MANJARO
(x86_64) -
rdma-core 39.0
I created a small minimal working example program where each of two threads sends messages using ucp_am_send_nbx
.
https://gist.github.com/krakowski/8a30c6bf3e94b727e9e1ac268cbb8045
The assertion does not always fail, as it works most of the time. If it fails, then it is almost always at the beginning of the program which makes me think if this is related to connection establishment happening at the same time from both threads due to bad timing. Attaching gdb
after the crash shows the following.
@krakowski I tried to reproduce the issue using you reproducer and setting UCX_TLS=tcp
. But it works fine with 1.12.1 and master.
@dmitrygx Thanks for looking into this!
It works for me most of the time, too. I'm not really sure how I can reproduce this reliably. If I hit the right timing it crashes with the mentioned failed assertion.
I noticed that the connection gets established as soon as the first message is sent out. Could there be some synchronization error regarding connection establishment happening in two separate threads at the same time?
It works for me most of the time, too.
how many iterations does it take to catch this?
I noticed that the connection gets established as soon as the first message is sent out. Could there be some synchronization error regarding connection establishment happening in two separate threads at the same time?
yes, it could be, but I checked that we acquire locks when processing connection packets and as a result reposting user's AM operations.
@dmitrygx I just updated my gist. This version crashes every time I try to run it.
https://gist.github.com/krakowski/8a30c6bf3e94b727e9e1ac268cbb8045
Things I changed:
- Use the same buffer for
ucp_am_send_nbx
sheader
andbuffer
parameters. In the first version I just sent the header. - Use 8 Threads for sending messages.
- Make all threads wait just in front of the first
ucp_am_send_nbx
call.
@krakowski thanks! but it still works fine using this dc9243556ec10178fc9fc2f8be5e17f2bf79cd19 commit. see the output of the client (it is mixed due to threads contention):
$UCX_TLS=tcp ./ucp_am_mt client
>> Initializing context
>> Creating worker
>> Establishing connection
>> Connection established
>> Sending messages
[[[[140596752672512[140596744279808] Sent 140596710708992] Sent ] Sent 140596838344448] Sent 0 messages0 messages
140596719101696] Sent 0 messages
0 messages
[140596727494400] Sent 0 messages0 messages
[140596761065216] Sent 0 messages
[140596735887104] Sent 0 messages
[140596838344448] Sent 1000000 messages
[140596719101696] Sent 1000000 messages
[140596761065216] Sent 1000000 messages
[140596761065216] Sent 2000000 messages
[140596838344448] Sent 2000000 messages
[140596719101696] Sent 2000000 messages
[140596727494400] Sent 1000000 messages
[140596752672512] Sent 1000000 messages
[140596719101696] Sent 3000000 messages
[140596761065216] Sent 3000000 messages
[140596838344448] Sent 3000000 messages
[140596719101696] Sent 4000000 messages
[140596727494400] Sent 2000000 messages
[140596761065216] Sent 4000000 messages
[140596838344448] Sent 4000000 messages
[140596752672512] Sent 2000000 messages
[140596710708992] Sent 1000000 messages
[140596727494400] Sent 3000000 messages
[140596710708992] Sent 2000000 messages
[140596710708992] Sent 3000000 messages
[140596710708992] Sent 4000000 messages
[140596727494400] Sent 4000000 messages
[140596744279808] Sent 1000000 messages
[140596752672512] Sent 3000000 messages
[140596744279808] Sent 2000000 messages
[140596744279808] Sent 3000000 messages
[140596744279808] Sent 4000000 messages
[140596752672512] Sent 4000000 messages
[140596735887104] Sent 1000000 messages
[140596735887104] Sent 2000000 messages
[140596735887104] Sent 3000000 messages
[140596735887104] Sent 4000000 messages
>> Finished. Shutting down...
Strange... I will try to debug it further tomorrow to see why the assertion fails on my machine.
I also could not really find out what max_iov
stands for. Can you please explain briefly what the assertion checks? This could help me in tracking down what is happening. Thanks for helping out! 🙂
@dmitrygx I managed to narrow down the problem a bit more. The crash does only occur, when UCX chooses shm
and tcp
as transports. Running the demo app with UCX_TLS=tcp
works for me, too. As soon as I add shm
(UCX_TLS=shm,tcp
), it crashes immediately. Is this expected to happen with this kind of configuration? I started the demo app without explicitly specifying the transport (assuming it would choose tcp
), in which case it obviously choose shm,tcp
(due to UCX_TLS=all
) automatically based on my local available transports.
$ UCX_TLS=shm,tcp ./openucx_examples client
>> Initializing context
>> Creating worker
>> Establishing connection
>> Connection established
>> Sending messages
proto_am.inl:257 Assertion `max_iov > 1' failed
==== backtrace (tid: 3337) ====
0 /usr/lib/libucs.so.0(ucs_handle_error+0x2c4) [0x7f568b854404]
1 /usr/lib/libucs.so.0(ucs_fatal_error_message+0xc9) [0x7f568b850f09]
2 /usr/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7f568b850ff1]
3 /usr/lib/libucp.so.0(+0x20f18) [0x7f568be37f18]
4 /usr/lib/libucp.so.0(ucp_wireup_replay_pending_requests+0x49) [0x7f568bf0b839]
5 /usr/lib/libucp.so.0(ucp_wireup_init_lanes+0x1822) [0x7f568bf0d2c2]
6 /usr/lib/libucp.so.0(+0xf7ba5) [0x7f568bf0eba5]
7 /usr/lib/libucp.so.0(+0xf817e) [0x7f568bf0f17e]
8 /usr/lib/libuct.so.0(+0x2752f) [0x7f568b8bf52f]
9 /usr/lib/libuct.so.0(+0x2ce66) [0x7f568b8c4e66]
10 /usr/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7f568b862eb1]
11 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x8f) [0x7f568b8c3aaf]
12 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7f568be6931a]
13 ./openucx_examples(+0x23ac) [0x55929bda73ac]
14 ./openucx_examples(+0x245c) [0x55929bda745c]
15 ./openucx_examples(+0x3e4d) [0x55929bda8e4d]
16 ./openucx_examples(+0x3d48) [0x55929bda8d48]
17 ./openucx_examples(+0x3c3f) [0x55929bda8c3f]
18 ./openucx_examples(+0x3bc0) [0x55929bda8bc0]
19 ./openucx_examples(+0x3ba4) [0x55929bda8ba4]
20 /usr/lib/libstdc++.so.6(+0xd54d4) [0x7f568bcc64d4]
21 /usr/lib/libc.so.6(+0x8d5c2) [0x7f568b9715c2]
22 /usr/lib/libc.so.6(clone+0x44) [0x7f568b9f6584]
=================================
I also could not really find out what
max_iov
stands for. Can you please explain briefly what the assertion checks?
it expects that transport's supported number of IOVs will be greater than 1
, since we need to pack user's header to the IOV.
but the problem not in the max_iov
, the problem is in EP reconfiguration from one thread while another thread relies on the previous configuration.
The crash does only occur, when UCX chooses shm and tcp as transports.
I tried UCX_TLS=shm,tcp
, but it still works fine :)
$UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./ucp_am_mt client
>> Initializing context
[1651306477.580326] [swx-ucx01:420 :0] ucp_context.c:1778 UCX INFO UCP version is 1.12 (release 1)
>> Creating worker
[1651306477.626438] [swx-ucx01:420 :0] parser.c:1914 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_TLS=shm,tcp
>> Establishing connection
>> Connection established
>> Sending messages
[1651306477.632434] [swx-ucx01:420 :a] rdmacm_cm.c:351 UCX DIAG [cep 0x1bea840 127.0.0.1:55822->127.0.0.1:2998 client Success]: rdma_resolve_route failed: No such device
[1651306477.632511] [swx-ucx01:420 :a] uct_cm.c:99 UCX DIAG resolve callback failed with error: Destination is unreachable
[1651306477.635771] [swx-ucx01:420 :1] ucp_worker.c:1866 UCX INFO ep_cfg[2]: tag(tcp/lo); rma_am(tcp/lo); am(tcp/lo);
[1651306477.636067] [swx-ucx01:420 :2] ucp_worker.c:1866 UCX INFO ep_cfg[3]: tag(xpmem/memory posix/memory cma/memory); rma(xpmem/memory); am(xpmem/memory posix/memory cma/memory);
[139999265670912] Sent 0 messages
[139999282456320] Sent 0 messages
[139999366797056] Sent 0 messages
[139999375189760] Sent [139999240972032] Sent 00 messages messages
[139999290849024] Sent 0 messages
[139999274063616] Sent 0 messages
[139999383582464] Sent 0 messages
[139999265670912] Sent 1000000 messages
[139999366797056] Sent 1000000 messages
[139999265670912] Sent 2000000 messages
[139999366797056] Sent 2000000 messages
[139999265670912] Sent 3000000 messages
[139999375189760] Sent 1000000 messages
[139999282456320] Sent 1000000 messages
[139999240972032] Sent 1000000 messages
[139999366797056] Sent 3000000 messages
[139999265670912] Sent 4000000 messages
[139999290849024] Sent 1000000 messages
[139999282456320] Sent 2000000 messages
[139999366797056] Sent 4000000 messages
[139999282456320] Sent 3000000 messages
[139999375189760] Sent 2000000 messages
[139999383582464] Sent 1000000 messages
[139999290849024] Sent 2000000 messages
[139999282456320] Sent 4000000 messages
[139999240972032] Sent 2000000 messages
[139999383582464] Sent 2000000 messages
[139999383582464] Sent 3000000 messages
[139999290849024] Sent 3000000 messages
[139999383582464] Sent 4000000 messages
[139999290849024] Sent 4000000 messages
[139999375189760] Sent 3000000 messages
[139999274063616] Sent 1000000 messages
[139999240972032] Sent 3000000 messages
[139999240972032] Sent 4000000 messages
[139999375189760] Sent 4000000 messages
[139999274063616] Sent 2000000 messages
[139999274063616] Sent 3000000 messages
[139999274063616] Sent 4000000 messages
>> Finished. Shutting down...
Could pls send an output with UCX_LOG_LEVEL=info
set? I'm going to compare our outputs to try understand that difference between our tests.
~Sure, here is the output:~
Edit
I updated the demo to use UCX's logging functions for a more readable output : https://gist.github.com/krakowski/8a30c6bf3e94b727e9e1ac268cbb8045
Here's the updated output:
Client Side
$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples client
[1651312607.198865] [alpha:9936 :0] main.cpp:210 UCX INFO Initializing context...
[1651312607.198894] [alpha:9936 :0] ucp_context.c:1778 UCX INFO UCP version is 1.12 (release 1)
[1651312607.202592] [alpha:9936 :0] main.cpp:223 UCX INFO Creating worker...
[1651312607.207853] [alpha:9936 :0] rdmacm_cm.c:917 UCX DIAG rdma_create_event_channel failed: No such device
[1651312607.207877] [alpha:9936 :0] ucp_worker.c:1361 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651312607.207928] [alpha:9936 :0] parser.c:1914 UCX INFO UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651312607.207965] [alpha:9936 :0] main.cpp:72 UCX INFO Initializing endpoint...
[1651312607.208193] [alpha:9936 :0] main.cpp:89 UCX INFO Endpoint initialized
[1651312607.219992] [alpha:9936 :1] main.cpp:54 UCX INFO Sending messages...
[1651312607.219994] [alpha:9936 :2] main.cpp:54 UCX INFO Sending messages...
[1651312607.220157] [alpha:9936 :4] main.cpp:54 UCX INFO Sending messages...
[1651312607.220000] [alpha:9936 :3] main.cpp:54 UCX INFO Sending messages...
[1651312607.222179] [alpha:9936 :5] main.cpp:54 UCX INFO Sending messages...
[1651312607.225061] [alpha:9936 :6] main.cpp:54 UCX INFO Sending messages...
[1651312607.232107] [alpha:9936 :7] main.cpp:54 UCX INFO Sending messages...
[1651312607.238712] [alpha:9936 :8] main.cpp:54 UCX INFO Sending messages...
[1651312607.266438] [alpha:9936 :1] ucp_worker.c:1866 UCX INFO ep_cfg[2]: tag(tcp/lo); rma_am(tcp/lo); am(tcp/lo);
[1651312607.272714] [alpha:9936 :8] ucp_worker.c:1866 UCX INFO ep_cfg[3]: tag(posix/memory cma/memory); rma(posix/memory sysv/memory); am(posix/memory cma/memory);
[alpha:9936 :8:9944] proto_am.inl:257 Assertion `max_iov > 1' failed
==== backtrace (tid: 9944) ====
0 /usr/lib/libucs.so.0(ucs_handle_error+0x2c4) [0x7faf78230404]
1 /usr/lib/libucs.so.0(ucs_fatal_error_message+0xc9) [0x7faf7822cf09]
2 /usr/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7faf7822cff1]
3 /usr/lib/libucp.so.0(+0x20f18) [0x7faf78292f18]
4 /usr/lib/libucp.so.0(ucp_wireup_replay_pending_requests+0x49) [0x7faf78366839]
5 /usr/lib/libucp.so.0(ucp_wireup_init_lanes+0x1822) [0x7faf783682c2]
6 /usr/lib/libucp.so.0(+0xf7ba5) [0x7faf78369ba5]
7 /usr/lib/libucp.so.0(+0xf817e) [0x7faf7836a17e]
8 /usr/lib/libuct.so.0(+0x2752f) [0x7faf77ca252f]
9 /usr/lib/libuct.so.0(+0x2ce66) [0x7faf77ca7e66]
10 /usr/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7faf7823eeb1]
11 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x8f) [0x7faf77ca6aaf]
12 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7faf782c431a]
13 ./openucx_examples(+0x23b6) [0x561e2b1a63b6]
14 ./openucx_examples(+0x24b2) [0x561e2b1a64b2]
15 ./openucx_examples(+0x4041) [0x561e2b1a8041]
16 ./openucx_examples(+0x3f3c) [0x561e2b1a7f3c]
17 ./openucx_examples(+0x3e33) [0x561e2b1a7e33]
18 ./openucx_examples(+0x3db4) [0x561e2b1a7db4]
19 ./openucx_examples(+0x3d98) [0x561e2b1a7d98]
20 /usr/lib/libstdc++.so.6(+0xd54d4) [0x7faf780ab4d4]
21 /usr/lib/libc.so.6(+0x8d5c2) [0x7faf77d565c2]
22 /usr/lib/libc.so.6(clone+0x44) [0x7faf77ddb584]
=================================
[alpha:9936 :8:9944] Process frozen, press Enter to attach a debugger...
Server Side
$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples server
[1651312601.072455] [alpha:9933 :0] main.cpp:210 UCX INFO Initializing context...
[1651312601.072496] [alpha:9933 :0] ucp_context.c:1778 UCX INFO UCP version is 1.12 (release 1)
[1651312601.078238] [alpha:9933 :0] main.cpp:223 UCX INFO Creating worker...
[1651312601.086293] [alpha:9933 :0] rdmacm_cm.c:917 UCX DIAG rdma_create_event_channel failed: No such device
[1651312601.086315] [alpha:9933 :0] ucp_worker.c:1361 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651312601.086375] [alpha:9933 :0] parser.c:1914 UCX INFO UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651312601.086417] [alpha:9933 :0] main.cpp:111 UCX INFO Registering active message handler...
[1651312601.086425] [alpha:9933 :0] main.cpp:135 UCX INFO Initializing listener...
[1651312601.086472] [alpha:9933 :0] main.cpp:161 UCX INFO Waiting on client connection...
[1651312607.249639] [alpha:9933 :0] main.cpp:168 UCX INFO Accepting client connection...
[1651312607.249892] [alpha:9933 :0] ucp_worker.c:1866 UCX INFO ep_cfg[0]: tag(tcp/lo); rma_am(tcp/lo); am(tcp/lo);
[1651312607.257064] [alpha:9933 :0] main.cpp:180 UCX INFO Connection established
@krakowski thansk for collecting inforamtion and helping me with the reproduction!
I still can't reproduce the issue locally, but I understand what's wrong now.
Actually, it seems that it is not related to the multi-threaded usage of UCX. The problem is reconfiguration from TCP to SHM transport which don't support ZCOPY protocol. So, messages submitted before recinfiguring to SHM choose ZCOPY. because TCP supports this protocol. But then they can't be sent through SHM. Hence, we got reasonable assertion that max_iov
is 0
.
I prepared a possible solution which could help you - https://github.com/dmitrygx/ucx/tree/topic/ucp/ep_reconfig. Could you try it and share a feedback, please?
@dmitrygx Your solution works! I noticed that the first transport lane after reconfiguration changed from posix/memory
to sysv/memory
. Here's the output:
Client Side
$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples client
[1651415190.175301] [alpha:35406:0] main.cpp:210 UCX INFO Initializing context...
[1651415190.175345] [alpha:35406:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
[1651415190.179184] [alpha:35406:0] main.cpp:223 UCX INFO Creating worker...
[1651415190.185263] [alpha:35406:0] rdmacm_cm.c:928 UCX DIAG rdma_create_event_channel failed: No such device
[1651415190.185281] [alpha:35406:0] ucp_worker.c:1462 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651415190.185320] [alpha:35406:0] parser.c:1916 UCX INFO UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651415190.185346] [alpha:35406:0] main.cpp:72 UCX INFO Initializing endpoint...
[1651415190.185536] [alpha:35406:0] main.cpp:89 UCX INFO Endpoint initialized
[1651415190.199061] [alpha:35406:1] main.cpp:54 UCX INFO Sending messages...
[1651415190.199061] [alpha:35406:2] main.cpp:54 UCX INFO Sending messages...
[1651415190.199684] [alpha:35406:2] ucp_worker.c:1763 UCX INFO ep_cfg[1]: tag(tcp/lo) rma_am(tcp/lo) am(tcp/lo)
[1651415190.200091] [alpha:35406:7] main.cpp:54 UCX INFO Sending messages...
[1651415190.199062] [alpha:35406:3] main.cpp:54 UCX INFO Sending messages...
[1651415190.199071] [alpha:35406:4] main.cpp:54 UCX INFO Sending messages...
[1651415190.199077] [alpha:35406:5] main.cpp:54 UCX INFO Sending messages...
[1651415190.199089] [alpha:35406:6] main.cpp:54 UCX INFO Sending messages...
[1651415190.231238] [alpha:35406:8] main.cpp:54 UCX INFO Sending messages...
[1651415190.271453] [alpha:35406:1] ucp_worker.c:1763 UCX INFO ep_cfg[2]: tag(tcp/lo) rma_am(tcp/lo) am(tcp/lo)
[1651415190.278034] [alpha:35406:1] ucp_worker.c:1763 UCX INFO ep_cfg[3]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) am(sysv/memory cma/memory)
[1651415190.279599] [alpha:35406:5] main.cpp:63 UCX INFO Sent 0 messages
[1651415190.279602] [alpha:35406:1] main.cpp:63 UCX INFO Sent 0 messages
[1651415190.280657] [alpha:35406:2] main.cpp:63 UCX INFO Sent 0 messages
[1651415190.281234] [alpha:35406:7] main.cpp:63 UCX INFO Sent 0 messages
[1651415190.285882] [alpha:35406:8] main.cpp:63 UCX INFO Sent 0 messages
[1651415190.285882] [alpha:35406:6] main.cpp:63 UCX INFO Sent 0 messages
[1651415190.324573] [alpha:35406:4] main.cpp:63 UCX INFO Sent 0 messages
[1651415190.384571] [alpha:35406:3] main.cpp:63 UCX INFO Sent 0 messages
[1651415194.693555] [alpha:35406:2] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415197.677911] [alpha:35406:1] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415198.418036] [alpha:35406:4] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415198.886526] [alpha:35406:6] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415199.114267] [alpha:35406:3] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415199.166545] [alpha:35406:5] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415199.724646] [alpha:35406:8] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415200.636522] [alpha:35406:7] main.cpp:63 UCX INFO Sent 1000000 messages
[1651415200.882846] [alpha:35406:2] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415204.444358] [alpha:35406:4] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415204.816403] [alpha:35406:5] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415206.214550] [alpha:35406:1] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415206.385602] [alpha:35406:8] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415206.975833] [alpha:35406:2] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415207.601522] [alpha:35406:3] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415208.547422] [alpha:35406:6] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415208.750911] [alpha:35406:7] main.cpp:63 UCX INFO Sent 2000000 messages
[1651415211.744112] [alpha:35406:5] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415212.593172] [alpha:35406:1] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415213.877344] [alpha:35406:8] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415214.351052] [alpha:35406:4] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415214.617614] [alpha:35406:3] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415215.926639] [alpha:35406:7] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415215.929791] [alpha:35406:6] main.cpp:63 UCX INFO Sent 3000000 messages
[1651415216.351539] [alpha:35406:2] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415217.607853] [alpha:35406:2] main.cpp:67 UCX INFO Finished sending messages
[1651415218.577598] [alpha:35406:5] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415218.647513] [alpha:35406:4] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415219.015270] [alpha:35406:5] main.cpp:67 UCX INFO Finished sending messages
[1651415219.259772] [alpha:35406:7] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415219.407499] [alpha:35406:4] main.cpp:67 UCX INFO Finished sending messages
[1651415219.427268] [alpha:35406:1] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415219.530426] [alpha:35406:7] main.cpp:67 UCX INFO Finished sending messages
[1651415219.668728] [alpha:35406:1] main.cpp:67 UCX INFO Finished sending messages
[1651415219.710800] [alpha:35406:6] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415219.767727] [alpha:35406:3] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415219.819378] [alpha:35406:6] main.cpp:67 UCX INFO Finished sending messages
[1651415219.825500] [alpha:35406:3] main.cpp:67 UCX INFO Finished sending messages
[1651415219.832556] [alpha:35406:8] main.cpp:63 UCX INFO Sent 4000000 messages
[1651415219.846594] [alpha:35406:8] main.cpp:67 UCX INFO Finished sending messages
[1651415219.846683] [alpha:35406:0] main.cpp:106 UCX INFO Finished. Shutting down...
Server Side
$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples server
[1651415187.472526] [alpha:35403:0] main.cpp:210 UCX INFO Initializing context...
[1651415187.472564] [alpha:35403:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
[1651415187.478646] [alpha:35403:0] main.cpp:223 UCX INFO Creating worker...
[1651415187.484677] [alpha:35403:0] rdmacm_cm.c:928 UCX DIAG rdma_create_event_channel failed: No such device
[1651415187.484693] [alpha:35403:0] ucp_worker.c:1462 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651415187.484730] [alpha:35403:0] parser.c:1916 UCX INFO UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651415187.484754] [alpha:35403:0] main.cpp:111 UCX INFO Registering active message handler...
[1651415187.484762] [alpha:35403:0] main.cpp:135 UCX INFO Initializing listener...
[1651415187.484791] [alpha:35403:0] main.cpp:161 UCX INFO Waiting on client connection...
[1651415190.251460] [alpha:35403:0] main.cpp:168 UCX INFO Accepting client connection...
[1651415190.251582] [alpha:35403:0] ucp_worker.c:1763 UCX INFO ep_cfg[0]: tag(tcp/lo) rma_am(tcp/lo) am(tcp/lo)
[1651415190.252798] [alpha:35403:0] main.cpp:180 UCX INFO Connection established
[1651415190.278317] [alpha:35403:0] ucp_worker.c:1763 UCX INFO ep_cfg[1]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) am(sysv/memory cma/memory)
[1651415219.846685] [alpha:35403:0] main.cpp:187 UCX INFO Received last message. Shutting down...
@krakowski thanks for testing it! I'll think about upstreaming it.
I noticed that the first transport lane after reconfiguration changed from posix/memory to sysv/memory.
yeah, interesting... I compared locally outputs of the master branch and https://github.com/dmitrygx/ucx/tree/topic/ucp/ep_reconfig. Thay both provide me the same transports.
Is it possible to collect the output from the following command for master, v1.12.1, and my changed on your machine, pls?
UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ./install/master/bin/ucx_info -eptw -u tra -P intra -A 127.0.0.1
It should help me to understand why UCX changes selection logic. Probably, the difference between the latest UCX and v1.12.1.
Sure, I'm glad I can help :slightly_smiling_face:
master
$ UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ucx_info -eptw -u tra -P intra -A 127.0.0.1
[1651489956.267674] [alpha:33246:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
#
# UCP context
#
# component 0 : self
# component 1 : tcp
# component 2 : sysv
# component 3 : posix
# component 4 : ib
# component 5 : rdmacm
# component 6 : cma
#
# md 0 : component 1 tcp
# md 1 : component 2 sysv
# md 2 : component 3 posix
# md 3 : component 6 cma
#
# resource 0 : md 0 dev 0 flags -- tcp/lo
# resource 1 : md 0 dev 1 flags -- tcp/wlp4s0
# resource 2 : md 1 dev 2 flags -- sysv/memory
# resource 3 : md 2 dev 2 flags -- posix/memory
# resource 4 : md 3 dev 2 flags -- cma/memory
#
# memory: 1.71MB, file descriptors: 6
# create time: 5.444 ms
#
[1651489956.279711] [alpha:33246:0] rdmacm_cm.c:928 UCX DIAG rdma_create_event_channel failed: No such device
[1651489956.279730] [alpha:33246:0] ucp_worker.c:1462 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651489956.279782] [alpha:33246:0] parser.c:1916 UCX INFO UCX_* env variables: UCX_TLS=tcp,shm UCX_LOG_LEVEL=info
[1651489956.286654] [alpha:33246:0] rdmacm_cm.c:928 UCX DIAG rdma_create_event_channel failed: No such device
[1651489956.286672] [alpha:33246:0] ucp_worker.c:1462 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651489956.287403] [alpha:33246:0] ucp_worker.c:1763 UCX INFO ep_cfg[1]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489956.288636] [alpha:33246:0] ucp_worker.c:1763 UCX INFO ep_cfg[2]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489956.289137] [alpha:33246:0] ucp_worker.c:1763 UCX INFO ep_cfg[3]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) amo(sysv/memory posix/memory)
#
# UCP endpoint
#
# peer: 127.0.0.1:48617
# lane[0]: cm rdmacm
# lane[1]: 2:sysv/memory.0 md[1] -> md[1]/sysv/sysdev[255] amo#0 am am_bw#0
# lane[2]: 3:posix/memory.0 md[2] -> md[2]/posix/sysdev[255] amo#1
# lane[3]: 4:cma/memory.0 md[3] -> md[3]/cma/sysdev[255] rma_bw#0
#
# tag_send: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..93..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
# put[1]: 0..<short>..4294967296..<bcopy>..(inf)
# get[1]: 0..<bcopy>..(inf)
# put[2]: 0..<short>..4294967296..<bcopy>..(inf)
# get[2]: 0..<bcopy>..(inf)
#
# rma_bw: mds rndv_rkey_size 9
#
[1651489956.290599] [alpha:33246:0] ucp_ep.c:1402 UCX DIAG ep 0x7f87ec29e0a0: error 'Connection reset by remote peer' on CM lane will not be handled since no error callback is installed
v1.12.1
$ UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ucx_info -eptw -u tra -P intra -A 127.0.0.1
[1651490470.830512] [alpha:61762:0] ucp_context.c:1778 UCX INFO UCP version is 1.12 (release 1)
#
# UCP context
#
# component 0 : posix
# component 1 : sysv
# component 2 : self
# component 3 : tcp
# component 4 : ib
# component 5 : rdmacm
# component 6 : cma
#
# md 0 : component 0 posix
# md 1 : component 1 sysv
# md 2 : component 3 tcp
# md 3 : component 6 cma
#
# resource 0 : md 0 dev 0 flags -- posix/memory
# resource 1 : md 1 dev 0 flags -- sysv/memory
# resource 2 : md 2 dev 1 flags -- tcp/lo
# resource 3 : md 2 dev 2 flags -- tcp/wlp4s0
# resource 4 : md 3 dev 0 flags -- cma/memory
#
# memory: 1.59MB, file descriptors: 0
# create time: 5.597 ms
#
[1651490470.843673] [alpha:61762:0] rdmacm_cm.c:917 UCX DIAG rdma_create_event_channel failed: No such device
[1651490470.843693] [alpha:61762:0] ucp_worker.c:1361 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651490470.843749] [alpha:61762:0] parser.c:1914 UCX INFO UCX_* env variables: UCX_TLS=tcp,shm UCX_LOG_LEVEL=info
[1651490470.850027] [alpha:61762:0] rdmacm_cm.c:917 UCX DIAG rdma_create_event_channel failed: No such device
[1651490470.850049] [alpha:61762:0] ucp_worker.c:1361 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651490470.852298] [alpha:61762:0] ucp_worker.c:1866 UCX INFO ep_cfg[2]: tag(tcp/lo); rma_am(tcp/lo); amo_am(tcp/lo);
[1651490470.852697] [alpha:61762:0] ucp_worker.c:1866 UCX INFO ep_cfg[3]: tag(posix/memory cma/memory); rma(posix/memory sysv/memory); amo(posix/memory sysv/memory);
#
# UCP endpoint
#
# peer: 127.0.0.1:56329
# lane[0]: cm rdmacm
# lane[1]: 0:posix/memory.0 md[0] -> md[0]/posix/sysdev[255] amo#0 am am_bw#0
# lane[2]: 1:sysv/memory.0 md[1] -> md[1]/sysv/sysdev[255] amo#1
# lane[3]: 4:cma/memory.0 md[3] -> md[3]/cma/sysdev[255] rma_bw#0
#
# tag_send: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..93..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
# put[1]: 0..<short>..4294967296..<bcopy>..(inf)
# get[1]: 0..<bcopy>..(inf)
# put[2]: 0..<short>..4294967296..<bcopy>..(inf)
# get[2]: 0..<bcopy>..(inf)
#
# rma_bw: mds rndv_rkey_size 9
#
[1651490470.854063] [alpha:61762:0] ucp_ep.c:1243 UCX DIAG ep 0x7fe859eae098: error 'Connection reset by remote peer' on CM lane will not be handled since no error callback is installed
topic/ucp/ep_reconfig
$ UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ucx_info -eptw -u tra -P intra -A 127.0.0.1
[1651489747.211041] [alpha:28210:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
#
# UCP context
#
# component 0 : self
# component 1 : tcp
# component 2 : sysv
# component 3 : posix
# component 4 : ib
# component 5 : rdmacm
# component 6 : cma
#
# md 0 : component 1 tcp
# md 1 : component 2 sysv
# md 2 : component 3 posix
# md 3 : component 6 cma
#
# resource 0 : md 0 dev 0 flags -- tcp/lo
# resource 1 : md 0 dev 1 flags -- tcp/wlp4s0
# resource 2 : md 1 dev 2 flags -- sysv/memory
# resource 3 : md 2 dev 2 flags -- posix/memory
# resource 4 : md 3 dev 2 flags -- cma/memory
#
# memory: 1.57MB, file descriptors: 5
# create time: 4.408 ms
#
[1651489747.221065] [alpha:28210:0] rdmacm_cm.c:928 UCX DIAG rdma_create_event_channel failed: No such device
[1651489747.226928] [alpha:28210:0] ucp_worker.c:1462 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651489747.227019] [alpha:28210:0] parser.c:1916 UCX INFO UCX_* env variables: UCX_TLS=tcp,shm UCX_LOG_LEVEL=info
[1651489747.232854] [alpha:28210:0] rdmacm_cm.c:928 UCX DIAG rdma_create_event_channel failed: No such device
[1651489747.240294] [alpha:28210:0] ucp_worker.c:1462 UCX DIAG failed to open CM on component rdmacm with status Input/output error
[1651489747.241099] [alpha:28210:0] ucp_worker.c:1763 UCX INFO ep_cfg[1]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489747.242243] [alpha:28210:0] ucp_worker.c:1763 UCX INFO ep_cfg[2]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489747.242732] [alpha:28210:0] ucp_worker.c:1763 UCX INFO ep_cfg[3]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) amo(sysv/memory posix/memory)
#
# UCP endpoint
#
# peer: 127.0.0.1:58117
# lane[0]: cm rdmacm
# lane[1]: 2:sysv/memory.0 md[1] -> md[1]/sysv/sysdev[255] amo#0 am am_bw#0
# lane[2]: 3:posix/memory.0 md[2] -> md[2]/posix/sysdev[255] amo#1
# lane[3]: 4:cma/memory.0 md[3] -> md[3]/cma/sysdev[255] rma_bw#0
#
# tag_send: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..93..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
# put[1]: 0..<short>..4294967296..<bcopy>..(inf)
# get[1]: 0..<bcopy>..(inf)
# put[2]: 0..<short>..4294967296..<bcopy>..(inf)
# get[2]: 0..<bcopy>..(inf)
#
# rma_bw: mds rndv_rkey_size 9
#
[1651489747.244383] [alpha:28210:0] ucp_ep.c:1402 UCX DIAG ep 0x7f114e76e0a0: error 'Connection reset by remote peer' on CM lane will not be handled since no error callback is installed
@krakowski thanks you! It is very helpful. yes, v1.12.1 has a difference with selection between sysv and posix in compare to master and my local branch. It seems that something was changed in the latest UCX.