ucx icon indicating copy to clipboard operation
ucx copied to clipboard

Failed assertion using active messaging in multi-threaded environment

Open krakowski opened this issue 2 years ago • 15 comments

Describe the bug

Using active messaging (ucp_am_send_nbx) on the same ~(TCP)~ endpoint from withing two different threads results in a failed assertion, although the library is compiled with multithreading support and all corresponding flags are set.

proto_am.inl:257  Assertion `max_iov > 1' failed
==== backtrace (tid:  33396) ====
 0  /usr/lib/libucs.so.0(ucs_handle_error+0x2c4) [0x7ffff786a404]
 1  /usr/lib/libucs.so.0(ucs_fatal_error_message+0xc9) [0x7ffff7866f09]
 2  /usr/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7ffff7866ff1]
 3  /usr/lib/libucp.so.0(+0x20f18) [0x7ffff7e4df18]
 4  /usr/lib/libucp.so.0(ucp_wireup_replay_pending_requests+0x49) [0x7ffff7f21839]
 5  /usr/lib/libucp.so.0(ucp_wireup_init_lanes+0x1822) [0x7ffff7f232c2]
 6  /usr/lib/libucp.so.0(+0xf7ba5) [0x7ffff7f24ba5]
 7  /usr/lib/libucp.so.0(+0xf817e) [0x7ffff7f2517e]
 8  /usr/lib/libuct.so.0(+0x2752f) [0x7ffff78d552f]
 9  /usr/lib/libuct.so.0(+0x2ce66) [0x7ffff78dae66]
10  /usr/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7ffff7878eb1]
11  /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x8f) [0x7ffff78d9aaf]
12  /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7ffff7e7f31a]

Context Parameters

ucp_params_t context_params{};
context_params.features          = UCP_FEATURE_AM;
context_params.mt_workers_shared = 1;
context_params.field_mask        = UCP_PARAM_FIELD_FEATURES | UCP_PARAM_FIELD_MT_WORKERS_SHARED;

Worker Parameters

ucp_worker_params_t worker_params{};
worker_params.thread_mode = UCS_THREAD_MODE_MULTI;
worker_params.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;

Steps to Reproduce

  • UCX version 1.12.1
  • Configured with contrib/configure-devel (which contains --enable-mt)

Setup and versions

  • TCP Connection (no InfiniBand hardware)
  • Linux 5.17.1-3-MANJARO (x86_64)
  • rdma-core 39.0

krakowski avatar Apr 28 '22 17:04 krakowski

I created a small minimal working example program where each of two threads sends messages using ucp_am_send_nbx.

https://gist.github.com/krakowski/8a30c6bf3e94b727e9e1ac268cbb8045

The assertion does not always fail, as it works most of the time. If it fails, then it is almost always at the beginning of the program which makes me think if this is related to connection establishment happening at the same time from both threads due to bad timing. Attaching gdb after the crash shows the following.

Screenshot from 2022-04-29 17-09-52

krakowski avatar Apr 29 '22 15:04 krakowski

@krakowski I tried to reproduce the issue using you reproducer and setting UCX_TLS=tcp. But it works fine with 1.12.1 and master.

dmitrygx avatar Apr 29 '22 19:04 dmitrygx

@dmitrygx Thanks for looking into this!

It works for me most of the time, too. I'm not really sure how I can reproduce this reliably. If I hit the right timing it crashes with the mentioned failed assertion.

I noticed that the connection gets established as soon as the first message is sent out. Could there be some synchronization error regarding connection establishment happening in two separate threads at the same time?

krakowski avatar Apr 29 '22 19:04 krakowski

It works for me most of the time, too.

how many iterations does it take to catch this?

I noticed that the connection gets established as soon as the first message is sent out. Could there be some synchronization error regarding connection establishment happening in two separate threads at the same time?

yes, it could be, but I checked that we acquire locks when processing connection packets and as a result reposting user's AM operations.

dmitrygx avatar Apr 29 '22 20:04 dmitrygx

@dmitrygx I just updated my gist. This version crashes every time I try to run it.

https://gist.github.com/krakowski/8a30c6bf3e94b727e9e1ac268cbb8045

Things I changed:

  • Use the same buffer for ucp_am_send_nbxs header and buffer parameters. In the first version I just sent the header.
  • Use 8 Threads for sending messages.
  • Make all threads wait just in front of the first ucp_am_send_nbx call.

krakowski avatar Apr 29 '22 20:04 krakowski

@krakowski thanks! but it still works fine using this dc9243556ec10178fc9fc2f8be5e17f2bf79cd19 commit. see the output of the client (it is mixed due to threads contention):

$UCX_TLS=tcp ./ucp_am_mt client
>> Initializing context
>> Creating worker
>> Establishing connection
>> Connection established
>> Sending messages
[[[[140596752672512[140596744279808] Sent 140596710708992] Sent ] Sent 140596838344448] Sent 0 messages0 messages
140596719101696] Sent 0 messages
0 messages

[140596727494400] Sent 0 messages0 messages

[140596761065216] Sent 0 messages
[140596735887104] Sent 0 messages
[140596838344448] Sent 1000000 messages
[140596719101696] Sent 1000000 messages
[140596761065216] Sent 1000000 messages
[140596761065216] Sent 2000000 messages
[140596838344448] Sent 2000000 messages
[140596719101696] Sent 2000000 messages
[140596727494400] Sent 1000000 messages
[140596752672512] Sent 1000000 messages
[140596719101696] Sent 3000000 messages
[140596761065216] Sent 3000000 messages
[140596838344448] Sent 3000000 messages
[140596719101696] Sent 4000000 messages
[140596727494400] Sent 2000000 messages
[140596761065216] Sent 4000000 messages
[140596838344448] Sent 4000000 messages
[140596752672512] Sent 2000000 messages
[140596710708992] Sent 1000000 messages
[140596727494400] Sent 3000000 messages
[140596710708992] Sent 2000000 messages
[140596710708992] Sent 3000000 messages
[140596710708992] Sent 4000000 messages
[140596727494400] Sent 4000000 messages
[140596744279808] Sent 1000000 messages
[140596752672512] Sent 3000000 messages
[140596744279808] Sent 2000000 messages
[140596744279808] Sent 3000000 messages
[140596744279808] Sent 4000000 messages
[140596752672512] Sent 4000000 messages
[140596735887104] Sent 1000000 messages
[140596735887104] Sent 2000000 messages
[140596735887104] Sent 3000000 messages
[140596735887104] Sent 4000000 messages
>> Finished. Shutting down...

dmitrygx avatar Apr 29 '22 20:04 dmitrygx

Strange... I will try to debug it further tomorrow to see why the assertion fails on my machine.

I also could not really find out what max_iov stands for. Can you please explain briefly what the assertion checks? This could help me in tracking down what is happening. Thanks for helping out! 🙂

krakowski avatar Apr 29 '22 20:04 krakowski

@dmitrygx I managed to narrow down the problem a bit more. The crash does only occur, when UCX chooses shm and tcp as transports. Running the demo app with UCX_TLS=tcp works for me, too. As soon as I add shm (UCX_TLS=shm,tcp), it crashes immediately. Is this expected to happen with this kind of configuration? I started the demo app without explicitly specifying the transport (assuming it would choose tcp), in which case it obviously choose shm,tcp (due to UCX_TLS=all) automatically based on my local available transports.

$ UCX_TLS=shm,tcp ./openucx_examples client
>> Initializing context
>> Creating worker
>> Establishing connection
>> Connection established
>> Sending messages

proto_am.inl:257  Assertion `max_iov > 1' failed
==== backtrace (tid:   3337) ====
 0  /usr/lib/libucs.so.0(ucs_handle_error+0x2c4) [0x7f568b854404]
 1  /usr/lib/libucs.so.0(ucs_fatal_error_message+0xc9) [0x7f568b850f09]
 2  /usr/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7f568b850ff1]
 3  /usr/lib/libucp.so.0(+0x20f18) [0x7f568be37f18]
 4  /usr/lib/libucp.so.0(ucp_wireup_replay_pending_requests+0x49) [0x7f568bf0b839]
 5  /usr/lib/libucp.so.0(ucp_wireup_init_lanes+0x1822) [0x7f568bf0d2c2]
 6  /usr/lib/libucp.so.0(+0xf7ba5) [0x7f568bf0eba5]
 7  /usr/lib/libucp.so.0(+0xf817e) [0x7f568bf0f17e]
 8  /usr/lib/libuct.so.0(+0x2752f) [0x7f568b8bf52f]
 9  /usr/lib/libuct.so.0(+0x2ce66) [0x7f568b8c4e66]
10  /usr/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7f568b862eb1]
11  /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x8f) [0x7f568b8c3aaf]
12  /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7f568be6931a]
13  ./openucx_examples(+0x23ac) [0x55929bda73ac]
14  ./openucx_examples(+0x245c) [0x55929bda745c]
15  ./openucx_examples(+0x3e4d) [0x55929bda8e4d]
16  ./openucx_examples(+0x3d48) [0x55929bda8d48]
17  ./openucx_examples(+0x3c3f) [0x55929bda8c3f]
18  ./openucx_examples(+0x3bc0) [0x55929bda8bc0]
19  ./openucx_examples(+0x3ba4) [0x55929bda8ba4]
20  /usr/lib/libstdc++.so.6(+0xd54d4) [0x7f568bcc64d4]
21  /usr/lib/libc.so.6(+0x8d5c2) [0x7f568b9715c2]
22  /usr/lib/libc.so.6(clone+0x44) [0x7f568b9f6584]
=================================

krakowski avatar Apr 30 '22 08:04 krakowski

I also could not really find out what max_iov stands for. Can you please explain briefly what the assertion checks?

it expects that transport's supported number of IOVs will be greater than 1, since we need to pack user's header to the IOV. but the problem not in the max_iov, the problem is in EP reconfiguration from one thread while another thread relies on the previous configuration.

The crash does only occur, when UCX chooses shm and tcp as transports.

I tried UCX_TLS=shm,tcp, but it still works fine :)

$UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./ucp_am_mt client
>> Initializing context
[1651306477.580326] [swx-ucx01:420  :0]     ucp_context.c:1778 UCX  INFO  UCP version is 1.12 (release 1)
>> Creating worker
[1651306477.626438] [swx-ucx01:420  :0]          parser.c:1914 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_TLS=shm,tcp
>> Establishing connection
>> Connection established
>> Sending messages
[1651306477.632434] [swx-ucx01:420  :a]       rdmacm_cm.c:351  UCX  DIAG  [cep 0x1bea840 127.0.0.1:55822->127.0.0.1:2998 client Success]: rdma_resolve_route failed: No such device
[1651306477.632511] [swx-ucx01:420  :a]          uct_cm.c:99   UCX  DIAG  resolve callback failed with error: Destination is unreachable
[1651306477.635771] [swx-ucx01:420  :1]      ucp_worker.c:1866 UCX  INFO      ep_cfg[2]: tag(tcp/lo); rma_am(tcp/lo); am(tcp/lo);
[1651306477.636067] [swx-ucx01:420  :2]      ucp_worker.c:1866 UCX  INFO    ep_cfg[3]: tag(xpmem/memory posix/memory cma/memory); rma(xpmem/memory); am(xpmem/memory posix/memory cma/memory);
[139999265670912] Sent 0 messages
[139999282456320] Sent 0 messages
[139999366797056] Sent 0 messages
[139999375189760] Sent [139999240972032] Sent 00 messages messages

[139999290849024] Sent 0 messages
[139999274063616] Sent 0 messages
[139999383582464] Sent 0 messages
[139999265670912] Sent 1000000 messages
[139999366797056] Sent 1000000 messages
[139999265670912] Sent 2000000 messages
[139999366797056] Sent 2000000 messages
[139999265670912] Sent 3000000 messages
[139999375189760] Sent 1000000 messages
[139999282456320] Sent 1000000 messages
[139999240972032] Sent 1000000 messages
[139999366797056] Sent 3000000 messages
[139999265670912] Sent 4000000 messages
[139999290849024] Sent 1000000 messages
[139999282456320] Sent 2000000 messages
[139999366797056] Sent 4000000 messages
[139999282456320] Sent 3000000 messages
[139999375189760] Sent 2000000 messages
[139999383582464] Sent 1000000 messages
[139999290849024] Sent 2000000 messages
[139999282456320] Sent 4000000 messages
[139999240972032] Sent 2000000 messages
[139999383582464] Sent 2000000 messages
[139999383582464] Sent 3000000 messages
[139999290849024] Sent 3000000 messages
[139999383582464] Sent 4000000 messages
[139999290849024] Sent 4000000 messages
[139999375189760] Sent 3000000 messages
[139999274063616] Sent 1000000 messages
[139999240972032] Sent 3000000 messages
[139999240972032] Sent 4000000 messages
[139999375189760] Sent 4000000 messages
[139999274063616] Sent 2000000 messages
[139999274063616] Sent 3000000 messages
[139999274063616] Sent 4000000 messages
>> Finished. Shutting down...

Could pls send an output with UCX_LOG_LEVEL=info set? I'm going to compare our outputs to try understand that difference between our tests.

dmitrygx avatar Apr 30 '22 08:04 dmitrygx

~Sure, here is the output:~

Edit


I updated the demo to use UCX's logging functions for a more readable output : https://gist.github.com/krakowski/8a30c6bf3e94b727e9e1ac268cbb8045

Here's the updated output:

Client Side

$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples client
[1651312607.198865] [alpha:9936 :0]          main.cpp:210  UCX  INFO  Initializing context...
[1651312607.198894] [alpha:9936 :0]     ucp_context.c:1778 UCX  INFO  UCP version is 1.12 (release 1)
[1651312607.202592] [alpha:9936 :0]          main.cpp:223  UCX  INFO  Creating worker...
[1651312607.207853] [alpha:9936 :0]       rdmacm_cm.c:917  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651312607.207877] [alpha:9936 :0]      ucp_worker.c:1361 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651312607.207928] [alpha:9936 :0]          parser.c:1914 UCX  INFO  UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651312607.207965] [alpha:9936 :0]          main.cpp:72   UCX  INFO  Initializing endpoint...
[1651312607.208193] [alpha:9936 :0]          main.cpp:89   UCX  INFO  Endpoint initialized
[1651312607.219992] [alpha:9936 :1]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.219994] [alpha:9936 :2]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.220157] [alpha:9936 :4]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.220000] [alpha:9936 :3]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.222179] [alpha:9936 :5]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.225061] [alpha:9936 :6]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.232107] [alpha:9936 :7]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.238712] [alpha:9936 :8]          main.cpp:54   UCX  INFO  Sending messages...
[1651312607.266438] [alpha:9936 :1]      ucp_worker.c:1866 UCX  INFO      ep_cfg[2]: tag(tcp/lo); rma_am(tcp/lo); am(tcp/lo); 
[1651312607.272714] [alpha:9936 :8]      ucp_worker.c:1866 UCX  INFO    ep_cfg[3]: tag(posix/memory cma/memory); rma(posix/memory sysv/memory); am(posix/memory cma/memory); 
[alpha:9936 :8:9944]  proto_am.inl:257  Assertion `max_iov > 1' failed
==== backtrace (tid:   9944) ====
 0  /usr/lib/libucs.so.0(ucs_handle_error+0x2c4) [0x7faf78230404]
 1  /usr/lib/libucs.so.0(ucs_fatal_error_message+0xc9) [0x7faf7822cf09]
 2  /usr/lib/libucs.so.0(ucs_fatal_error_format+0xe1) [0x7faf7822cff1]
 3  /usr/lib/libucp.so.0(+0x20f18) [0x7faf78292f18]
 4  /usr/lib/libucp.so.0(ucp_wireup_replay_pending_requests+0x49) [0x7faf78366839]
 5  /usr/lib/libucp.so.0(ucp_wireup_init_lanes+0x1822) [0x7faf783682c2]
 6  /usr/lib/libucp.so.0(+0xf7ba5) [0x7faf78369ba5]
 7  /usr/lib/libucp.so.0(+0xf817e) [0x7faf7836a17e]
 8  /usr/lib/libuct.so.0(+0x2752f) [0x7faf77ca252f]
 9  /usr/lib/libuct.so.0(+0x2ce66) [0x7faf77ca7e66]
10  /usr/lib/libucs.so.0(ucs_event_set_wait+0x101) [0x7faf7823eeb1]
11  /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x8f) [0x7faf77ca6aaf]
12  /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7faf782c431a]
13  ./openucx_examples(+0x23b6) [0x561e2b1a63b6]
14  ./openucx_examples(+0x24b2) [0x561e2b1a64b2]
15  ./openucx_examples(+0x4041) [0x561e2b1a8041]
16  ./openucx_examples(+0x3f3c) [0x561e2b1a7f3c]
17  ./openucx_examples(+0x3e33) [0x561e2b1a7e33]
18  ./openucx_examples(+0x3db4) [0x561e2b1a7db4]
19  ./openucx_examples(+0x3d98) [0x561e2b1a7d98]
20  /usr/lib/libstdc++.so.6(+0xd54d4) [0x7faf780ab4d4]
21  /usr/lib/libc.so.6(+0x8d5c2) [0x7faf77d565c2]
22  /usr/lib/libc.so.6(clone+0x44) [0x7faf77ddb584]
=================================
[alpha:9936 :8:9944] Process frozen, press Enter to attach a debugger...

Server Side

$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples server
[1651312601.072455] [alpha:9933 :0]          main.cpp:210  UCX  INFO  Initializing context...
[1651312601.072496] [alpha:9933 :0]     ucp_context.c:1778 UCX  INFO  UCP version is 1.12 (release 1)
[1651312601.078238] [alpha:9933 :0]          main.cpp:223  UCX  INFO  Creating worker...
[1651312601.086293] [alpha:9933 :0]       rdmacm_cm.c:917  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651312601.086315] [alpha:9933 :0]      ucp_worker.c:1361 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651312601.086375] [alpha:9933 :0]          parser.c:1914 UCX  INFO  UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651312601.086417] [alpha:9933 :0]          main.cpp:111  UCX  INFO  Registering active message handler...
[1651312601.086425] [alpha:9933 :0]          main.cpp:135  UCX  INFO  Initializing listener...
[1651312601.086472] [alpha:9933 :0]          main.cpp:161  UCX  INFO  Waiting on client connection...
[1651312607.249639] [alpha:9933 :0]          main.cpp:168  UCX  INFO  Accepting client connection...
[1651312607.249892] [alpha:9933 :0]      ucp_worker.c:1866 UCX  INFO    ep_cfg[0]: tag(tcp/lo); rma_am(tcp/lo); am(tcp/lo); 
[1651312607.257064] [alpha:9933 :0]          main.cpp:180  UCX  INFO  Connection established

krakowski avatar Apr 30 '22 09:04 krakowski

@krakowski thansk for collecting inforamtion and helping me with the reproduction! I still can't reproduce the issue locally, but I understand what's wrong now. Actually, it seems that it is not related to the multi-threaded usage of UCX. The problem is reconfiguration from TCP to SHM transport which don't support ZCOPY protocol. So, messages submitted before recinfiguring to SHM choose ZCOPY. because TCP supports this protocol. But then they can't be sent through SHM. Hence, we got reasonable assertion that max_iov is 0.

I prepared a possible solution which could help you - https://github.com/dmitrygx/ucx/tree/topic/ucp/ep_reconfig. Could you try it and share a feedback, please?

dmitrygx avatar May 01 '22 08:05 dmitrygx

@dmitrygx Your solution works! I noticed that the first transport lane after reconfiguration changed from posix/memory to sysv/memory. Here's the output:

Client Side

$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples client
[1651415190.175301] [alpha:35406:0]          main.cpp:210  UCX  INFO  Initializing context...
[1651415190.175345] [alpha:35406:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
[1651415190.179184] [alpha:35406:0]          main.cpp:223  UCX  INFO  Creating worker...
[1651415190.185263] [alpha:35406:0]       rdmacm_cm.c:928  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651415190.185281] [alpha:35406:0]      ucp_worker.c:1462 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651415190.185320] [alpha:35406:0]          parser.c:1916 UCX  INFO  UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651415190.185346] [alpha:35406:0]          main.cpp:72   UCX  INFO  Initializing endpoint...
[1651415190.185536] [alpha:35406:0]          main.cpp:89   UCX  INFO  Endpoint initialized
[1651415190.199061] [alpha:35406:1]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.199061] [alpha:35406:2]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.199684] [alpha:35406:2]      ucp_worker.c:1763 UCX  INFO  ep_cfg[1]: tag(tcp/lo) rma_am(tcp/lo) am(tcp/lo)
[1651415190.200091] [alpha:35406:7]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.199062] [alpha:35406:3]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.199071] [alpha:35406:4]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.199077] [alpha:35406:5]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.199089] [alpha:35406:6]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.231238] [alpha:35406:8]          main.cpp:54   UCX  INFO  Sending messages...
[1651415190.271453] [alpha:35406:1]      ucp_worker.c:1763 UCX  INFO      ep_cfg[2]: tag(tcp/lo) rma_am(tcp/lo) am(tcp/lo)
[1651415190.278034] [alpha:35406:1]      ucp_worker.c:1763 UCX  INFO    ep_cfg[3]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) am(sysv/memory cma/memory)
[1651415190.279599] [alpha:35406:5]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415190.279602] [alpha:35406:1]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415190.280657] [alpha:35406:2]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415190.281234] [alpha:35406:7]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415190.285882] [alpha:35406:8]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415190.285882] [alpha:35406:6]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415190.324573] [alpha:35406:4]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415190.384571] [alpha:35406:3]          main.cpp:63   UCX  INFO  Sent 0 messages
[1651415194.693555] [alpha:35406:2]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415197.677911] [alpha:35406:1]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415198.418036] [alpha:35406:4]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415198.886526] [alpha:35406:6]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415199.114267] [alpha:35406:3]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415199.166545] [alpha:35406:5]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415199.724646] [alpha:35406:8]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415200.636522] [alpha:35406:7]          main.cpp:63   UCX  INFO  Sent 1000000 messages
[1651415200.882846] [alpha:35406:2]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415204.444358] [alpha:35406:4]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415204.816403] [alpha:35406:5]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415206.214550] [alpha:35406:1]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415206.385602] [alpha:35406:8]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415206.975833] [alpha:35406:2]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415207.601522] [alpha:35406:3]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415208.547422] [alpha:35406:6]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415208.750911] [alpha:35406:7]          main.cpp:63   UCX  INFO  Sent 2000000 messages
[1651415211.744112] [alpha:35406:5]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415212.593172] [alpha:35406:1]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415213.877344] [alpha:35406:8]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415214.351052] [alpha:35406:4]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415214.617614] [alpha:35406:3]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415215.926639] [alpha:35406:7]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415215.929791] [alpha:35406:6]          main.cpp:63   UCX  INFO  Sent 3000000 messages
[1651415216.351539] [alpha:35406:2]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415217.607853] [alpha:35406:2]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415218.577598] [alpha:35406:5]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415218.647513] [alpha:35406:4]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415219.015270] [alpha:35406:5]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415219.259772] [alpha:35406:7]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415219.407499] [alpha:35406:4]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415219.427268] [alpha:35406:1]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415219.530426] [alpha:35406:7]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415219.668728] [alpha:35406:1]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415219.710800] [alpha:35406:6]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415219.767727] [alpha:35406:3]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415219.819378] [alpha:35406:6]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415219.825500] [alpha:35406:3]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415219.832556] [alpha:35406:8]          main.cpp:63   UCX  INFO  Sent 4000000 messages
[1651415219.846594] [alpha:35406:8]          main.cpp:67   UCX  INFO  Finished sending messages
[1651415219.846683] [alpha:35406:0]          main.cpp:106  UCX  INFO  Finished. Shutting down...

Server Side

$ UCX_LOG_LEVEL=info UCX_TLS=shm,tcp ./openucx_examples server
[1651415187.472526] [alpha:35403:0]          main.cpp:210  UCX  INFO  Initializing context...
[1651415187.472564] [alpha:35403:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
[1651415187.478646] [alpha:35403:0]          main.cpp:223  UCX  INFO  Creating worker...
[1651415187.484677] [alpha:35403:0]       rdmacm_cm.c:928  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651415187.484693] [alpha:35403:0]      ucp_worker.c:1462 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651415187.484730] [alpha:35403:0]          parser.c:1916 UCX  INFO  UCX_* env variables: UCX_TLS=shm,tcp UCX_LOG_LEVEL=info
[1651415187.484754] [alpha:35403:0]          main.cpp:111  UCX  INFO  Registering active message handler...
[1651415187.484762] [alpha:35403:0]          main.cpp:135  UCX  INFO  Initializing listener...
[1651415187.484791] [alpha:35403:0]          main.cpp:161  UCX  INFO  Waiting on client connection...
[1651415190.251460] [alpha:35403:0]          main.cpp:168  UCX  INFO  Accepting client connection...
[1651415190.251582] [alpha:35403:0]      ucp_worker.c:1763 UCX  INFO    ep_cfg[0]: tag(tcp/lo) rma_am(tcp/lo) am(tcp/lo)
[1651415190.252798] [alpha:35403:0]          main.cpp:180  UCX  INFO  Connection established
[1651415190.278317] [alpha:35403:0]      ucp_worker.c:1763 UCX  INFO    ep_cfg[1]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) am(sysv/memory cma/memory)
[1651415219.846685] [alpha:35403:0]          main.cpp:187  UCX  INFO  Received last message. Shutting down...

krakowski avatar May 01 '22 14:05 krakowski

@krakowski thanks for testing it! I'll think about upstreaming it.

I noticed that the first transport lane after reconfiguration changed from posix/memory to sysv/memory.

yeah, interesting... I compared locally outputs of the master branch and https://github.com/dmitrygx/ucx/tree/topic/ucp/ep_reconfig. Thay both provide me the same transports.

Is it possible to collect the output from the following command for master, v1.12.1, and my changed on your machine, pls?

UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ./install/master/bin/ucx_info -eptw -u tra -P intra -A 127.0.0.1

It should help me to understand why UCX changes selection logic. Probably, the difference between the latest UCX and v1.12.1.

dmitrygx avatar May 02 '22 10:05 dmitrygx

Sure, I'm glad I can help :slightly_smiling_face:

master

$ UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ucx_info -eptw -u tra -P intra -A 127.0.0.1
[1651489956.267674] [alpha:33246:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
#
# UCP context
#
#     component 0  :  self
#     component 1  :  tcp
#     component 2  :  sysv
#     component 3  :  posix
#     component 4  :  ib
#     component 5  :  rdmacm
#     component 6  :  cma
#
#            md 0  :  component 1  tcp 
#            md 1  :  component 2  sysv 
#            md 2  :  component 3  posix 
#            md 3  :  component 6  cma 
#
#      resource 0  :  md 0  dev 0  flags -- tcp/lo
#      resource 1  :  md 0  dev 1  flags -- tcp/wlp4s0
#      resource 2  :  md 1  dev 2  flags -- sysv/memory
#      resource 3  :  md 2  dev 2  flags -- posix/memory
#      resource 4  :  md 3  dev 2  flags -- cma/memory
#
# memory: 1.71MB, file descriptors: 6
# create time: 5.444 ms
#
[1651489956.279711] [alpha:33246:0]       rdmacm_cm.c:928  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651489956.279730] [alpha:33246:0]      ucp_worker.c:1462 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651489956.279782] [alpha:33246:0]          parser.c:1916 UCX  INFO  UCX_* env variables: UCX_TLS=tcp,shm UCX_LOG_LEVEL=info
[1651489956.286654] [alpha:33246:0]       rdmacm_cm.c:928  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651489956.286672] [alpha:33246:0]      ucp_worker.c:1462 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651489956.287403] [alpha:33246:0]      ucp_worker.c:1763 UCX  INFO  ep_cfg[1]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489956.288636] [alpha:33246:0]      ucp_worker.c:1763 UCX  INFO    ep_cfg[2]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489956.289137] [alpha:33246:0]      ucp_worker.c:1763 UCX  INFO    ep_cfg[3]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) amo(sysv/memory posix/memory)
#
# UCP endpoint 
#
#               peer: 127.0.0.1:48617
#                 lane[0]: cm rdmacm
#                 lane[1]:  2:sysv/memory.0 md[1]           -> md[1]/sysv/sysdev[255] amo#0 am am_bw#0
#                 lane[2]:  3:posix/memory.0 md[2]          -> md[2]/posix/sysdev[255] amo#1
#                 lane[3]:  4:cma/memory.0 md[3]            -> md[3]/cma/sysdev[255] rma_bw#0
#
#                tag_send: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..93..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
#                  put[1]: 0..<short>..4294967296..<bcopy>..(inf)
#                  get[1]: 0..<bcopy>..(inf)
#                  put[2]: 0..<short>..4294967296..<bcopy>..(inf)
#                  get[2]: 0..<bcopy>..(inf)
#
#                  rma_bw: mds rndv_rkey_size 9
#
[1651489956.290599] [alpha:33246:0]          ucp_ep.c:1402 UCX  DIAG  ep 0x7f87ec29e0a0: error 'Connection reset by remote peer' on CM lane will not be handled since no error callback is installed

v1.12.1

$ UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ucx_info -eptw -u tra -P intra -A 127.0.0.1
[1651490470.830512] [alpha:61762:0]     ucp_context.c:1778 UCX  INFO  UCP version is 1.12 (release 1)
#
# UCP context
#
#     component 0  :  posix
#     component 1  :  sysv
#     component 2  :  self
#     component 3  :  tcp
#     component 4  :  ib
#     component 5  :  rdmacm
#     component 6  :  cma
#
#            md 0  :  component 0  posix 
#            md 1  :  component 1  sysv 
#            md 2  :  component 3  tcp 
#            md 3  :  component 6  cma 
#
#      resource 0  :  md 0  dev 0  flags -- posix/memory
#      resource 1  :  md 1  dev 0  flags -- sysv/memory
#      resource 2  :  md 2  dev 1  flags -- tcp/lo
#      resource 3  :  md 2  dev 2  flags -- tcp/wlp4s0
#      resource 4  :  md 3  dev 0  flags -- cma/memory
#
# memory: 1.59MB, file descriptors: 0
# create time: 5.597 ms
#
[1651490470.843673] [alpha:61762:0]       rdmacm_cm.c:917  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651490470.843693] [alpha:61762:0]      ucp_worker.c:1361 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651490470.843749] [alpha:61762:0]          parser.c:1914 UCX  INFO  UCX_* env variables: UCX_TLS=tcp,shm UCX_LOG_LEVEL=info
[1651490470.850027] [alpha:61762:0]       rdmacm_cm.c:917  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651490470.850049] [alpha:61762:0]      ucp_worker.c:1361 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651490470.852298] [alpha:61762:0]      ucp_worker.c:1866 UCX  INFO    ep_cfg[2]: tag(tcp/lo); rma_am(tcp/lo); amo_am(tcp/lo); 
[1651490470.852697] [alpha:61762:0]      ucp_worker.c:1866 UCX  INFO    ep_cfg[3]: tag(posix/memory cma/memory); rma(posix/memory sysv/memory); amo(posix/memory sysv/memory); 
#
# UCP endpoint 
#
#               peer: 127.0.0.1:56329
#                 lane[0]: cm rdmacm
#                 lane[1]:  0:posix/memory.0 md[0]          -> md[0]/posix/sysdev[255] amo#0 am am_bw#0
#                 lane[2]:  1:sysv/memory.0 md[1]           -> md[1]/sysv/sysdev[255] amo#1
#                 lane[3]:  4:cma/memory.0 md[3]            -> md[3]/cma/sysdev[255] rma_bw#0
#
#                tag_send: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..93..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
#                  put[1]: 0..<short>..4294967296..<bcopy>..(inf)
#                  get[1]: 0..<bcopy>..(inf)
#                  put[2]: 0..<short>..4294967296..<bcopy>..(inf)
#                  get[2]: 0..<bcopy>..(inf)
#
#                  rma_bw: mds rndv_rkey_size 9
#
[1651490470.854063] [alpha:61762:0]          ucp_ep.c:1243 UCX  DIAG  ep 0x7fe859eae098: error 'Connection reset by remote peer' on CM lane will not be handled since no error callback is installed

topic/ucp/ep_reconfig

$ UCX_TLS=tcp,shm UCX_LOG_LEVEL=info ucx_info -eptw -u tra -P intra -A 127.0.0.1
[1651489747.211041] [alpha:28210:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /usr/lib/libucp.so.0)
#
# UCP context
#
#     component 0  :  self
#     component 1  :  tcp
#     component 2  :  sysv
#     component 3  :  posix
#     component 4  :  ib
#     component 5  :  rdmacm
#     component 6  :  cma
#
#            md 0  :  component 1  tcp 
#            md 1  :  component 2  sysv 
#            md 2  :  component 3  posix 
#            md 3  :  component 6  cma 
#
#      resource 0  :  md 0  dev 0  flags -- tcp/lo
#      resource 1  :  md 0  dev 1  flags -- tcp/wlp4s0
#      resource 2  :  md 1  dev 2  flags -- sysv/memory
#      resource 3  :  md 2  dev 2  flags -- posix/memory
#      resource 4  :  md 3  dev 2  flags -- cma/memory
#
# memory: 1.57MB, file descriptors: 5
# create time: 4.408 ms
#
[1651489747.221065] [alpha:28210:0]       rdmacm_cm.c:928  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651489747.226928] [alpha:28210:0]      ucp_worker.c:1462 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651489747.227019] [alpha:28210:0]          parser.c:1916 UCX  INFO  UCX_* env variables: UCX_TLS=tcp,shm UCX_LOG_LEVEL=info
[1651489747.232854] [alpha:28210:0]       rdmacm_cm.c:928  UCX  DIAG  rdma_create_event_channel failed: No such device
[1651489747.240294] [alpha:28210:0]      ucp_worker.c:1462 UCX  DIAG  failed to open CM on component rdmacm with status Input/output error
[1651489747.241099] [alpha:28210:0]      ucp_worker.c:1763 UCX  INFO  ep_cfg[1]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489747.242243] [alpha:28210:0]      ucp_worker.c:1763 UCX  INFO    ep_cfg[2]: tag(tcp/lo) rma_am(tcp/lo) amo_am(tcp/lo)
[1651489747.242732] [alpha:28210:0]      ucp_worker.c:1763 UCX  INFO    ep_cfg[3]: tag(sysv/memory cma/memory) rma(sysv/memory posix/memory) amo(sysv/memory posix/memory)
#
# UCP endpoint 
#
#               peer: 127.0.0.1:58117
#                 lane[0]: cm rdmacm
#                 lane[1]:  2:sysv/memory.0 md[1]           -> md[1]/sysv/sysdev[255] amo#0 am am_bw#0
#                 lane[2]:  3:posix/memory.0 md[2]          -> md[2]/posix/sysdev[255] amo#1
#                 lane[3]:  4:cma/memory.0 md[3]            -> md[3]/cma/sysdev[255] rma_bw#0
#
#                tag_send: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..93..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..93..<egr/bcopy>..8256..<rndv>..(inf)
#                  put[1]: 0..<short>..4294967296..<bcopy>..(inf)
#                  get[1]: 0..<bcopy>..(inf)
#                  put[2]: 0..<short>..4294967296..<bcopy>..(inf)
#                  get[2]: 0..<bcopy>..(inf)
#
#                  rma_bw: mds rndv_rkey_size 9
#
[1651489747.244383] [alpha:28210:0]          ucp_ep.c:1402 UCX  DIAG  ep 0x7f114e76e0a0: error 'Connection reset by remote peer' on CM lane will not be handled since no error callback is installed

krakowski avatar May 02 '22 11:05 krakowski

@krakowski thanks you! It is very helpful. yes, v1.12.1 has a difference with selection between sysv and posix in compare to master and my local branch. It seems that something was changed in the latest UCX.

dmitrygx avatar May 02 '22 12:05 dmitrygx