ucx icon indicating copy to clipboard operation
ucx copied to clipboard

UCT/IB/UD: Fix connection matching

Open dmitrygx opened this issue 3 years ago • 9 comments

What

Fix connection matching in UD transport.

Why ?

If user destroyed UCT EP, we shouldn't disconenct it, because it could be used for RX operations.

How ?

  1. Introduce TX and RX flags.
  2. If both TX and RX flags are set, endpoint shouldn't be added to CEP.
  3. If EP has TX-only - add to expected connection matching queue, if has RX-only - add to unexpected one.
  4. Send FIN packet if closing won't use TX capability anymore on EP.
  5. Handle FIN packet to remove RX capability on a EP-receiver.

dmitrygx avatar Mar 29 '22 15:03 dmitrygx

@brminich @hoopoepg could you review pls?

dmitrygx avatar Apr 01 '22 10:04 dmitrygx

@yosefe @brminich could you review pls?

dmitrygx avatar Apr 06 '22 11:04 dmitrygx

@brminich @yosefe could you review pls?

dmitrygx avatar Apr 09 '22 14:04 dmitrygx

in general, maybe we don't need these fixes for keepalive? I would expect that when using CM, ud ep will be connected in p2p mode, so it will not use conn_match flow

we need it for CONNECT_TO_WORKER mode anyway. btw, it was bug in Keepalive mode selection (fixed by #8113), we used AM-based keepalive - so we created reply EP to send EP_REMOVED with CONNECT_TO_WORKER mode. I'm testing #8113 to check whether it is enough to fix "ACKed and TX PSNs mismatch".

dmitrygx avatar Apr 10 '22 09:04 dmitrygx

in general, maybe we don't need these fixes for keepalive? I would expect that when using CM, ud ep will be connected in p2p mode, so it will not use conn_match flow

we need it for CONNECT_TO_WORKER mode anyway. btw, it was bug in Keepalive mode selection (fixed by #8113), we used AM-based keepalive - so we created reply EP to send EP_REMOVED with CONNECT_TO_WORKER mode. I'm testing #8113 to check whether it is enough to fix "ACKed and TX PSNs mismatch".

Can we fix the p2p keepalive prior to fixing CONNECT_TO_WORKER mode, to make io_demo pass?

yosefe avatar Apr 10 '22 12:04 yosefe

in general, maybe we don't need these fixes for keepalive? I would expect that when using CM, ud ep will be connected in p2p mode, so it will not use conn_match flow

we need it for CONNECT_TO_WORKER mode anyway. btw, it was bug in Keepalive mode selection (fixed by #8113), we used AM-based keepalive - so we created reply EP to send EP_REMOVED with CONNECT_TO_WORKER mode. I'm testing #8113 to check whether it is enough to fix "ACKed and TX PSNs mismatch".

Can we fix the p2p keepalive prior to fixing CONNECT_TO_WORKER mode, to make io_demo pass?

@yosefe @brminich as discussed offline, we need to add dropping packets if endpoint was disconnected, but we should fix connection matching to implement it

dmitrygx avatar Apr 20 '22 09:04 dmitrygx

In general, conn_match seems not really neeeded for keepalive with CM. We can force packet drop on a disconnected ud endpoint only on p2p-connected endpoints, so it will not affect ep-to-iface scenario.

do you mean checking DISCONNECTED flag for CONNECT_TO_EP endpoints only in 1.13? it will have extra branch in uct_ud_ep_process_rx

dmitrygx avatar Apr 22 '22 08:04 dmitrygx

do you mean checking DISCONNECTED flag for CONNECT_TO_EP endpoints only in 1.13? it will have extra branch in uct_ud_ep_process_rx

Yes: Let's start with a fix to check DISCONNECTED only for connect-to-ep. Maybe we can unite it with the dest_id check, by changing ep->dest_id of a disconnected p2p endpoint to an invalid value (instead of setting a DISCONNECTED flag). Make sure it solves the timeouts issue and port it to v1.13.x I think the full connecting matching it's a complex change in a "sensitive" place that may break other use cases, so better postpone it for now. For example maybe need to move this logic to another layer, make sure that we don't send more wireup packets now, etc.

yosefe avatar Apr 22 '22 09:04 yosefe

Let's start with a fix to check DISCONNECTED only for connect-to-ep. Maybe we can unite it with the dest_id check, by changing ep->dest_id of a disconnected p2p endpoint to an invalid value (instead of setting a DISCONNECTED flag). Make sure it solves the timeouts issue and port it to v1.13.x

submitted #8151 which checks CONNECT_TO_EP | DISCONNECT flags to make a decision about dropping RX packet. currently, testing it with IO-demo reproducer of "timeout waiting for N replies".

dmitrygx avatar Apr 22 '22 10:04 dmitrygx