NA OFI: verbs provider with one HG_CONTEXT and two endpoints does not work properly

Open AdjoiningCat opened this issue 2 years ago • 1 comments

Describe the bug: I have two servers running with ofi+verbs on the same machine, which gave the listen addresses ofi+verbs;ofi_rxm://192.168.161.1:39022 and ofi+verbs;ofi_rxm://192.168.161.1:39043.

On the client side, I created one HG_CLASS and one HG_CONTEXT through HG_Init_opt with the string "ofi+verbs", and two endpoints through HG_Addr_lookup2 with the listen addresses above.
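For readers trying to reproduce this, the client setup described above might look roughly like the sketch below. It is not the reporter's code: it assumes the Mercury 2.1 C API, default init options (a NULL init info), and the two server addresses quoted in this issue. It needs the Mercury headers and library to build, and the lookups will only succeed with both servers running.

```c
#include <stdio.h>
#include <mercury.h>

int main(void)
{
    /* Listen addresses reported by the two servers in this issue. */
    const char *targets[2] = {
        "ofi+verbs;ofi_rxm://192.168.161.1:39022",
        "ofi+verbs;ofi_rxm://192.168.161.1:39043"
    };
    hg_addr_t addrs[2] = {HG_ADDR_NULL, HG_ADDR_NULL};
    hg_class_t *hg_class = NULL;
    hg_context_t *context = NULL;
    hg_return_t ret;
    int i, rc = 1;

    /* One class, client side (not listening), default init options (assumption). */
    hg_class = HG_Init_opt("ofi+verbs", HG_FALSE, NULL);
    if (hg_class == NULL) {
        fprintf(stderr, "HG_Init_opt() failed\n");
        return rc;
    }

    /* One context shared by both endpoints. */
    context = HG_Context_create(hg_class);
    if (context == NULL) {
        fprintf(stderr, "HG_Context_create() failed\n");
        goto cleanup;
    }

    /* Resolve one address per server. */
    for (i = 0; i < 2; i++) {
        ret = HG_Addr_lookup2(hg_class, targets[i], &addrs[i]);
        if (ret != HG_SUCCESS) {
            fprintf(stderr, "lookup of %s failed: %s\n", targets[i],
                HG_Error_to_string(ret));
            goto cleanup;
        }
    }
    rc = 0;

cleanup:
    for (i = 0; i < 2; i++)
        if (addrs[i] != HG_ADDR_NULL)
            HG_Addr_free(hg_class, addrs[i]);
    if (context != NULL)
        HG_Context_destroy(context);
    HG_Finalize(hg_class);
    return rc;
}
```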

Only one endpoint works properly. The other endpoint cannot send messages and fails with: na_ofi_msg_send(): fi_tsend() failed, rc: -9 (Bad file descriptor)

Expected behavior: both endpoints of the client should work.

Platform: mercury version 2.1.0, libfabric version 1.13.2

Environment variables: FI_VERBS_IFACE=ens800f0, FI_PROVIDER=verbs

What puzzles me is that it works properly with libfabric 1.8.1. Is this a known issue, or should I look for an error in my usage?

AdjoiningCat, Jun 24 '22 07:06

I'm not aware of any issue of this type. Usually that message indicates that the peer you are attempting to communicate with is gone. In any case, can you please try updating to a newer libfabric? The latest is 1.15.1.

soumagne, Jun 24 '22 14:06

It finally turned out to be related to our interception system. It works properly if I set up the connections to each server before starting interception.

AdjoiningCat, Nov 21 '22 07:11