
create qp: failed on ibv_cmd_create_qp with 22

Open edofrederix opened this issue 4 years ago • 23 comments

Describe the bug

I'm trying to get UCX going with OpenMPI so that I can use my QLogic FastLinQ QL41000 cards with RoCE. OpenMPI without UCX and with just libibverbs does not fly -- we see very inconsistent behavior with that. Maybe the issue I'm posting here is related to that, I don't know.

Compilation of UCX is going well, but using the UCX PML in OpenMPI gives problems, namely:

[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
[1588956223.719348] [hostname:55453:0]       ib_iface.c:623  UCX  ERROR iface=0x2535490: failed to create UD QP TX wr:256 sge:2 inl:64 RX wr:4096 sge:1 inl 0: Invalid argument
[hostname:55453] pml_ucx.c:273  Error: Failed to create UCP worker
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      ...
  Framework: pml
--------------------------------------------------------------------------

And this is even without the openib BTL. Also, ucx_info -d gives the same error, i.e.,

[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
[1588956288.177889] [vinci115:55483:0]       ib_iface.c:623  UCX  ERROR iface=0xc69c80: failed to create UD QP TX wr:256 sge:2 inl:64 RX wr:4096 sge:1 inl 0: Invalid argument
#   < failed to open interface >
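
For readers unfamiliar with the numeric code: the `22` printed by the qelr provider is an errno value, and UCX prints the matching text ("Invalid argument") at the end of its own error line. A minimal sketch for decoding such codes, where the helper name `describe_errno` is hypothetical and not part of UCX or rdma-core:

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno value."""
    # errno.errorcode maps numbers to symbolic names on the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# On Linux, 22 is EINVAL, matching the "Invalid argument" in the UCX log.
print(describe_errno(22))
```

EINVAL only says that the driver rejected some argument of the create-QP call; it does not say which one.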

Steps to Reproduce

Simply run ucx_info -d

#ucx_info -v
# UCT version=1.8.0 revision c30b7da
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/software/ucx/1.8.0 --with-verbs --with-rdmacm --enable-devel-headers

No environment flags used.

#cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
#uname -a
Linux hostname 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Wed Feb 26 03:08:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
#rpm -q rdma-core
rdma-core-26.0-8.el8.x86_64
#lspci | grep QLogic
01:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
01:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
#ucx_info -d
#
# Memory domain: posix
#     Component: posix
#             allocate: unlimited
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#   Transport: posix
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#   Transport: sysv
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: self
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#   Transport: tcp
#      Device: eno1
#
#      capabilities:
#            bandwidth: 2829.09/ppn + 0.00 MB/sec
#              latency: 5223 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Connection manager: tcp
#      max_conn_priv: 2040 bytes
#
# Memory domain: sockcm
#     Component: sockcm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: qedr0
#     Component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: qedr0:1
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 800 nsec + 1 * N
#             overhead: 75 nsec
#            put_short: <= 64
#            put_bcopy: <= 8256
#            put_zcopy: <= 2G, up to 3 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 1..2G, up to 3 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 63
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 2 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 127
#           connection: to ep
#             priority: 0
#       device address: 17 bytes
#           ep address: 3 bytes
#       error handling: peer failure
#
#
#   Transport: ud_verbs
#      Device: qedr0:1
[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
[1588956628.651655] [hostname:55526:0]       ib_iface.c:623  UCX  ERROR iface=0x7bf660: failed to create UD QP TX wr:256 sge:2 inl:64 RX wr:4096 sge:1 inl 0: Invalid argument
#   < failed to open interface >
#
# Memory domain: qedr1
#     Component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: qedr1:1
#
#      capabilities:
#            bandwidth: 1095.78/ppn + 0.00 MB/sec
#              latency: 1500 nsec + 1 * N
#             overhead: 75 nsec
#            put_short: <= 64
#            put_bcopy: <= 8256
#            put_zcopy: <= 2G, up to 3 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 1..2G, up to 3 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 63
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 2 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 127
#           connection: to ep
#             priority: 0
#       device address: 17 bytes
#           ep address: 3 bytes
#       error handling: peer failure
#
#
#   Transport: ud_verbs
#      Device: qedr1:1
[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
[1588956628.670611] [hostname:55526:0]       ib_iface.c:623  UCX  ERROR iface=0x7c0c80: failed to create UD QP TX wr:256 sge:2 inl:64 RX wr:4096 sge:1 inl 0: Invalid argument
#   < failed to open interface >
#
# Memory domain: rdmacm
#     Component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#   Transport: cma
#      Device: memory
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#
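
The long listing above boils down to which transports opened and which did not. A rough sketch for summarising `ucx_info -d` output, assuming the layout shown above (a `Transport:` line, a `Device:` line, then either a `capabilities:` section or `< failed to open interface >`); the function name `transport_status` is made up for illustration:

```python
import re


def transport_status(ucx_info_output: str) -> list[tuple[str, str, bool]]:
    """Return (transport, device, opened) for each transport block."""
    results = []
    transport = device = None
    for line in ucx_info_output.splitlines():
        m = re.match(r"#\s+Transport:\s+(\S+)", line)
        if m:
            transport, device = m.group(1), None
            continue
        m = re.match(r"#\s+Device:\s+(\S+)", line)
        if m and transport:
            device = m.group(1)
            continue
        if transport and device:
            # A capabilities section means the interface was opened.
            if "capabilities:" in line:
                results.append((transport, device, True))
                transport = device = None
            elif "failed to open interface" in line:
                results.append((transport, device, False))
                transport = device = None
    return results


# Excerpt copied from the output above.
SAMPLE = """\
#   Transport: rc_verbs
#      Device: qedr0:1
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#
#   Transport: ud_verbs
#      Device: qedr0:1
[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
#   < failed to open interface >
"""

for name, dev, ok in transport_status(SAMPLE):
    print(f"{name:10s} {dev:10s} {'ok' if ok else 'FAILED'}")
```

Applied to the full listing, this would show rc_verbs opening on both qedr0:1 and qedr1:1 while ud_verbs fails on both.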

Additional information (depending on the issue)

OpenMPI version = 4.0.3 config.log

edofrederix avatar May 08 '20 17:05 edofrederix

@edofrederix which UCX version are you using? Can you pls try with the following patch and set UCX_UD_RX_INLINE=0?

diff --git a/src/uct/ib/ud/base/ud_iface.c b/src/uct/ib/ud/base/ud_iface.c
index e666b0b..126ed08 100644
--- a/src/uct/ib/ud/base/ud_iface.c
+++ b/src/uct/ib/ud/base/ud_iface.c
@@ -267,8 +267,7 @@ uct_ud_iface_create_qp(uct_ud_iface_t *self, const uct_ud_iface_config_t *config
     qp_init_attr.cap.max_recv_wr     = config->super.rx.queue_len;
     qp_init_attr.cap.max_send_sge    = 2;
     qp_init_attr.cap.max_recv_sge    = 1;
-    qp_init_attr.cap.max_inline_data = ucs_max(config->super.tx.min_inline,
-                                               UCT_UD_MIN_INLINE);
+    qp_init_attr.cap.max_inline_data = config->super.tx.min_inline;
 
     status = ops->create_qp(&self->super, &qp_init_attr, &self->qp);

yosefe avatar May 08 '20 17:05 yosefe

UCX-1.8.0, and I also tried with the latest master.

Thanks for the suggestion. I applied your patch. Doesn't change anything though, I get the same error.

Not sure how to set UCX_UD_RX_INLINE though -- cannot find any documentation about that variable. I did set UCT_UD_MIN_INLINE to zero in src/uct/ib/ud/base/ud_def.h. Doesn't make a difference. Its only occurrence is in an assert anyway. Can you give a hint on how to set UCX_UD_RX_INLINE?

edofrederix avatar May 08 '20 19:05 edofrederix

Very similar error with UCX-1.4.0 btw:

[qelr_create_cq:258]create cq: failed with rc = 22
[1588966463.076804] [hostname:2107996:0]       ib_iface.c:472  UCX  ERROR ibv_create_cq(cqe=4096) failed: Invalid argument

edofrederix avatar May 08 '20 19:05 edofrederix

> Not sure how to set UCX_UD_RX_INLINE though -- cannot find any documentation about that variable

Sorry, this is an environment variable. Need to set it to 0, along with applying the patch above

yosefe avatar May 08 '20 20:05 yosefe
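
Since the variable is read from the process environment, it can be set for a single run without exporting it globally (in a shell: `UCX_UD_RX_INLINE=0 ucx_info -d`). A small sketch of the same idea from a script; the helper names `ucx_env` and `run_ucx_info` are hypothetical, and the sketch assumes only that `ucx_info` may or may not be on `PATH`:

```python
import os
import shutil
import subprocess


def ucx_env(**overrides: str) -> dict[str, str]:
    """Copy the current environment and add UCX_<NAME>=<value> entries."""
    env = dict(os.environ)
    env.update({f"UCX_{name}": str(value) for name, value in overrides.items()})
    return env


def run_ucx_info(env: dict[str, str]):
    """Run `ucx_info -d` with the given environment; None if not installed."""
    exe = shutil.which("ucx_info")
    if exe is None:
        return None
    proc = subprocess.run([exe, "-d"], env=env, capture_output=True, text=True)
    # The provider and UCX error lines may go to stderr, so keep both streams.
    return proc.stdout + proc.stderr


env = ucx_env(UD_RX_INLINE="0")
output = run_ucx_info(env)
print("ucx_info not found" if output is None else output)
```

Whether a given variable name is recognised depends on the UCX version in use.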

Same thing

# printenv | grep UCX
UCX_ROOT=/software/ucx/1.8.0
UCX_DIR=/software/ucx/1.8.0
UCX_HOME=/software/ucx/1.8.0
UCX_UD_RX_INLINE=0

is my UCX-related environment. Still I get:

[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
[1588969380.594241] [hostname:17141:0]       ib_iface.c:623  UCX  ERROR iface=0x729c40: failed to create UD QP TX wr:256 sge:2 inl:64 RX wr:4096 sge:1 inl 0: Invalid argument

Btw, to make things more confusing: the error from 1.4.0 was actually on a different machine. It has the same hardware but a somewhat different configuration. On the original machine with 1.4.0, when doing ucx_info -d, I get:

[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
[1588969380.594241] [hostname:17141:0]       ib_iface.c:623  UCX  ERROR iface=0x729c40: failed to create UD QP TX wr:256 sge:2 inl:64 RX wr:4096 sge:1 inl 0: Invalid argument

edofrederix avatar May 08 '20 20:05 edofrederix

@edofrederix can you pls provide the output of ibv_devinfo -vv?

yosefe avatar May 09 '20 10:05 yosefe

hca_id: qedr0
        transport:                      InfiniBand (0)
        fw_ver:                         8.37.7.0
        node_guid:                      f6e9:d4ff:fe61:b108
        sys_image_guid:                 f6e9:d4ff:fe61:b108
        vendor_id:                      0x1077
        vendor_part_id:                 32880
        hw_ver:                         0x0
        phys_port_cnt:                  1
        max_mr_size:                    0x10000000000
        page_size_cap:                  0xfffff000
        max_qp:                         8568
        max_qp_wr:                      32767
        device_cap_flags:               0x00209080
                                        CURR_QP_STATE_MOD
                                        RC_RNR_NAK_GEN
                                        MEM_MGT_EXTENSIONS
                                        Unknown flags: 0x8000
        max_sge:                        4
        max_sge_rd:                     4
        max_cq:                         17136
        max_cqe:                        8388480
        max_mr:                         131070
        max_pd:                         65536
        max_qp_rd_atom:                 32
        max_ee_rd_atom:                 0
        max_res_rd_atom:                0
        max_qp_init_rd_atom:            32
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_GLOB (2)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         8568
        max_fmr:                        131071
        max_map_per_fmr:                16
        max_srq:                        8192
        max_srq_wr:                     32767
        max_srq_sge:                    0
        max_pkeys:                      1
        local_ca_ack_delay:             15
        general_odp_caps:
        rc_odp_caps:
                                        NO SUPPORT
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        NO SUPPORT
        xrc_odp_caps:
                                        NO SUPPORT
        completion_timestamp_mask not supported
        core clock not supported
        device_cap_flags_ex:            0x0
        tso_caps:
        max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x80000000
                        port_cap_flags:         0x04000000
                        port_cap_flags2:        0x0000
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            128
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           1X (1)
                        active_speed:           25.0 Gbps (32)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:f6e9:d4ff:fe61:b108
                        GID[  1]:               fe80:0000:0000:0000:f6e9:d4ff:fe61:b108
                        GID[  2]:               0000:0000:0000:0000:0000:ffff:0a05:092e
                        GID[  3]:               0000:0000:0000:0000:0000:ffff:0a05:092e

hca_id: qedr1
        transport:                      InfiniBand (0)
        fw_ver:                         8.37.7.0
        node_guid:                      f6e9:d4ff:fe61:b109
        sys_image_guid:                 f6e9:d4ff:fe61:b109
        vendor_id:                      0x1077
        vendor_part_id:                 32880
        hw_ver:                         0x0
        phys_port_cnt:                  1
        max_mr_size:                    0x10000000000
        page_size_cap:                  0xfffff000
        max_qp:                         8568
        max_qp_wr:                      32767
        device_cap_flags:               0x00209080
                                        CURR_QP_STATE_MOD
                                        RC_RNR_NAK_GEN
                                        MEM_MGT_EXTENSIONS
                                        Unknown flags: 0x8000
        max_sge:                        4
        max_sge_rd:                     4
        max_cq:                         17136
        max_cqe:                        8388480
        max_mr:                         131070
        max_pd:                         65536
        max_qp_rd_atom:                 32
        max_ee_rd_atom:                 0
        max_res_rd_atom:                0
        max_qp_init_rd_atom:            32
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_GLOB (2)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         8568
        max_fmr:                        131071
        max_map_per_fmr:                16
        max_srq:                        8192
        max_srq_wr:                     32767
        max_srq_sge:                    0
        max_pkeys:                      1
        local_ca_ack_delay:             15
        general_odp_caps:
        rc_odp_caps:
                                        NO SUPPORT
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        NO SUPPORT
        xrc_odp_caps:
                                        NO SUPPORT
        completion_timestamp_mask not supported
        core clock not supported
        device_cap_flags_ex:            0x0
        tso_caps:
        max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x80000000
                        port_cap_flags:         0x04000000
                        port_cap_flags2:        0x0000
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            128
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           1X (1)
                        active_speed:           10.0 Gbps (4)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:f6e9:d4ff:fe61:b109
                        GID[  1]:               fe80:0000:0000:0000:f6e9:d4ff:fe61:b109

edofrederix avatar May 09 '20 13:05 edofrederix

FYI, the qedr0 device is connected using 25 Gbps via a router and the qedr1 using 10 Gbps directly to another similar second node on which I'm testing.

edofrederix avatar May 09 '20 14:05 edofrederix
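
One thing the ibv_devinfo output allows is a sanity check of the requested queue sizes against the advertised device limits. The sketch below is only an illustration under stated assumptions: it parses the plain decimal `max_*` fields in the format shown above, and compares them with the values from the UCX error line (TX wr:256 sge:2, RX wr:4096 sge:1). The function names are made up, and the check says nothing about whether the device supports the UD QP type at all.

```python
import re


def parse_devinfo(text: str) -> dict[str, int]:
    """Collect 'max_xxx: <decimal>' fields from ibv_devinfo -v output."""
    limits = {}
    for line in text.splitlines():
        # Hex values and values with a trailing "(n)" are skipped on purpose.
        m = re.match(r"\s*(max_\w+):\s+(\d+)\s*$", line)
        if m:
            limits[m.group(1)] = int(m.group(2))
    return limits


def over_limit(requested: dict[str, int], limits: dict[str, int]) -> list[str]:
    """Return the names of requested values that exceed a known limit."""
    return [k for k, v in requested.items() if k in limits and v > limits[k]]


# Excerpt copied from the qedr0 section above.
DEVINFO = """\
        max_mr_size:                    0x10000000000
        max_qp:                         8568
        max_qp_wr:                      32767
        max_sge:                        4
        max_cqe:                        8388480
"""

# Largest work-request count and SGE count from the failing UCX request.
requested = {"max_qp_wr": max(256, 4096), "max_sge": max(2, 1)}

print(over_limit(requested, parse_devinfo(DEVINFO)))  # → []
```

An empty list means the requested sizes are within the advertised limits, so the EINVAL is unlikely to come from queue depth or SGE count alone.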

@edofrederix can you pls try UCX_UD_TX_MIN_INLINE=0 (instead of UCX_UD_RX_INLINE=0)? Also, are there any errors in dmesg?

yosefe avatar May 11 '20 15:05 yosefe

@yosefe thanks. Same error with that environment variable. In dmesg I do see:

hugetlbfs: ucx_info (4790): Using mlock ulimits for SHM_HUGETLB is deprecated

Not sure if that's related?

edofrederix avatar May 11 '20 16:05 edofrederix

It's not related. In the previous error message there was inl:64:

[1588969380.594241] [hostname:17141:0] ib_iface.c:623 UCX ERROR iface=0x729c40: failed to create UD QP TX wr:256 sge:2 inl:64 RX wr:4096 sge:1 inl 0: Invalid argument

Does inl:64 still appear in the error message when adding UCX_UD_TX_MIN_INLINE=0?

yosefe avatar May 11 '20 16:05 yosefe

@yosefe no it turns to inl:0, i.e.,

[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
[1589222241.869252] [vinci115:12717:0]       ib_iface.c:623  UCX  ERROR iface=0x1931bc0: failed to create UD QP TX wr:256 sge:2 inl:0 RX wr:4096 sge:1 inl 0: Invalid argument

edofrederix avatar May 11 '20 18:05 edofrederix
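
To confirm that only the inline size changed between the two runs, the QP parameters can be pulled out of the UCX error lines and compared. A minimal sketch, assuming the message format seen in this thread (note that the RX inline value is printed as `inl 0` without a colon); `parse_qp_error` and `changed_fields` are hypothetical helper names:

```python
import re

PATTERN = re.compile(
    r"failed to create (?P<qp_type>\w+) QP "
    r"TX wr:(?P<tx_wr>\d+) sge:(?P<tx_sge>\d+) inl:(?P<tx_inl>\d+) "
    r"RX wr:(?P<rx_wr>\d+) sge:(?P<rx_sge>\d+) inl[: ](?P<rx_inl>\d+): "
    r"(?P<reason>.+)$"
)


def parse_qp_error(line: str) -> dict:
    """Extract QP type, TX/RX parameters and reason from a UCX error line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a UCX 'failed to create QP' line")
    fields = m.groupdict()
    # Convert the numeric fields; qp_type and reason stay as strings.
    return {k: int(v) if v.isdigit() else v for k, v in fields.items()}


def changed_fields(a: dict, b: dict) -> dict:
    """Return {field: (old, new)} for fields that differ between two runs."""
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}


# Both lines are copied from this thread.
before = parse_qp_error(
    "[1588969380.594241] [hostname:17141:0]       ib_iface.c:623  UCX  ERROR "
    "iface=0x729c40: failed to create UD QP TX wr:256 sge:2 inl:64 "
    "RX wr:4096 sge:1 inl 0: Invalid argument"
)
after = parse_qp_error(
    "[1589222241.869252] [vinci115:12717:0]       ib_iface.c:623  UCX  ERROR "
    "iface=0x1931bc0: failed to create UD QP TX wr:256 sge:2 inl:0 "
    "RX wr:4096 sge:1 inl 0: Invalid argument"
)

print(changed_fields(before, after))  # → {'tx_inl': (64, 0)}
```

The failure persists with TX inline at 0, which rules out the inline size as the rejected parameter.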

@edofrederix in this case some other UD QP parameter is not accepted by the QLogic driver. Are you able to run a basic ib_send_lat test on the setup? Can you pls post its output?

server: ib_send_lat -c UD
client: ib_send_lat -c UD localhost

yosefe avatar May 11 '20 18:05 yosefe

@yosefe here's the output on both ends:

# ib_send_lat -c UD -d qedr1 -x 3 localhost
[qelr_create_qp:679]create qp: failed on ibv_cmd_create_qp with 22
Unable to create QP.
Failed to create QP.
 Couldn't create IB resources

It does work for the RC connection type. I tried different GIDs and also for the other device (qedr0), but those also fail to create the queue pair.

We've been testing connections with other ibv_* tools (ib_send_bw, ibv_ud_pingpong) before, but never for the UD connection type. Rerunning those tests for UD fails with the same error as for ib_send_lat.

How can we get UD QP to work? Any kernel module parameters you recommend I can play with?

edofrederix avatar May 11 '20 19:05 edofrederix

@edofrederix my knowledge of QLogic devices is limited, but it seems this device doesn't support the UD transport, only RC. UCX, on the other hand, requires the UD transport, at least as a method to bootstrap RC.

yosefe avatar May 11 '20 19:05 yosefe

@yosefe I think something else is going on. I don't think @jgunthorpe accepts rdma-core providers with no UD support, which sounds like a basic feature to have.

shamisp avatar May 11 '20 20:05 shamisp

@yosefe, thanks a lot for your support so far. Assuming that this device indeed lacks UD support, would you expect OpenMPI performance to slow down or break, even without the UCX PML? Maybe a bit of an off-topic question for this issue, but this is where my motivation came from for looking at UCX to begin with.

Can somebody else maybe shed some light on this issue? Thanks!

edofrederix avatar May 12 '20 10:05 edofrederix

> @yosefe, thanks a lot for your support so far. Assuming that this device indeed lacks UD support, would you expect OpenMPI performance to slow down or break, even without the UCX PML? Maybe a bit of an off-topic question for this issue, but this is where my motivation came from for looking at UCX to begin with.

@edofrederix I think it would be better to ask this question in the OpenMPI community (devel list or GitHub)

yosefe avatar May 12 '20 10:05 yosefe

No UD is fine; providers just have to provide some level of RDMA capability, and RC alone is enough. iWarp, for instance, doesn't support UD. I also don't know enough about qedr to tell whether it should even support UD.

jgunthorpe avatar May 14 '20 22:05 jgunthorpe

@yosefe is wire-up over TCP not supported for now?

shamisp avatar May 15 '20 01:05 shamisp

@shamisp it's not supported today

yosefe avatar May 15 '20 07:05 yosefe

Hi @edofrederix,

Did you find any solution to your problem? I kind of have the same issue with an HPE Synergy 4820C adapter (with a Marvell QL45604 network processor). RC transport is working, but OpenMPI fails because UCX needs the UD transport. The openib BTL does not work either.

ojschumann avatar Sep 07 '22 14:09 ojschumann

No. Ended up ditching the QLogics in favor of ConnectX-4s. With those it's working very well.

edofrederix avatar Sep 07 '22 15:09 edofrederix