
Should there be a fallback when `mlx5dv_devx_create_event_channel` fails?

Open abellina opened this issue 3 years ago • 11 comments

We often (but not always) see errors like:

mlx5dv_devx_create_event_channel() failed: Protocol not supported

This results in an Input/output error, and our test application fails. We are using UCX 1.10.1 as released on an Ubuntu image.

In this environment we are using CX5 NICs. The system is running MLNX_OFED 4, and we run within a docker container with MLNX_OFED 5 user-space libraries installed. The error doesn't always happen, which seems odd to me.

I am hoping to find out more about why this error might be intermittent, and ask about whether we should change the behavior in UCX so that there's a fallback mechanism.

@petro-rudenko suggested running with UCX_TLS=^rc_x and I'll try this for now, but I'd like to understand more about why this could be happening. Thanks in advance!
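
For reference, a minimal sketch of that workaround as a shell wrapper. The function name `run_without_rc_x` and the `./my_test_app` path are placeholders, not something from this thread; the only thing taken from the suggestion is the `UCX_TLS=^rc_x` setting, where the leading `^` tells UCX to exclude the listed transport rather than select it.

```shell
#!/bin/sh
# Run a command with the rc_x transport excluded from UCX's transport list.
# The caret is quoted so that no shell tries to interpret it.
run_without_rc_x() {
    env UCX_TLS='^rc_x' "$@"
}

# Hypothetical usage (./my_test_app is a placeholder):
#   run_without_rc_x ./my_test_app
```

Setting the variable through `env` keeps it scoped to that one command, so other UCX applications started from the same shell are unaffected.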

abellina avatar Jun 04 '21 15:06 abellina

@Artemy-Mellanox I believe you may be familiar with devx, so pinging you here.

abellina avatar Jun 04 '21 15:06 abellina

It's not supported to run MOFED 5 userspace with MOFED 4 kernel. Both should use MOFED 5 (or a modern rdma-core and kernel).

yosefe avatar Jun 05 '21 19:06 yosefe

@yosefe would both running MOFED 4 be supported? In this scenario, would the devx error be related?

abellina avatar Jun 06 '21 16:06 abellina

I would not expect to see this devx error if both run MOFED 4.

yosefe avatar Jun 06 '21 19:06 yosefe

Are there any advances on this issue? We started seeing this problem as well. Running CentOS 7, ucx 1.10, mlnx-ofed 5.3. We updated ucx to 1.11 and mlnx-ofed to 5.4 and the problem is still there. Kernel 3.10.0-1160

heitorPB avatar Apr 29 '22 18:04 heitorPB

@heitorPB is the loaded mlx5_ib kernel module coming from mlnx-ofed 5.4? what is the output of cat /proc/kallsyms |grep DEVX_ASYNC?
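
A small sketch of the same check as a reusable function, in case it needs to be run on several hosts. The function name `has_devx_async` and its optional file argument are assumptions made for illustration; the check itself is just the `grep DEVX_ASYNC` on `/proc/kallsyms` asked for above.

```shell
#!/bin/sh
# Succeeds (exit status 0) if any symbol name in the given kallsyms-style
# file contains DEVX_ASYNC. Defaults to the running kernel's /proc/kallsyms.
has_devx_async() {
    grep -q 'DEVX_ASYNC' "${1:-/proc/kallsyms}"
}

# Usage on a live system:
#   if has_devx_async; then
#       echo "DEVX_ASYNC symbols found"
#   else
#       echo "no DEVX_ASYNC symbols in the loaded modules"
#   fi
```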

yosefe avatar Apr 29 '22 19:04 yosefe

@yosefe:

There's no DEVX_ASYNC there :/

It appears that the mlx5_ib module did not come from mlnx-ofed:

$ modinfo mlx5_ib
filename:       /lib/modules/3.10.0-1160.62.1.el7.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz
license:        Dual BSD/GPL
description:    Mellanox Connect-IB HCA IB driver
author:         Eli Cohen <[email protected]>
retpoline:      Y
rhelversion:    7.9
srcversion:     CF2AF2DCF911C7F8B769A7B
depends:        mlx5_core,ib_core,ib_uverbs
intree:         Y
vermagic:       3.10.0-1160.62.1.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        02:0F:57:23:85:87:9F:57:90:20:71:0C:B6:CA:58:B2:43:1B:47:7D
sig_hashalgo:   sha256

$ rpm -qf /lib/modules/3.10.0-1160.62.1.el7.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz
kernel-3.10.0-1160.62.1.el7.x86_64

$ cat /proc/kallsyms |grep DEVX_ASYNC
$ cat /proc/kallsyms |grep -i devx
0000000000000000 t mlx5_ib_stage_devx_cleanup   [mlx5_ib]
0000000000000000 t mlx5_ib_stage_devx_init      [mlx5_ib]
0000000000000000 t devx_get_obj_id      [mlx5_ib]
0000000000000000 t devx_is_valid_obj_id [mlx5_ib]
0000000000000000 t devx_is_supported    [mlx5_ib]
0000000000000000 t devx_set_umem_valid  [mlx5_ib]
0000000000000000 t devx_obj_build_destroy_cmd   [mlx5_ib]
0000000000000000 t devx_free_indirect_mkey      [mlx5_ib]
0000000000000000 t devx_umem_cleanup    [mlx5_ib]
0000000000000000 t devx_get_uid.isra.3  [mlx5_ib]
0000000000000000 t devx_cleanup_mkey.isra.8     [mlx5_ib]
0000000000000000 t devx_obj_cleanup     [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_QUERY_EQN        [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_QUERY_UAR        [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OBJ_CREATE       [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OTHER    [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OBJ_QUERY        [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OBJ_MODIFY       [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_UMEM_REG [mlx5_ib]
0000000000000000 r _object_methods_MLX5_IB_OBJECT_DEVX_UMEM     [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_UMEM_REG  [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_UMEM_DEREG        [mlx5_ib]
0000000000000000 r _object_methods_MLX5_IB_OBJECT_DEVX_OBJ      [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_CREATE        [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_DESTROY       [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_MODIFY        [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_QUERY [mlx5_ib]
0000000000000000 r _object_methods_MLX5_IB_OBJECT_DEVX  [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OTHER     [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_QUERY_UAR [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_QUERY_EQN [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_QUERY  [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_MODIFY [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_DESTROY        [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_CREATE [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OTHER      [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_QUERY_UAR  [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_QUERY_EQN  [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_UMEM_DEREG [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_UMEM_REG   [mlx5_ib]
0000000000000000 t mlx5_ib_devx_is_flow_dest    [mlx5_ib]
0000000000000000 r mlx5_ib_object_MLX5_IB_OBJECT_DEVX   [mlx5_ib]
0000000000000000 t mlx5_ib_devx_destroy [mlx5_ib]
0000000000000000 r mlx5_ib_devx_defs    [mlx5_ib]
0000000000000000 t mlx5_ib_devx_create  [mlx5_ib]
0000000000000000 r mlx5_ib_object_MLX5_IB_OBJECT_DEVX_OBJ       [mlx5_ib]
0000000000000000 r mlx5_ib_object_MLX5_IB_OBJECT_DEVX_UMEM      [mlx5_ib]

It appears that we are using the wrong modules? Do you know how to get the correct ones?
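
The part of the `modinfo` output above that gives it away is the `intree: Y` field together with the distro kernel path. A rough sketch for checking just that field follows; `is_intree_module` is a made-up helper name, and it only looks at the `intree` line, so treat it as a quick hint rather than a definitive test of where a module came from.

```shell
#!/bin/sh
# Reads `modinfo` output on stdin and succeeds if it contains "intree: Y",
# i.e. the module was built as part of the kernel source tree.
is_intree_module() {
    grep -Eq '^intree:[[:space:]]+Y[[:space:]]*$'
}

# Usage on a live system:
#   if modinfo mlx5_ib | is_intree_module; then
#       echo "mlx5_ib is the in-tree (distro kernel) module"
#   fi
```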

heitorPB avatar Apr 29 '22 19:04 heitorPB

assuming mlnx_ofed 5.4 was really installed with its kernel modules (ofed_info|grep kernel), pls try service openibd restart or machine reboot

yosefe avatar Apr 29 '22 19:04 yosefe

Ouch, it appears that mlnx-ofed did not install the modules. Trying to start openibd fails and I see many messages like openibd: ERROR: Module mlx5_ib belong to kernel which is not a part of MLNX_OFED, skipping... for other modules as well...

Do you know which procedure or package installs the kernel modules from mlnx-ofed?

heitorPB avatar Apr 29 '22 19:04 heitorPB

@heitorPB the default installation command should install kernel module as well. pls see more info here: https://docs.nvidia.com/networking/display/OFED510660/Installing+Mellanox+OFED, or reach out to NVIDIA networking support for further assistance with MLNX_OFED installation.

yosefe avatar Apr 29 '22 20:04 yosefe

Thanks for the help @yosefe. We did follow that guide, but something is not right... I will continue digging.

heitorPB avatar Apr 29 '22 20:04 heitorPB