ucx
Should there be a fallback when `mlx5dv_devx_create_event_channel` fails?
We often (but not always) see errors like `mlx5dv_devx_create_event_channel() failed: Protocol not supported`, which results in `Input/output error` and our test application failing. We are using UCX 1.10.1 as released on an Ubuntu image.
In this environment we are using CX5 NICs. The system is running MLNX_OFED 4, and we run within a Docker container with MLNX_OFED 5 user-space libraries installed. The error doesn't always happen, which seems odd to me.
I am hoping to find out more about why this error might be intermittent, and ask about whether we should change the behavior in UCX so that there's a fallback mechanism.
@petro-rudenko suggested running with `UCX_TLS=^rc_x`, and I'll try this for now, but I'd like to understand more about why this could be happening. Thanks in advance!
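For anyone wanting to try the same workaround, here is a minimal sketch of applying that setting to a single command rather than exporting it globally. `UCX_TLS=^rc_x` is the value suggested in this thread; the function name `run_without_rc_x` and the application path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run one command with the rc_x transport excluded.
# Only the UCX_TLS value comes from the thread; everything else is illustrative.
run_without_rc_x() {
    # env sets the variable for this command only, leaving the caller's
    # environment untouched.
    env UCX_TLS='^rc_x' "$@"
}

# Usage (placeholder application name):
#   run_without_rc_x ./my_test_app
```

This only sidesteps the failing transport; it does not explain or fix the underlying devx error.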
@Artemy-Mellanox I believe you may be familiar with devx, so pinging you here.
It's not supported to run MOFED 5 userspace with MOFED 4 kernel. Both should use MOFED 5 (or a modern rdma-core and kernel).
@yosefe would running MOFED 4 on both sides be supported? In this scenario, would the devx error be related?
I would not expect to see this devx error if both run MOFED 4.
Are there any advances on this issue? We started seeing this problem as well, running CentOS 7, UCX 1.10, and mlnx-ofed 5.3. We updated to UCX 1.11 and mlnx-ofed 5.4 and the problem is still there. The kernel is 3.10.0-1160.
@heitorPB is the loaded mlx5_ib kernel module coming from mlnx-ofed 5.4?
What is the output of `cat /proc/kallsyms | grep DEVX_ASYNC`?
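The check above can be wrapped in a small function so it gives a yes/no answer instead of raw output. This is only a sketch: the function name `has_devx_async` and the optional file argument are hypothetical, and it assumes that a missing `DEVX_ASYNC` symbol means the loaded `mlx5_ib` module lacks the support the event channel call needs.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kallsyms-style file lists any symbol
# whose name contains DEVX_ASYNC.
# The optional argument (default /proc/kallsyms) only exists so the check
# can be run against a saved copy of the file.
has_devx_async() {
    grep -q 'DEVX_ASYNC' "${1:-/proc/kallsyms}"
}

# Usage:
#   if has_devx_async; then echo "DEVX_ASYNC symbols present"; \
#   else echo "no DEVX_ASYNC symbols in loaded modules"; fi
```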
@yosefe: There's no `DEVX_ASYNC` there :/
It appears that the mlx5_ib module did not come from mlnx-ofed:
$ modinfo mlx5_ib
filename: /lib/modules/3.10.0-1160.62.1.el7.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz
license: Dual BSD/GPL
description: Mellanox Connect-IB HCA IB driver
author: Eli Cohen <[email protected]>
retpoline: Y
rhelversion: 7.9
srcversion: CF2AF2DCF911C7F8B769A7B
depends: mlx5_core,ib_core,ib_uverbs
intree: Y
vermagic: 3.10.0-1160.62.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: 02:0F:57:23:85:87:9F:57:90:20:71:0C:B6:CA:58:B2:43:1B:47:7D
sig_hashalgo: sha256
$ rpm -qf /lib/modules/3.10.0-1160.62.1.el7.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz
kernel-3.10.0-1160.62.1.el7.x86_64
$ cat /proc/kallsyms |grep DEVX_ASYNC
$ cat /proc/kallsyms |grep -i devx
0000000000000000 t mlx5_ib_stage_devx_cleanup [mlx5_ib]
0000000000000000 t mlx5_ib_stage_devx_init [mlx5_ib]
0000000000000000 t devx_get_obj_id [mlx5_ib]
0000000000000000 t devx_is_valid_obj_id [mlx5_ib]
0000000000000000 t devx_is_supported [mlx5_ib]
0000000000000000 t devx_set_umem_valid [mlx5_ib]
0000000000000000 t devx_obj_build_destroy_cmd [mlx5_ib]
0000000000000000 t devx_free_indirect_mkey [mlx5_ib]
0000000000000000 t devx_umem_cleanup [mlx5_ib]
0000000000000000 t devx_get_uid.isra.3 [mlx5_ib]
0000000000000000 t devx_cleanup_mkey.isra.8 [mlx5_ib]
0000000000000000 t devx_obj_cleanup [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_QUERY_EQN [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_QUERY_UAR [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OBJ_CREATE [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OTHER [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OBJ_QUERY [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_OBJ_MODIFY [mlx5_ib]
0000000000000000 t mlx5_ib_handler_MLX5_IB_METHOD_DEVX_UMEM_REG [mlx5_ib]
0000000000000000 r _object_methods_MLX5_IB_OBJECT_DEVX_UMEM [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_UMEM_REG [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_UMEM_DEREG [mlx5_ib]
0000000000000000 r _object_methods_MLX5_IB_OBJECT_DEVX_OBJ [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_CREATE [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_DESTROY [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_MODIFY [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OBJ_QUERY [mlx5_ib]
0000000000000000 r _object_methods_MLX5_IB_OBJECT_DEVX [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_OTHER [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_QUERY_UAR [mlx5_ib]
0000000000000000 r mlx5_ib_method_MLX5_IB_METHOD_DEVX_QUERY_EQN [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_QUERY [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_MODIFY [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_DESTROY [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OBJ_CREATE [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_OTHER [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_QUERY_UAR [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_QUERY_EQN [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_UMEM_DEREG [mlx5_ib]
0000000000000000 r _method_attrs_MLX5_IB_METHOD_DEVX_UMEM_REG [mlx5_ib]
0000000000000000 t mlx5_ib_devx_is_flow_dest [mlx5_ib]
0000000000000000 r mlx5_ib_object_MLX5_IB_OBJECT_DEVX [mlx5_ib]
0000000000000000 t mlx5_ib_devx_destroy [mlx5_ib]
0000000000000000 r mlx5_ib_devx_defs [mlx5_ib]
0000000000000000 t mlx5_ib_devx_create [mlx5_ib]
0000000000000000 r mlx5_ib_object_MLX5_IB_OBJECT_DEVX_OBJ [mlx5_ib]
0000000000000000 r mlx5_ib_object_MLX5_IB_OBJECT_DEVX_UMEM [mlx5_ib]
It appears that we are using the wrong modules? Do you know how to get the correct ones?
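The `modinfo` reading above can be turned into a quick check. A rough sketch, assuming the `intree: Y` field reliably marks a distribution kernel module and that a module built by an external package such as mlnx-ofed would not carry it; the function name `module_origin` and its output labels are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify a module from `modinfo` output on stdin.
# Prints "in-tree" when an "intree: Y" field is present, else "out-of-tree".
module_origin() {
    if grep -Eq '^intree:[[:space:]]*Y[[:space:]]*$'; then
        echo "in-tree"
    else
        echo "out-of-tree"
    fi
}

# Usage:
#   modinfo mlx5_ib | module_origin
# The single field can also be read directly:
#   modinfo -F intree mlx5_ib
```

For the output posted here this would print `in-tree`, matching the `rpm -qf` result that attributes the file to the kernel package.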
Assuming mlnx_ofed 5.4 was really installed with its kernel modules (`ofed_info | grep kernel`), pls try `service openibd restart` or a machine reboot.
Ouch, it appears that mlnx-ofed did not install the modules. Trying to start openibd fails and I see many messages like `openibd: ERROR: Module mlx5_ib belong to kernel which is not a part of MLNX_OFED, skipping...` for other modules as well...
Do you know the procedure or package needed to get the kernel modules from mlnx-ofed?
@heitorPB the default installation command should install the kernel modules as well. Pls see more info here: https://docs.nvidia.com/networking/display/OFED510660/Installing+Mellanox+OFED, or reach out to NVIDIA networking support for further assistance with MLNX_OFED installation.
Thanks for the help @yosefe. We did follow that guide, but something is not right... I will continue digging.