UCX ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory

Open puneet336 opened this issue 2 years ago • 3 comments

While trying to run an application, I get the following error:

[0] MPI startup(): libfabric provider: mlx
[1645552289.027611] [m7-31:723275:0] ib_mlx5_dv.c:161 UCX ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory
Abort(1091215) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1139)..............:

OS: RHEL 8.4
UCT version=1.11.1 revision c58db6b
MPI: Intel MPI 2019u12 and Open MPI 4.1.2

The issue shows up only when I subscribe all the available cores on the node (128), i.e.:
mpirun -np 512 -ppn 128 ./executable

Combinations like
mpirun -np 256 -ppn 64 ./executable
mpirun -np 128 -ppn 32 ./executable

work without issues. UCX_TLS is set to self,sm,rc_x.

Please advise.

Setting UCX_IB_NUM_PATH=1 did not help.

puneet336 avatar Feb 22 '22 18:02 puneet336

@puneet336 can you please run the test with UCX_LOG_LEVEL=info and provide the output?

yosefe avatar Feb 23 '22 09:02 yosefe

I have switched the MPI implementation to Open MPI 4.1.2, and I get the same issue with UCX_TLS=self,sm,rc_x. The error log is here: https://github.com/puneet336/MOM5/blob/main/logs01

puneet336 avatar Mar 02 '22 11:03 puneet336

It's expected to fail with UCX_TLS=self,sm,rc_x at large scale. However, when UCX_TLS is not set explicitly, the dc or ud transports should be selected (rather than rc) at large scale. The question is whether any other component in the system or application changes or forces the default value of UCX_TLS. We could see this by running with UCX_LOG_LEVEL=info.
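
A rough way to see why only the fully subscribed run might hit the limit is to estimate how many RC queue pairs (QPs) a single node would need. The sketch below is only a back-of-the-envelope upper bound, not how UCX actually counts resources: it assumes every rank ends up connected to every rank on other nodes, that each such connection costs one QP on each side, and that ranks on the same node use sm instead. The function name `rc_qps_per_node` is made up for illustration, and the real limit depends on the HCA, firmware, and driver configuration, which the thread does not state.

```python
def rc_qps_per_node(np_total: int, ppn: int) -> int:
    """Worst-case count of RC queue pairs held by one node.

    Assumptions (not taken from the thread): full all-to-all connectivity
    between ranks on different nodes, one QP per remote peer per rank, and
    shared memory (sm) for ranks on the same node.
    """
    if ppn <= 0 or np_total < ppn:
        raise ValueError("need 0 < ppn <= np_total")
    remote_peers = np_total - ppn  # ranks that live on other nodes
    return ppn * remote_peers      # every local rank connects to each of them


# The three launch configurations mentioned in the issue.
for np_total, ppn in [(512, 128), (256, 64), (128, 32)]:
    print(f"-np {np_total} -ppn {ppn}: up to {rc_qps_per_node(np_total, ppn)} QPs per node")
```

Under these assumptions the failing run needs up to 49152 QPs per node, against 12288 and 3072 for the two runs that work, so demand grows 16x from the smallest case to the largest. Connection-based transports such as rc scale this way, which is consistent with dc or ud being preferred at large scale.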

yosefe avatar Mar 02 '22 12:03 yosefe
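
As a complement to the UCX_LOG_LEVEL=info output requested above, a quick check of the launch environment can show whether something (a module file, a job script, an MPI wrapper) has already set UCX_TLS. This is a minimal sketch with hypothetical helper names (`ucx_settings`, `forces_rc`); it only inspects the environment of the process that runs it, so it would need to run inside the same job environment as the application, and it does not replace the UCX log. The handling of a leading `^` as an exclusion list is an assumption about UCX_TLS syntax that should be confirmed against the UCX documentation.

```python
import os


def ucx_settings(environ=None):
    """Return all UCX_* variables from the given mapping (default: os.environ)."""
    env = os.environ if environ is None else environ
    return {k: v for k, v in sorted(env.items()) if k.startswith("UCX_")}


def forces_rc(settings):
    """True if UCX_TLS explicitly lists an rc transport (rc, rc_x, rc_v, ...)."""
    tls = settings.get("UCX_TLS", "").strip()
    if not tls or tls.startswith("^"):
        # Unset, or (assumed) exclusion list: rc is not being forced.
        return False
    names = [t.strip() for t in tls.split(",") if t.strip()]
    return any(n == "rc" or n.startswith("rc_") for n in names)


if __name__ == "__main__":
    found = ucx_settings()
    for key, value in found.items():
        print(f"{key}={value}")
    if forces_rc(found):
        print("UCX_TLS forces an rc transport; try unsetting it so UCX can pick dc or ud.")
```

If UCX_TLS appears here without having been set by hand, that points to another component forcing it, which is the question raised in the comment above.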