UCX ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory
While trying to run an application, I get the following error:

[0] MPI startup(): libfabric provider: mlx
[1645552289.027611] [m7-31:723275:0] ib_mlx5_dv.c:161 UCX ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory
Abort(1091215) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1139)..............:
OS: RHEL 8.4
UCT version=1.11.1 revision c58db6b
MPI: Intel MPI 2019u12 and Open MPI 4.1.2
The issue shows up only when I subscribe all the available cores on each node (128), i.e.:

mpirun -np 512 -ppn 128 ./executable

Combinations like

mpirun -np 256 -ppn 64 ./executable
mpirun -np 128 -ppn 32 ./executable

work without issues. UCX_TLS is set to self,sm,rc_x.

Please advise.
Setting UCX_IB_NUM_PATH=1 did not help.
@puneet336 can you please run the test with UCX_LOG_LEVEL=info and provide the output?
I have switched the MPI implementation to Open MPI 4.1.2, and I get the same issue with UCX_TLS=self,sm,rc_x. The error log is here: https://github.com/puneet336/MOM5/blob/main/logs01
This is expected to fail with UCX_TLS=self,sm,rc_x at large scale.
However, when UCX_TLS is not set explicitly, the dc or ud transports should be selected (rather than rc) at large scale.
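To give a feel for why rc becomes a problem as the rank count grows, here is a rough back-of-the-envelope sketch in Python. It is not taken from UCX itself. It assumes that each rank eventually opens one RC queue pair (QP) to every rank on a different node, and that same-node peers are served by shared memory. The function name is hypothetical, and real QP counts depend on how UCX establishes connections and on device limits.

```python
def rc_qp_upper_bound(total_ranks: int, ranks_per_node: int) -> int:
    """Rough upper bound on RC QPs opened on a single node.

    Assumption (not confirmed by the thread): one QP per remote-node peer
    per rank, with same-node peers handled by shared memory.
    """
    remote_peers_per_rank = total_ranks - ranks_per_node
    return ranks_per_node * remote_peers_per_rank


if __name__ == "__main__":
    # The three launch configurations mentioned in the report.
    for total, ppn in [(512, 128), (256, 64), (128, 32)]:
        bound = rc_qp_upper_bound(total, ppn)
        print(f"-np {total} -ppn {ppn}: up to {bound} RC QPs per node")
```

Under these assumptions the per-node bound grows with the square of the per-node rank count for a fixed number of nodes, which is consistent with only the fully subscribed run failing.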
The question is whether any other component in the system or application changes or forces the default value of UCX_TLS. We could see this by running with UCX_LOG_LEVEL=info.
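As a quick first check before collecting the info-level log, a small Python sketch like the one below can list any UCX_* variables present in the launching environment. The helper name is hypothetical. It only inspects the environment of the process that runs it, so it would not reveal values set by a configuration file or injected by the MPI launcher on the compute nodes.

```python
import os


def ucx_env_overrides(environ=None):
    """Return the UCX_* variables found in the given environment mapping.

    Defaults to the current process environment when no mapping is passed.
    """
    env = os.environ if environ is None else environ
    return {name: value for name, value in sorted(env.items())
            if name.startswith("UCX_")}


if __name__ == "__main__":
    overrides = ucx_env_overrides()
    if not overrides:
        print("No UCX_* variables are set in this environment")
    for name, value in overrides.items():
        print(f"{name}={value}")
```

If UCX_TLS appears here without having been set deliberately, a module file or job script is a likely source.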