Results: 115 comments of Denis

@DrDaveD `rlx8_ompi_ucx.sif` is a typical container which provides all the libraries and the driver needed to run MPI+UCX on our cluster. I do not think it is linked to that container...
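
For context, this is roughly how what the container ships can be checked (a sketch; it assumes `singularity`/`apptainer` is available and that `ompi_info`/`ucx_info` are on the container's PATH):

```
# Inspect the MPI/UCX stack inside the container (sketch; tool availability may differ)
singularity exec rlx8_ompi_ucx.sif ompi_info | grep -i ucx         # was Open MPI built with the UCX PML?
singularity exec rlx8_ompi_ucx.sif ucx_info -v                     # UCX version / build configuration
singularity exec rlx8_ompi_ucx.sif ucx_info -d | grep -i transport # transports UCX detects from inside
```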

@panda1100 we do not have a system MPI; MPI is only installed inside the container. We do have Slurm + PMIx installed on the system. The Slurm PMIx plugin being responsible...
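
A minimal sketch of how this is wired up, assuming Slurm was built with PMIx support (the application name is a placeholder):

```
# Which MPI/PMIx plugins does the host Slurm expose?
srun --mpi=list

# Slurm/PMIx spawns the ranks on the host; the Open MPI inside the
# container bootstraps through PMIx, so no host MPI is required.
srun --mpi=pmix singularity exec rlx8_ompi_ucx.sif ./my_app
```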

@panda1100 Nope, same problem with Open MPI + verbs only...
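
By "verbs only" I mean forcing Open MPI off the UCX PML and onto ob1/openib, roughly like this (a sketch using Open MPI 4.x MCA names; the exact flags may differ between versions, and the binary name is a placeholder):

```
# UCX path (default when the UCX PML is available)
mpirun --mca pml ucx ./my_app

# Verbs-only path: ob1 PML + openib BTL (deprecated in 4.x, so it must be explicitly allowed)
mpirun --mca pml ob1 --mca btl openib,vader,self \
       --mca btl_openib_allow_ib true ./my_app
```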

Performing the `ib_send_lat` tests:

```
root@server:~# ib_send_lat

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : ...
```
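
The matching client side is started with the server's hostname as argument, e.g. (hostname is a placeholder):

```
root@client:~# ib_send_lat server
```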

Which application? The one that triggers the errors?

This problem does not depend on the application but is linked to the Open MPI + UCX layer. We have already observed this problem with several independent codes. This can be related to this...
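
To help narrow this down at the UCX level, independently of the application, something like the following can be used (these are standard UCX environment variables; the transport list is only an example):

```
# Turn on verbose UCX logging for the failing run
export UCX_LOG_LEVEL=debug

# Restrict UCX to a subset of transports to see whether one of them is the culprit
export UCX_TLS=rc,sm,self

# List the devices/transports UCX detects on the node
ucx_info -d
```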

The simulation which triggers these errors was running on 8 separate nodes, each containing 8 AMD MI100 GPUs. It uses MPI+UCX to communicate between the GPUs intra- and...
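
For reference, the job geometry was roughly the following (a sketch; the option names assume a GRES-configured Slurm and Singularity's ROCm support, and the binary name is a placeholder):

```
#SBATCH --nodes=8             # 8 separate nodes
#SBATCH --ntasks-per-node=8   # one MPI rank per GPU
#SBATCH --gpus-per-node=8     # 8x MI100 per node

# ROCm devices are exposed to the container with --rocm; all GPU-to-GPU
# traffic (intra- and inter-node) goes through MPI+UCX.
srun --mpi=pmix singularity exec --rocm rlx8_ompi_ucx.sif ./simulation
```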

BTW, a particularity of our IB fabric is that we use HDR splitter cables everywhere and all HDR switches are set to full splitting mode. Do you think this can...
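
In case it helps, the negotiated rate on the split ports can be checked on the hosts with the standard InfiniBand diagnostics (HDR splitters normally bring the ports up as HDR100, i.e. 100 Gb/s):

```
# Show port state and negotiated rate for every local HCA port
ibstat | grep -E "State|Rate"

# Alternative per-port summary
ibstatus
```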

I did not run the `ib_send_lat` tests myself but asked our sysadmin colleague to do them.