Denis
@DrDaveD `rlx8_ompi_ucx.sif` is a typical container which defines all the libraries and drivers needed to run MPI+UCX on our cluster. I do not think it is linked to that container...
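If it helps, a quick way to confirm which Open MPI and UCX versions the container actually ships is to query them directly. This is just a suggestion, assuming the image is run with `apptainer` (substitute `singularity` if that is the frontend in use):

```
# Print the Open MPI and UCX versions bundled in the container image
apptainer exec rlx8_ompi_ucx.sif ompi_info --version
apptainer exec rlx8_ompi_ucx.sif ucx_info -v
```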
@panda1100 we do not have a system MPI; MPI is only installed inside the container. We do have SLURM + PMIx installed on the system, the SLURM PMIx plugin being responsible...
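For context, a minimal sketch of how such a container-only MPI is typically launched through the SLURM PMIx plugin; the node/task counts and the binary name are placeholders, not taken from this thread:

```
# Launch the containerized Open MPI via SLURM's PMIx plugin;
# no MPI is installed on the host, only inside the image.
srun --mpi=pmix -N 2 --ntasks-per-node=8 \
     apptainer exec rlx8_ompi_ucx.sif ./my_mpi_app   # ./my_mpi_app is hypothetical
```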
@panda1100 nope, same problem with Open MPI + verbs only...
Performing the `ib_send_lat` tests:

```
root@server:~# ib_send_lat

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : ...
```
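For completeness, the matching client side of the latency test is pointed at that server; the hostname below is a placeholder, not from the thread:

```
# On the client node; <server> is the hostname/IP of the node waiting above
ib_send_lat <server>
```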
Which application? The one which triggers the errors?
This problem does not depend on the application but is linked to the Open MPI + UCX layer. We have already observed this problem with several independent codes. This can be related to this...
The simulation which triggers these errors was running on 8 separate nodes, each containing 8 AMD MI100 GPUs. It uses MPI+UCX to communicate between the GPUs intra- and...
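One sanity check that might be worth doing is verifying that the UCX build inside the container exposes the ROCm transports used for those GPU-to-GPU transfers; something like the following would list them (the grep pattern is only illustrative):

```
# List the UCX transports/devices visible inside the container and look for ROCm ones
apptainer exec rlx8_ompi_ucx.sif ucx_info -d | grep -i rocm
```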
BTW, a particularity of our IB fabric is that we use HDR splitter cables everywhere and all HDR switches are set to full splitting mode. Do you think this can...
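Since a split HDR port comes up as HDR100, one way to double-check what each HCA actually negotiated is to look at the reported link rate (a split link should show 100 Gb/s rather than 200 Gb/s):

```
# On a compute node: show port state and negotiated rate for each HCA
ibstat
```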
No idea for the moment!
I did not run the `ib_send_lat` tests myself; I asked our sysadmin colleague to do them.