Edgar Gabriel
ok, this worked now, thanks! I ran it on 2 nodes, each node with 4 MI100 GPUs + InfiniBand, and it seemed to work without issues (UCX 1.15.0, Open MPI...
Unfortunately not, I cannot update the ROCm version on that cluster; that is not under my control. However, the ROCm version plays a minuscule role in this scenario in my...
I can confirm that the 16-process run (2 nodes with 8 processes/MI100 GPUs each) also finished correctly (though the job complained about too many resources and too little...
It was because of issues making GPUDirectRDMA work. We meanwhile have GPUDirectRDMA working with rdma-core on a cluster that uses Broadcom NICs, but for InfiniBand/Mellanox RoCE HCAs we always...
@denisbertini I submitted a job with the new syntax and will update the ticket once the job finishes. The issue that you are pointing to was a problem that we had...
I reran the code with the changed settings/arguments, and the code still finished correctly on our system:
```
export GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
cd /work1/amd/egabriel/WARPX/bin
/home1/egabriel/OpenMPI/bin/mpirun --mca pml ucx -np 16 -x UCX_LOG_LEVEL=info ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED...
```
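For context, a minimal sketch of how such a run is typically assembled, assuming the exported GPU_AWARE_MPI string is appended to the WarpX command line as an AMReX runtime parameter; the binary name and inputs file below are placeholders, not taken from the run above:
```bash
# Hedged sketch only: "./warpx.3d.example" and "inputs_3d" are placeholder names.
export GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# Launch over the UCX PML and forward the UCX log level to all ranks;
# the AMReX flag is passed as a runtime override after the inputs file.
mpirun --mca pml ucx -np 16 -x UCX_LOG_LEVEL=info \
    ./warpx.3d.example inputs_3d ${GPU_AWARE_MPI}
```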
@arun-chandran-edarath you are on the CPU side of the operations; can you find out what the recommendation from our side would be?
Would it make sense to have this parameter as a configure option?
@lahwaacz thank you for the bug report, we will look into this. The UCC CI checker runs through exactly the same scenario (i.e. compiling UCC with the ROCm stack installed...
@romintomasetti thank you, it is on our list and we definitely plan to have it fixed for the next release. I think the fix is not entirely trivial since cuda_lt.sh is...