Edgar Gabriel
ok, this worked now, thanks! I ran it on 2 nodes, each node with 4 MI100 GPUs + InfiniBand, and it seemed to work without issues (UCX 1.15.0, Open MPI...
Unfortunately not, I cannot update the ROCm version on that cluster; that is not under my control. However, the ROCm version plays a minuscule role in this scenario in my...
I can confirm that the 16-process run (2 nodes with 8 processes/MI100 GPUs each) also finished correctly (though the job complained about too many resources and too little...
It was because of issues making GPUDirectRDMA work. We meanwhile have GPUDirectRDMA working with rdma-core on a cluster that uses Broadcom NICs, but for InfiniBand/Mellanox RoCE HCAs we always...
@denisbertini I submitted a job with the new syntax and will update the ticket once the job finishes. The issue that you are pointing to was a problem that we had...
I reran the code with the changed settings/arguments, and the code still finished correctly on our system:
```
export GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
cd /work1/amd/egabriel/WARPX/bin
/home1/egabriel/OpenMPI/bin/mpirun --mca pml ucx -np 16 -x UCX_LOG_LEVEL=info ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED...
```
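For context, a minimal sketch of how such a run is typically assembled, assuming the exported GPU_AWARE_MPI string is appended to the WarpX command line as an AMReX runtime parameter; the binary name and inputs file below are placeholders, not taken from the run above:
```bash
# Hedged sketch only: "./warpx.3d.example" and "inputs_3d" are placeholder names.
export GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# Launch over the UCX PML and forward the UCX log level to all ranks;
# the AMReX flag is passed as a runtime override after the inputs file.
mpirun --mca pml ucx -np 16 -x UCX_LOG_LEVEL=info \
    ./warpx.3d.example inputs_3d ${GPU_AWARE_MPI}
```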
@arun-chandran-edarath you are on the CPU side of the operations; can you find out what the recommendation from our side would be?
Would it make sense to have this parameter as a configure option?
@lahwaacz thank you for the bug report, we will look into this. The UCC CI checker runs through exactly the same scenario (i.e. compiling UCC with the ROCm stack installed...
@romintomasetti thank you, it is on our list and we definitely plan to have it fixed for the next release. I think the fix is not entirely trivial since cuda_lt.sh is...