Results: 115 comments of Denis

Could you ask the sysadmin of your cluster the reason(s) why they had to move from rdma-core to the official MOFED library? It would be very interesting for us to know...

Could you please try the test again with the modified submit script:
```
# GPU-aware MPI optimizations
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
# executable & inputs file
EXE=warpx_3d
INPUTS=inputs_3d.txt
srun --export=ALL --cpu-bind=cores ${EXE} ${INPUTS} ...
```
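For context, here is a minimal sketch of how such a submit script typically fits together on a Slurm cluster with 8 GPUs per node. The account, partition, time limit, and CPU counts are placeholders (they are not from this thread), and appending `${GPU_AWARE_MPI}` after the inputs file is the usual way to pass an AMReX runtime parameter on the command line:
```
#!/usr/bin/env bash
#SBATCH --job-name=warpx_3d
#SBATCH --nodes=8                 # placeholder: 8 nodes with 8 GPUs each
#SBATCH --ntasks-per-node=8       # one MPI rank per GPU
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=8         # placeholder CPU count per rank
#SBATCH --time=01:00:00           # placeholder
#SBATCH --account=<project>       # placeholder
#SBATCH --partition=<gpu_queue>   # placeholder

# GPU-aware MPI optimizations (AMReX runtime parameter)
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# executable & inputs file
EXE=warpx_3d
INPUTS=inputs_3d.txt

# runtime parameter appended after the inputs file
srun --export=ALL --cpu-bind=cores ${EXE} ${INPUTS} ${GPU_AWARE_MPI}
```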

@edgargabriel Redoing the same test gives another type of error:
```
[lxbk1087:2070485:0:2070485] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid:2070485) ====
 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f4132b0f0a4]
 1 /usr/local/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xb0) [0x7f4132b0c070]
 2 /usr/local/ucx/lib/libucs.so.0(+0x2a151)...
```

Using `--gpus-per-task=1` instead of `--gres=gpu:8` did not help.
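For reference, a sketch of the two allocation styles being compared here; the node and task counts are taken from this thread, the rest of the header is assumed:
```
# per-node style: request 8 GPUs on each node, launch 8 ranks per node
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

# per-task style: let Slurm attach one GPU to every task
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
```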

Can I just ignore this warning, or will my job have an unbalanced workload on the GPU devices? How can I change that mapping?

I am now running my simulation on 8 nodes, each with 8 GPU devices. From the WarpX output I see the following:
```
Initializing AMReX (24.01)...
MPI initialized with 64 MPI...
```

Ah, it is about affinity optimisation then, not about one MPI rank using 2 GPUs, for example.
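If the automatic mapping turns out to be unsuitable, one common workaround (a sketch of a generic technique, not something prescribed in this thread) is a small wrapper script that pins each rank to the GPU matching its node-local rank, using the `SLURM_LOCALID` variable that srun sets for every task:
```
#!/usr/bin/env bash
# select_gpu.sh -- hypothetical wrapper: one GPU per MPI rank, chosen by node-local rank
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
exec "$@"
```
It would then be launched as `srun --export=ALL --cpu-bind=cores ./select_gpu.sh ${EXE} ${INPUTS}`, assuming 8 ranks per node on 8-GPU nodes so that local IDs 0..7 map onto devices 0..7.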

But what if I use the following Slurm option: `--gpu-bind=verbose,closest`?
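For illustration, a sketch of how that option would appear on the launch line (assuming Slurm's documented `--gpu-bind=[verbose,]<type>` syntax); `closest` asks Slurm to bind each task to the GPU(s) nearest to its allocated CPUs, and the `verbose` prefix prints the chosen binding:
```
srun --export=ALL --cpu-bind=cores --gpu-bind=verbose,closest ${EXE} ${INPUTS}
```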

Another question: I do not see the `--mem-per-gpu` Slurm option used in the WarpX example. Is there any reason for that?
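For illustration, this is how the option would typically appear in a submit header; the 64G figure is only a placeholder, not a value taken from the WarpX documentation:
```
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-gpu=64G   # placeholder: ties the host-memory request to each allocated GPU
```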