
Rank to GPU mapping warning

Open denisbertini opened this issue 1 year ago • 21 comments

Using the latest (24.01) WarpX version, I get this warning on an 8-GPU AMD node when using exactly 8 MPI ranks:

Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!

The Slurm submission command is the following:

sbatch  --reservation=dbertini --nodes 1 --ntasks-per-node 8 --cpus-per-task 12 --gres=gpu:8 --gpu-bind verbose,closest --mem-per-gpu 48000 --no-requeue --job-name warpx  --mail-type ALL --mail-user [email protected] --partition gpu --time 7-0:00:00 -D ./ -o %j.out.log -e %j.err.log   ./run-file.sh

Any idea what could be wrong here?

denisbertini avatar Jan 09 '24 09:01 denisbertini

It's a warning. The issue is that each MPI process can see all 8 GPUs. In that case, we try our best to assign GPUs to MPI processes. However, the mapping may not be optimal. For example, a GPU might be assigned to a CPU that is far away from it.

You could try --gpus-per-task=1 instead of --gres=gpu:8 to see if it helps.
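
Something along these lines, adapted from your submit line above (just a sketch, untested; option behavior can vary between Slurm configurations):

sbatch --nodes 1 --ntasks-per-node 8 --cpus-per-task 12 \
       --gpus-per-task=1 --gpu-bind=verbose,closest \
       --partition gpu -D ./ -o %j.out.log -e %j.err.log ./run-file.sh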

WeiqunZhang avatar Jan 09 '24 17:01 WeiqunZhang

Using --gpus-per-task=1 instead of --gres=gpu:8 did not help.

denisbertini avatar Jan 09 '24 20:01 denisbertini

Can I just ignore this warning, or will my job have an unbalanced workload across the GPU devices? How can I change that mapping?

denisbertini avatar Jan 10 '24 16:01 denisbertini

You could ignore it. It's not a correctness issue, but the performance may not be optimal.

You might be able to use ROCR_VISIBLE_DEVICES to control which GPU is visible. For example, on Perlmutter (which has 4 GPUs per node), one could do something like the following (see https://warpx.readthedocs.io/en/latest/install/hpc/perlmutter.html#id1):

# CUDA visible devices are ordered inverse to local task IDs
#   Reference: nvidia-smi topo -m
srun --cpu-bind=cores bash -c "
    export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
    ${EXE} ${INPUTS} ${GPU_AWARE_MPI}" \
  > output.txt

The problem is that the mappings are different on different machines. You may not want to use the Perlmutter approach on other machines. For example, on Frontier, GPU 0 and 1 are closest to NUMA3, GPU 2 and 3 are closest to NUMA1, GPU 4 and 5 are closest to NUMA0, and GPU 6 and 7 are closest to NUMA2. So you should read the user guide for your system or ask the system's administrators for help.
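
As an illustration only, not a recipe: if the local ranks on such a Frontier-like node were packed so that ranks 0-1 sit on NUMA0, ranks 2-3 on NUMA1, ranks 4-5 on NUMA2, and ranks 6-7 on NUMA3 (an assumption you would have to verify for your own machine), the mapping above would translate into something like

# hypothetical wrapper for a Frontier-like node; the rank-to-NUMA placement and the
# GPU numbering below are assumptions to check against your system's topology
srun --cpu-bind=cores bash -c "
    gpus=(4 5 2 3 6 7 0 1);                       # index = local rank, value = closest GPU
    export ROCR_VISIBLE_DEVICES=\${gpus[\$SLURM_LOCALID]};
    ${EXE} ${INPUTS}" \
  > output.txt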

WeiqunZhang avatar Jan 10 '24 17:01 WeiqunZhang

I now run my simulation on 8 nodes, each having 8 GPU devices. From the WarpX output I see the following:

Initializing AMReX (24.01)...
MPI initialized with 64 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 64 devices.
AMReX (24.01) initialized
PICSAR (23.09)
WarpX (24.01)

    __        __             __  __
    \ \      / /_ _ _ __ _ __\ \/ /
     \ \ /\ / / _` | '__| '_ \\  /
      \ V  V / (_| | |  | |_) /  \
       \_/\_/ \__,_|_|  | .__/_/\_\
                        |_|

Level 0: dt = 5.471961054e-15 ; dx = 1.937984496e-06 ; dy = 7.352941176e-05 ; dz = 3.680147059e-05

Grids Summary:
  Level 0   64 grids  2576862720 cells  100 % of domain
            smallest grid: 967 x 204 x 204  biggest grid: 968 x 204 x 204

So 64 MPI ranks for 8*8 GPUs... But how can I be sure that each MPI rank is matched to exactly 1 GPU device? From the warning it is unclear...

denisbertini avatar Jan 10 '24 17:01 denisbertini

The warning is not about multiple processes using the same GPU. It's about a process seeing multiple GPUs. Suppose there are two CPUs, one at the north pole and the other at the south pole, and there are also two GPUs, one at the north pole and the other at the south pole. If each process only sees one GPU, then there is no choice to make and WarpX will not issue a warning, even if the north pole CPU sees only the south pole GPU. If each process sees two GPUs, we will make a choice. But it might be the wrong choice, one that uses the north pole GPU from the south pole CPU.

WeiqunZhang avatar Jan 10 '24 17:01 WeiqunZhang

Ah, it is about affinity optimisation then... not about one MPI rank using 2 GPUs, for example.

denisbertini avatar Jan 10 '24 17:01 denisbertini

Right. It's about affinity between CPUs and GPUs within a node. Unless there are more processes per node than the number of GPUs per node, we will make sure each process uses a unique GPU.
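
If you want to double-check what each rank actually ends up with, a quick diagnostic along these lines (a sketch; it only shows what the environment exposes to each rank, not what WarpX ultimately selects) can be launched with the same Slurm options as the real job:

# print each rank's node, local task ID and visible AMD GPUs (sketch)
srun --cpu-bind=cores bash -c 'echo "rank ${SLURM_PROCID} (local ${SLURM_LOCALID}) on $(hostname): ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES:-unset}"'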

WeiqunZhang avatar Jan 10 '24 17:01 WeiqunZhang

But what if I use the following Slurm option?

--gpu-bind verbose,closest

denisbertini avatar Jan 10 '24 18:01 denisbertini

Another question: I do not see the mem-per-gpu Slurm option used in the WarpX examples. Is there any reason for that?

denisbertini avatar Jan 10 '24 18:01 denisbertini

Indeed, when using ROCR_VISIBLE_DEVICES the warning is gone...

denisbertini avatar Jan 10 '24 18:01 denisbertini

Make sure the right GPU is visible.

WeiqunZhang avatar Jan 10 '24 18:01 WeiqunZhang

I am using the same script as proposed in the documentation:

# CUDA visible devices are ordered inverse to local task IDs
#   Reference: nvidia-smi topo -m
srun --cpu-bind=cores bash -c "
    export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
    ${EXE} ${INPUTS} ${GPU_AWARE_MPI}" \
  > output.txt

and no warning anymore, so I suppose the mapping is now correct...

denisbertini avatar Jan 10 '24 19:01 denisbertini

For Perlmutter, the mapping is: GPU 0 is mapped to Slurm local task ID 3, and so on.

"If each process only sees one GPU, then there is no choice to make and WarpX will not issue a warning even if the north pole CPU sees only the south pole GPU."

No warning does not mean it's correct. It's just that it's not WarpX's fault anymore even if it's incorrect, because WarpX cannot do anything else.

WeiqunZhang avatar Jan 10 '24 19:01 WeiqunZhang

In fact, ignoring the warning might have a better chance of being right than using ROCR_VISIBLE_DEVICES without knowing how the hardware in the system is mapped.

WeiqunZhang avatar Jan 10 '24 20:01 WeiqunZhang

The problem is that the mappings are different on different machines. You may not want to use the Perlmutter approach on other machines. For example, on Frontier, GPU 0 and 1 are closest to NUMA3, GPU 2 and 3 are closest to NUMA1, GPU 4 and 5 are closest to NUMA0, and GPU 6 and 7 are closest to NUMA2. So you should read the user guide for your system or ask the system's administrators for help.

WeiqunZhang avatar Jan 10 '24 20:01 WeiqunZhang

Another thing: in both cases, when using the

GPU_AWARE_MPI=amrex.use_gpu_aware_mpi=1

option, the job crashes with UCX errors:

[1704917108.050163] [lxbk1120:1097129:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7f56dd327140, length=6528, access=0x10000f) failed: Invalid argument
[1704917108.050189] [lxbk1120:1097129:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7f56dd327140 (rocm) length 6528 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
[1704917108.050192] [lxbk1120:1097129:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7f56dd327140 len 6528: Input/output error

Do you have any idea what could be wrong? I use GPU-aware MPI (5.0.1) compiled with UCX (1.15.0).

denisbertini avatar Jan 10 '24 20:01 denisbertini

Without GPU_AWARE_MPI it works perfectly. Does that mean that GPU-aware MPI is not being used in AMReX?

denisbertini avatar Jan 10 '24 20:01 denisbertini

Right, GPU-aware MPI is not used by default.

As for the errors you have observed, I have no clue. Each system is different. Maybe there are certain modules you have to load when you compile. Maybe there are some environment variables you need to set.
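
For instance, something along these lines, where the module names are purely hypothetical placeholders for whatever your site provides (ucx_info and UCX_LOG_LEVEL are standard UCX tools/variables):

# hypothetical site setup; replace the module names with what `module avail` shows on your cluster
module load rocm openmpi ucx
ucx_info -d | grep -i rocm      # check that the UCX build actually exposes ROCm transports
export UCX_LOG_LEVEL=debug      # more verbose UCX logging while debugging registration errors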

WeiqunZhang avatar Jan 10 '24 20:01 WeiqunZhang

@denisbertini Note that the $((3-SLURM_LOCALID)) logic you found is for Perlmutter (NERSC). There, we (@kngott) manually checked the order of exposed GPUs and their locality relative to Slurm MPI process placement (e.g., via hwloc or nvidia-smi topo -m). It is a limitation of the NERSC Slurm configuration that the order is not configured to pin the closest GPU to the closest CPU; thus we manually invert the order of GPU numbers paired with MPI ranks (processes) on the node.

This is likely different on the cluster you are working on, and you have to manually check the topology and/or consult with your system admins.
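
For example, commands along these lines (availability and output format depend on your ROCm and Slurm installations) show a node's GPU/CPU layout, and the verbose keyword in --gpu-bind makes Slurm print the binding it actually applied:

# run on a compute node of the cluster in question
rocm-smi --showtopo        # GPU-to-GPU links and, on recent ROCm, NUMA affinity of each GPU
lscpu | grep NUMA          # which CPU cores belong to which NUMA node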

Crashing GPU-aware MPI is often a symptom of wrong affinity between MPI process and assigned GPU. But even if GPU-aware MPI does not crash, it does not mean that you pinned ideally. (Check the topology of your cluster. Ask your cluster admin what their recommended Slurm options are for a 1:1 MPI-rank-to-GPU mapping.)

@denisbertini I cannot find it in your description yet: which HPC system are you referring to?

ax3l avatar Feb 06 '24 18:02 ax3l

Our "HPC" system is very similar to SPOCK. We just have 8 AMD GPUs instead of 4 / node. For the record, i investigated in details with the openUCX guys the GPU AWARE MPI problem and it turns to be related to our system configuration, See: https://github.com/openucx/ucx/issues/9589#issuecomment-1912110836

denisbertini avatar Feb 06 '24 19:02 denisbertini