
enable GPU-aware MPI when performance conditions are met

BenWibking opened this issue on Sep 29 '22

GPU-aware MPI significantly improves performance over the default host-pinned buffers in AMReX if two conditions are satisfied:

  1. CUDA_VISIBLE_DEVICES is not set (e.g., when using SLURM, pass --gpu-bind=none).
  2. Managed memory is not used (set amrex.the_arena_is_managed=0).

If both of these are satisfied, then OpenMPI (at least) uses CUDA IPC (via the UCX cuda_ipc transport) to perform device-to-device copies over NVLink. In my tests, this is significantly faster than using AMReX's host-pinned buffers on A100 + NVLink systems, leading to on-node scaling that is essentially perfect (95-99% scaling efficiency).
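
For concreteness, here is a minimal sketch of how an application could opt in only when both conditions hold. The ParmParse-callback overload of amrex::Initialize and the exact parameter spellings are my best guess and should be checked against your AMReX version:

```cpp
#include <cstdlib>
#include <mpi.h>
#include <AMReX.H>
#include <AMReX_ParmParse.H>

// Sketch only: enable GPU-aware MPI from the application side when both
// conditions above appear to hold.
int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv, true, MPI_COMM_WORLD, [] ()
    {
        amrex::ParmParse pp("amrex");

        // Condition 2: managed memory must be off (amrex.the_arena_is_managed=0).
        int managed = 1;  // conservative assumption if the parameter is unset
        pp.query("the_arena_is_managed", managed);

        // Condition 1: no per-rank GPU isolation via CUDA_VISIBLE_DEVICES.
        const bool restricted = (std::getenv("CUDA_VISIBLE_DEVICES") != nullptr);

        if (managed == 0 && !restricted) {
            pp.add("use_gpu_aware_mpi", 1);  // pass device pointers to MPI
        }
    });

    // ... application ...

    amrex::Finalize();
}
```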

Scaling tests: https://github.com/BenWibking/quokka/pull/121
OpenMPI issue: https://github.com/open-mpi/ompi/issues/10871

BenWibking, Sep 29 '22

Thanks for the information! Nice scaling results.

In AMReX, we know the number of visible GPU devices (https://github.com/AMReX-Codes/amrex/blob/e55d6b4f5375efb22ebed9b467878e301763073b/Src/Base/AMReX_GpuDevice.cpp#L163), so I think we should be able to change the default at runtime based on that.
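
Roughly something like this (illustrative only, not the actual AMReX code; the managed flag would come from amrex.the_arena_is_managed):

```cpp
#include <cuda_runtime.h>

// Illustrative sketch: decide the GPU-aware MPI default from the device
// count that AMReX_GpuDevice.cpp already queries at the line linked above.
bool default_use_gpu_aware_mpi (bool arena_is_managed)
{
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess) { ndev = 0; }

    // If each rank can see more than one device, CUDA_VISIBLE_DEVICES (or a
    // cgroup) is not isolating GPUs, so CUDA IPC between on-node ranks can
    // work; also require that buffers are plain device memory.
    return (ndev > 1) && !arena_is_managed;
}
```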

I can do a draft PR. Would you be able to help us to test? Thank you in advance.

WeiqunZhang, Sep 29 '22

I can definitely help test. Thanks!

BenWibking, Sep 29 '22

This is one where the multi-node behavior may be significantly different from the intra-node behavior. MPI implementations also differ in their CUDA-aware support, and networking hardware varies. So you would want to test this on multiple setups and at multiple scales before changing the default. (There is also the wrinkle that MPI builds may not always have GPU awareness turned on, although this is less frequent lately.)

maxpkatz, Sep 29 '22

I've queued 8-node and 64-node runs on NCSA Delta (which has Slingshot, but only 1 NIC per node). Unfortunately, this cluster only has OpenMPI (it's supposed to have the Cray environment, eventually). Other people will have to test other configurations.

Edit: On 8 nodes, CUDA-aware is still a win, but only by 3%. This might be because this system has only 1 NIC per node but 4 GPUs per node.

BenWibking, Sep 29 '22

I did some tests on perlmutter. On a single node, the communication was about 10-20% faster with cuda-aware mpi. But on 8 nodes, it was actually slightly slower.

WeiqunZhang, Sep 29 '22

That's interesting. I assume this was with Cray MPI?

I will be able to test on InfiniBand + V100 + OpenMPI tomorrow.

BenWibking, Sep 30 '22

On V100/InfiniBand/OpenMPI, GPU-aware MPI is a significant improvement on a single node, and on 8 nodes it is a ~3% improvement. So on all the machines I currently have access to, GPU-aware always wins.

It would be good to know if this is something only seen on OpenMPI, or if there's another explanation. ~~Also unknown whether this applies to AMD devices at all.~~

Edit: On an 8x MI100 node with OpenMPI, GPU-awareness improves performance by ~10% compared to host-pinned buffers. I don't have access to a multi-node AMD system to test the multi-node case. GPU-aware performance does not appear to be affected by the GPU binding.

BenWibking, Sep 30 '22

> I did some tests on perlmutter. On a single node, the communication was about 10-20% faster with cuda-aware mpi. But on 8 nodes, it was actually slightly slower.

@WeiqunZhang Does lowering the value of MPICH_GPU_IPC_THRESHOLD change this result? It looks like the default is 8192: https://www.olcf.ornl.gov/wp-content/uploads/2021/04/HPE-Cray-MPI-Update-nfr-presented.pdf

BenWibking, Oct 01 '22

On 64 nodes on NCSA Delta, I get a 13% performance improvement with CUDA-aware MPI over host-pinned buffers on a hydro test problem. This is with OpenMPI+UCX for now. It will be interesting to see whether Cray MPI performance is significantly different.

BenWibking, Oct 01 '22

Which NCSA Delta partition was this? The 4 GPU A100 nodes, or a different partition?

kngott, Oct 03 '22

> Which NCSA Delta partition was this? The 4 GPU A100 nodes, or a different partition?

This was the A100x4 partition.

BenWibking, Oct 03 '22

👍 Do you also happen to know (or can you find out) how many NICs it has per node and can you confirm that's Slingshot 10?

kngott, Oct 03 '22

> 👍 Do you also happen to know (or can you find out) how many NICs it has per node and can you confirm that's Slingshot 10?

It has 1 NIC per node. I'll check whether it's Slingshot 10 or SS11; I'm not sure offhand.

BenWibking, Oct 03 '22

It's currently Slingshot 10.

BenWibking, Oct 03 '22

Makes sense, thanks!

So it sounds like the strongest possibilities are either affinity differences or that the OpenMPI+UCX implementation of CUDA-aware MPI is simply better. It would be really good to pin down the cause so AMReX can make informed decisions, and so we can pass this along to the NERSC and/or Illinois teams to start some adjustments and discussions.

OpenMPI+UCX currently doesn't exist on Perlmutter. Is there an MPICH implementation on NCSA Delta?

One other general thing: we should probably make sure we're testing comms with amrex.use_profiler_syncs = 1 (#2762). That option inserts a sync immediately before FillBoundary, ParallelCopy, and Redistribute so that the corresponding comm timers measure communication performance itself, rather than performance variations elsewhere that only get captured in the comm timers because of their internal sync points.
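
Conceptually the option does something like this around each comm call (a sketch, not the literal AMReX implementation):

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_GpuDevice.H>
#include <AMReX_Periodicity.H>
#include <AMReX_BLProfiler.H>

// Sketch of what amrex.use_profiler_syncs=1 accomplishes: drain outstanding
// GPU work before the comm call so the FillBoundary timer brackets
// communication only, not earlier asynchronous kernels.
void timed_fill_boundary (amrex::MultiFab& mf, const amrex::Periodicity& period)
{
    amrex::Gpu::streamSynchronize();      // the extra sync the option inserts
    {
        BL_PROFILE("FillBoundary_comm");  // timer now measures comm only
        mf.FillBoundary(period);
    }
}
```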

kngott, Oct 03 '22

> OpenMPI+UCX currently doesn't exist on Perlmutter. Is there an MPICH implementation on NCSA Delta?

No, not at the moment.

> One other general thing: we should probably make sure we're testing comms with amrex.use_profiler_syncs = 1 (#2762). That option inserts a sync immediately before FillBoundary, ParallelCopy, and Redistribute so that the corresponding comm timers measure communication performance itself, rather than performance variations elsewhere that only get captured in the comm timers because of their internal sync points.

For my case, I've been comparing the total cell updates for a hydro test problem, rather than looking at the comm time itself.

BenWibking, Oct 03 '22

For use_profiler_syncs=1: makes sense to me. Just a note for us for future testing.

For the MPICH: yeah, that tracks. Two systems, each with a different MPI implementation, getting different results. Couldn't be easy, could it? 😄

Thanks for all your work on this, Ben!

kngott, Oct 03 '22

I've obtained access and run tests on Crusher. I can share those over email.

BenWibking, Oct 06 '22

At least for current-generation GPU systems, it appears the root cause of the issue is cgroup isolation of GPUs on the same node, which prevents the use of CUDA/ROCm IPC: https://github.com/open-mpi/ompi/issues/11949.

@WeiqunZhang Since there is the communication arena now, does it make sense to enable GPU-aware MPI by default in AMReX?

BenWibking, Oct 01 '23

This issue is more complex than whether or not IPC is available, since many cases relevant to AMReX users will run on large configurations (100s to 1000s of servers), so it's a balance of effects between IPC and RDMA. The default should be determined by running benchmarks at both small scale (~10 servers) and large scale (~100-1000 servers) on the major GPU systems we care about and determining which is better on average.

maxpkatz, Oct 01 '23

Maybe we can add a build option that lets people change the default at compile time.
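
Roughly like this (the macro name is hypothetical, not an existing AMReX option), with the runtime parameter still taking precedence:

```cpp
#include <AMReX_ParmParse.H>

// Hypothetical sketch of a build-time default, overridable at runtime.
#ifndef AMREX_DEFAULT_GPU_AWARE_MPI
#define AMREX_DEFAULT_GPU_AWARE_MPI 0   // e.g. -DAMREX_DEFAULT_GPU_AWARE_MPI=1 at configure time
#endif

bool use_gpu_aware_mpi_default ()
{
    int flag = AMREX_DEFAULT_GPU_AWARE_MPI;  // compile-time default
    amrex::ParmParse pp("amrex");
    pp.query("use_gpu_aware_mpi", flag);     // amrex.use_gpu_aware_mpi still overrides
    return flag != 0;
}
```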

WeiqunZhang, Oct 06 '23