Refactor Single-Precision Comms
To optimize communication across MPI ranks when running WarpX with the PSATD solver, data has recently been cast to single precision (`float`) before copying, which saves bandwidth. However, a temporary MultiFab is created first, which causes additional overhead.
Since AMReX #2708, `FillBoundary` allows for specialization with the buffer type. Depending on the global variable `WarpX::do_single_precision_comms`, the code should now call something like `FillBoundary<float>`. This mostly concerns the file `Source/ablastr/utils/Communication.cpp` (see also #3167).

(Thank you, @WeiqunZhang and @atmyers, for the clarification!)
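For context, here is a minimal sketch of the call pattern this enables (assuming an AMReX version that includes #2708; `mf` and `period` are placeholders for an existing `amrex::MultiFab` and `amrex::Periodicity`):

```cpp
// With amrex#2708, the MPI buffer type is a template parameter of
// FabArray::FillBoundary, so a double-precision MultiFab can exchange its
// ghost cells through float buffers without an explicit temporary copy:
mf.FillBoundary<float>(period);   // pack/unpack communication buffers as float
mf.FillBoundary(period);          // default: buffers in amrex::Real
```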
Replacing the first one is straightforward:
`FillBoundary` usage:

```cpp
void FillBoundary (amrex::MultiFab &mf, bool do_single_precision_comms, const amrex::Periodicity &period)
{
    BL_PROFILE("ablastr::utils::communication::FillBoundary");

    if (do_single_precision_comms)
    {
        // stage through a single-precision temporary: copy down, communicate, copy back
        amrex::FabArray<amrex::BaseFab<comm_float_type> > mf_tmp(mf.boxArray(),
                                                                 mf.DistributionMap(),
                                                                 mf.nComp(),
                                                                 mf.nGrowVect());
        mixedCopy(mf_tmp, mf, 0, 0, mf.nComp(), mf.nGrowVect());

        mf_tmp.FillBoundary(period);

        mixedCopy(mf, mf_tmp, 0, 0, mf.nComp(), mf.nGrowVect());
    }
    else
    {
        mf.FillBoundary(period);
    }
}
```
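For comparison, a minimal sketch of what this helper could look like once it uses the templated call directly (assuming `comm_float_type` and the ablastr profiling macro stay as they are; this is a sketch, not the merged implementation):

```cpp
void FillBoundary (amrex::MultiFab &mf, bool do_single_precision_comms, const amrex::Periodicity &period)
{
    BL_PROFILE("ablastr::utils::communication::FillBoundary");

    if (do_single_precision_comms) {
        // no temporary FabArray and no mixedCopy round trip: AMReX packs and
        // unpacks the MPI buffers as comm_float_type internally
        mf.FillBoundary<comm_float_type>(period);
    } else {
        mf.FillBoundary(period);
    }
}
```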
Replacing the second one is not as straightforward, since `FillBoundaryAndSync` does not support template specialization with the buffer type yet. The changes will affect AMReX as well.

`FillBoundaryAndSync` usage:
```cpp
void FillBoundary (amrex::MultiFab &mf,
                   amrex::IntVect ng,
                   bool do_single_precision_comms,
                   const amrex::Periodicity &period,
                   const bool nodal_sync)
{
    BL_PROFILE("ablastr::utils::communication::FillBoundary");

    if (do_single_precision_comms)
    {
        amrex::FabArray<amrex::BaseFab<comm_float_type> > mf_tmp(mf.boxArray(),
                                                                 mf.DistributionMap(),
                                                                 mf.nComp(),
                                                                 mf.nGrowVect());
        mixedCopy(mf_tmp, mf, 0, 0, mf.nComp(), mf.nGrowVect());

        if (nodal_sync) {
            mf_tmp.FillBoundaryAndSync(0, mf.nComp(), ng, period);
        } else {
            mf_tmp.FillBoundary(ng, period);
        }

        mixedCopy(mf, mf_tmp, 0, 0, mf.nComp(), mf.nGrowVect());
    }
    else
    {
        if (nodal_sync) {
            mf.FillBoundaryAndSync(0, mf.nComp(), ng, period);
        } else {
            mf.FillBoundary(ng, period);
        }
    }
}
```
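If AMReX gains the analogous template parameter, the single-precision branch above could collapse in the same way. A purely hypothetical sketch follows (a templated `FillBoundaryAndSync<BUF>` does not exist in AMReX yet; the name simply follows the `FillBoundary<BUF>` precedent from #2708):

```cpp
if (do_single_precision_comms) {
    if (nodal_sync) {
        // hypothetical: requires a BUF template parameter on FillBoundaryAndSync
        mf.FillBoundaryAndSync<comm_float_type>(0, mf.nComp(), ng, period);
    } else {
        // this overload is already templated since amrex#2708
        mf.FillBoundary<comm_float_type>(ng, period);
    }
}
```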
ToDo
- [x] make simple change in `FillBoundary` call in WarpX
- [ ] also expose in AMReX and use in WarpX:
  - [ ] `FillBoundaryAndSync`
  - [ ] `ParallelCopy`
  - [ ] `SumBoundary`
- [x] compile
- [x] find / set up bigger example where communication is demanding to see performance improvement
- [ ] PR for simple change
- [ ] make changes in AMReX for buffer type specialization of other comm utilities
- [ ] change calls in WarpX
- [ ] compile
- [ ] run & analyze performance
In two very tiny LWFA example simulations, the new version was actually slower, but they were probably too small to show anything of relevance.
```
(warpx) mgarten@perlmutter:login04:/pscratch/sd/m/mgarten/002_LWFA_comms_optimized> cat output.txt | grep --color Total
Total Time : 5.424853978
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
(warpx) mgarten@perlmutter:login04:/pscratch/sd/m/mgarten/002_LWFA_comms_optimized> cat ../001_LWFA/output.txt | grep --color Total
Total Time : 5.120690773
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
```
Do we also want/need this in `Source/BoundaryConditions/PML_RZ.cpp`?
The calls to `amrex::FillBoundary` in `PML_RZ.cpp` should be replaced with calls to `WarpXCommUtil::FillBoundary`. I can do this in a separate PR.
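For illustration, a hypothetical sketch of that kind of call-site change (the MultiFab name `pml_field` and the geometry object are placeholders, not the actual members of `PML_RZ`, and the helper's exact signature is assumed to mirror the AMReX call):

```cpp
// Before (sketch): plain AMReX call that ignores WarpX::do_single_precision_comms
pml_field.FillBoundary(geom.periodicity());

// After (sketch): route through the WarpX helper so single-precision buffers
// are used when requested
WarpXCommUtil::FillBoundary(pml_field, geom.periodicity());
```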
https://github.com/ECP-WarpX/WarpX/issues/3188#issuecomment-1158177482
> they were probably too small to show anything of relevance
Please scale up the example by using more resolution (cells). Aim for 256^3 cells per Perlmutter GPU.
This is the PR that added the `FillBoundary<float>` option to AMReX: https://github.com/AMReX-Codes/amrex/pull/2708

We should be able to update the other functions by making the analogous changes to them.
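A hypothetical sketch of what the analogous single-precision paths could look like once AMReX exposes the same `BUF` template parameter on the other utilities (none of these templated overloads exist yet; `dst`, `src`, and `mf` are placeholder MultiFabs):

```cpp
if (do_single_precision_comms) {
    // hypothetical templated overloads, modeled on FillBoundary<BUF>
    dst.ParallelCopy<float>(src, period);
    mf.SumBoundary<float>(period);
} else {
    dst.ParallelCopy(src, period);
    mf.SumBoundary(period);
}
```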
I ran the LWFA test case in a larger configuration of 512 x 256 x 2048 cells (2^28 cells in total, i.e. 256^3 cells on average for each of the 16 GPUs) and also removed all diagnostic output. Curiously, it was still slower than pre-#3190.
```
mgarten@perlmutter:login30:/pscratch/sd/m/mgarten/007_LWFA_upscaled_no_output> cat pre_PR3190/output.txt | grep Total
Total Time : 574.9063498
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
mgarten@perlmutter:login30:/pscratch/sd/m/mgarten/007_LWFA_upscaled_no_output> cat post_PR3190/output.txt | grep Total
Total Time : 600.3614205
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
```