Refactor Single-Precision Comms
To optimize communication across MPI ranks when running WarpX with the PSATD solver, data has recently been cast to single precision (`float`) before copying, which saves bandwidth. However, a temporary MultiFab is created first, which causes additional overhead.
Since AMReX #2708, `FillBoundary` allows for specialization with the buffer type. Depending on the global variable `WarpX::do_single_precision_comms`, the code should now call something like `FillBoundary<float>`. This mostly concerns the file `Source/ablastr/utils/Communication.cpp` (see also #3167).

(Thank you, @WeiqunZhang and @atmyers, for the clarification!)
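For context, here is a minimal sketch of the call pattern this enables (assuming an AMReX version that includes #2708; `mf` and `period` are placeholders for an existing `amrex::MultiFab` and `amrex::Periodicity`):

```cpp
// With amrex#2708, the MPI buffer type is a template parameter of
// FabArray::FillBoundary, so a double-precision MultiFab can exchange its
// ghost cells through float buffers without an explicit temporary copy:
mf.FillBoundary<float>(period);   // pack/unpack communication buffers as float
mf.FillBoundary(period);          // default: buffers in amrex::Real
```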
Replacing the first one is straightforward:
`FillBoundary` usage:

```cpp
void FillBoundary (amrex::MultiFab &mf, bool do_single_precision_comms, const amrex::Periodicity &period)
{
    BL_PROFILE("ablastr::utils::communication::FillBoundary");

    if (do_single_precision_comms)
    {
        // stage through a single-precision temporary: copy down, communicate, copy back
        amrex::FabArray<amrex::BaseFab<comm_float_type> > mf_tmp(mf.boxArray(),
                                                                 mf.DistributionMap(),
                                                                 mf.nComp(),
                                                                 mf.nGrowVect());
        mixedCopy(mf_tmp, mf, 0, 0, mf.nComp(), mf.nGrowVect());

        mf_tmp.FillBoundary(period);

        mixedCopy(mf, mf_tmp, 0, 0, mf.nComp(), mf.nGrowVect());
    }
    else
    {
        mf.FillBoundary(period);
    }
}
```
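For comparison, a minimal sketch of what this helper could look like once it uses the templated call directly (assuming `comm_float_type` and the ablastr profiling macro stay as they are; this is a sketch, not the merged implementation):

```cpp
void FillBoundary (amrex::MultiFab &mf, bool do_single_precision_comms, const amrex::Periodicity &period)
{
    BL_PROFILE("ablastr::utils::communication::FillBoundary");

    if (do_single_precision_comms) {
        // no temporary FabArray and no mixedCopy round trip: AMReX packs and
        // unpacks the MPI buffers as comm_float_type internally
        mf.FillBoundary<comm_float_type>(period);
    } else {
        mf.FillBoundary(period);
    }
}
```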
Replacing the second one is not as straightforward, since `FillBoundaryAndSync` does not support template specialization with the buffer type yet. The changes will affect AMReX as well.

`FillBoundaryAndSync` usage:
```cpp
void FillBoundary (amrex::MultiFab &mf,
                   amrex::IntVect ng,
                   bool do_single_precision_comms,
                   const amrex::Periodicity &period,
                   const bool nodal_sync)
{
    BL_PROFILE("ablastr::utils::communication::FillBoundary");

    if (do_single_precision_comms)
    {
        amrex::FabArray<amrex::BaseFab<comm_float_type> > mf_tmp(mf.boxArray(),
                                                                 mf.DistributionMap(),
                                                                 mf.nComp(),
                                                                 mf.nGrowVect());
        mixedCopy(mf_tmp, mf, 0, 0, mf.nComp(), mf.nGrowVect());

        if (nodal_sync) {
            mf_tmp.FillBoundaryAndSync(0, mf.nComp(), ng, period);
        } else {
            mf_tmp.FillBoundary(ng, period);
        }

        mixedCopy(mf, mf_tmp, 0, 0, mf.nComp(), mf.nGrowVect());
    }
    else
    {
        if (nodal_sync) {
            mf.FillBoundaryAndSync(0, mf.nComp(), ng, period);
        } else {
            mf.FillBoundary(ng, period);
        }
    }
}
```
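If AMReX gains the analogous template parameter, the single-precision branch above could collapse in the same way. A purely hypothetical sketch follows (a templated `FillBoundaryAndSync<BUF>` does not exist in AMReX yet; the name simply follows the `FillBoundary<BUF>` precedent from #2708):

```cpp
if (do_single_precision_comms) {
    if (nodal_sync) {
        // hypothetical: requires a BUF template parameter on FillBoundaryAndSync
        mf.FillBoundaryAndSync<comm_float_type>(0, mf.nComp(), ng, period);
    } else {
        // this overload is already templated since amrex#2708
        mf.FillBoundary<comm_float_type>(ng, period);
    }
}
```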
ToDo
- [x] make simple change in `FillBoundary` call in WarpX
- [ ] also expose in AMReX and use in WarpX:
  - [ ] `FillBoundaryAndSync`
  - [ ] `ParallelCopy`
  - [ ] `SumBoundary`
- [x] compile
- [x] find / set up bigger example where communication is demanding to see performance improvement
- [ ] PR for simple change
- [ ] make changes in AMReX for buffer type specialization of other comm utilities
- [ ] change calls in WarpX
- [ ] compile
- [ ] run & analyze performance
In two very tiny LWFA example simulations, the new version was actually slower, but they were probably too small to show anything of relevance.
```
(warpx) mgarten@perlmutter:login04:/pscratch/sd/m/mgarten/002_LWFA_comms_optimized> cat output.txt | grep --color Total
Total Time : 5.424853978
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
(warpx) mgarten@perlmutter:login04:/pscratch/sd/m/mgarten/002_LWFA_comms_optimized> cat ../001_LWFA/output.txt | grep --color Total
Total Time : 5.120690773
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
```
Do we also want/need this in `Source/BoundaryConditions/PML_RZ.cpp`?
The calls to `amrex::FillBoundary` in `PML_RZ.cpp` should be replaced with calls to `WarpXCommUtil::FillBoundary`. I can do this in a separate PR.
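For illustration, a hypothetical sketch of that kind of call-site change (the MultiFab name `pml_field` and the geometry object are placeholders, not the actual members of `PML_RZ`, and the helper's exact signature is assumed to mirror the AMReX call):

```cpp
// Before (sketch): plain AMReX call that ignores WarpX::do_single_precision_comms
pml_field.FillBoundary(geom.periodicity());

// After (sketch): route through the WarpX helper so single-precision buffers
// are used when requested
WarpXCommUtil::FillBoundary(pml_field, geom.periodicity());
```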
https://github.com/ECP-WarpX/WarpX/issues/3188#issuecomment-1158177482
> they were probably too small to show anything of relevance
Please scale up the example by using more resolution (cells). Aim for 256^3 cells per Perlmutter GPU.
This is the PR that added the `FillBoundary<float>` option to AMReX: https://github.com/AMReX-Codes/amrex/pull/2708

We should be able to update the other functions by making the analogous changes to them.
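A hypothetical sketch of what the analogous single-precision paths could look like once AMReX exposes the same `BUF` template parameter on the other utilities (none of these templated overloads exist yet; `dst`, `src`, and `mf` are placeholder MultiFabs):

```cpp
if (do_single_precision_comms) {
    // hypothetical templated overloads, modeled on FillBoundary<BUF>
    dst.ParallelCopy<float>(src, period);
    mf.SumBoundary<float>(period);
} else {
    dst.ParallelCopy(src, period);
    mf.SumBoundary(period);
}
```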
I ran the LWFA test case in a larger configuration of 512 x 256 x 2048 cells (2^28 cells in total, i.e. 256^3 cells on average for each of the 16 GPUs) and also removed all diagnostic output. Curiously, it was still slower than pre-#3190.
```
mgarten@perlmutter:login30:/pscratch/sd/m/mgarten/007_LWFA_upscaled_no_output> cat pre_PR3190/output.txt | grep Total
Total Time : 574.9063498
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
mgarten@perlmutter:login30:/pscratch/sd/m/mgarten/007_LWFA_upscaled_no_output> cat post_PR3190/output.txt | grep Total
Total Time : 600.3614205
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
```