Optimization of Comms for Single Box

Open ax3l opened this issue 4 months ago • 3 comments

For simple simulations with one process and one box and no mesh refinement (MR), ImpactX can be optimized to spend less time in "communication" routines such as particle redistribution.

This is important, because:

  • we have many workflows that can run with one process (e.g., 1 GPU or 1 process + OMP threads)
  • many people will benchmark ImpactX against legacy codes that do not use MPI.

Here is a reproducer that shows where we spend too much time.

Configure & Build

As of 25.07, with all paths relative to the repository's root directory:

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=OFF
cmake --build build -j 12

Run w/o MPI

A typical case is a space-charge simulation w/o MR, e.g., this input: expanding_drift_fft.txt

./build/bin/impactx expanding_drift_fft.txt
Initializing AMReX (25.07-32-gaf07c6f1d7b8)...
OMP initialized with 14 OMP threads
AMReX (25.07-32-gaf07c6f1d7b8) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

...
--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
ParticleContainer::RedistributeCPU()                               41      6.391      6.391      6.391  19.70%
ImpactX::add_particles                                              1      5.006      5.006      5.006  15.43%
impactx::spacecharge::GatherAndPush                                40      4.898      4.898      4.898  15.10%
ImpactXParticleContainer::DepositCharge                            40       2.72       2.72       2.72   8.38%
ablastr::particles::deposit_charge::ChargeDeposition               84      2.442      2.442      2.442   7.53%
impactx::transformation::CoordinateTransformation                  80      1.794      1.794      1.794   5.53%
...

In this simulation, we use a mesh with a constant number of cells $N_{x,y,z}$, but resize its extent (or, equivalently, the cell size $d_{x,y,z}$) so that the particles always fit inside it. Consequently, particles will never be removed or marked as invalid during redistribute. If helpful, we could also pass this guarantee to AMReX through the redistribute API.
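
To make the guarantee above concrete, here is a minimal Python sketch of the resizing idea. It is not ImpactX's `ResizeMesh` implementation: the function names, the 5 % padding, and the 8×8×8 cell count are assumptions chosen only for illustration. The cell count stays fixed, the extent follows the particle bounding box, and the cell size $d$ is derived from the two.

```python
def resize_mesh(positions, n_cells, padding=0.05):
    """Fit a mesh with a fixed number of cells around all particles.

    Returns (lo, hi, d): lower corner, upper corner and cell size per
    dimension. The padding fraction is a hypothetical choice.
    """
    lo, hi, d = [], [], []
    for k in range(len(n_cells)):
        p_min = min(p[k] for p in positions)
        p_max = max(p[k] for p in positions)
        span = p_max - p_min
        # Pad the bounding box so no particle sits exactly on the edge.
        margin = padding * span if span > 0 else padding
        lo.append(p_min - margin)
        hi.append(p_max + margin)
        # N stays constant, so the cell size d scales with the extent.
        d.append((hi[k] - lo[k]) / n_cells[k])
    return lo, hi, d


def all_inside(positions, lo, hi):
    """True if every particle lies inside the mesh extent."""
    return all(lo[k] <= p[k] <= hi[k]
               for p in positions for k in range(len(lo)))


n_cells = (8, 8, 8)  # illustrative only, not the cell count of the run above
beam = [(-1.0, -0.5, 0.0), (1.0, 0.5, 2.0), (0.2, -0.1, 1.0)]
lo, hi, d = resize_mesh(beam, n_cells)

# The beam expands transversely; the mesh is resized before redistribute.
expanded = [(2 * x, 2 * y, z) for (x, y, z) in beam]
lo2, hi2, d2 = resize_mesh(expanded, n_cells)

print(all_inside(expanded, lo2, hi2), d[0], d2[0])
```

Because the mesh is refitted before every redistribute, the "is this particle still in the domain" test can never fail in this setup, which is the property that could in principle be communicated to the redistribute call.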

Run w/ 1 MPI Rank

This would be a second step and is less urgent.

In quick tests, the performance drops a little further if MPI is enabled (but not "used" with more than one process): in the profile below, RedistributeCPU() takes 7.096 s instead of 6.391 s. This hints that we might need to add a runtime check for whether the box array has size 1:

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=ON
cmake --build build -j 12

./build/bin/impactx expanding_drift_fft.txt
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
OMP initialized with 14 OMP threads
AMReX (25.07-32-gaf07c6f1d7b8) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

...
--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
ParticleContainer::RedistributeCPU()                               41      7.096      7.096      7.096  21.27%
ImpactX::add_particles                                              1      4.951      4.951      4.951  14.84%
impactx::spacecharge::GatherAndPush                                40      4.827      4.827      4.827  14.46%
ImpactXParticleContainer::DepositCharge                            40       2.64       2.64       2.64   7.91%
ablastr::particles::deposit_charge::ChargeDeposition               88      2.543      2.543      2.543   7.62%
impactx::Push::Drift                                               40      1.739      1.739      1.739   5.21%
impactx::transformation::CoordinateTransformation                  80      1.702      1.702      1.702   5.10%

...

ax3l, Jul 30 '25 15:07

Test on my laptop's GPU:

Run w/o MPI

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=OFF -DImpactX_COMPUTE=CUDA -DImpactX_PRECISION=SINGLE
cmake --build build -j 12
Initializing AMReX (25.07-37-g08f25e1f7ccb-dirty)...
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (25.07-37-g08f25e1f7ccb-dirty) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

Beam kinetic energy (MeV): 250
Bunch charge (C): 9.999999717e-10
Particle type: electron
Number of particles: 30000000
...

TinyProfiler total time across processes [min...avg...max]: 5.555 ... 5.555 ... 5.555

--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
ImpactXParticleContainer::DepositCharge                            40      3.368      3.368      3.368  60.64%
impactx::transformation::CoordinateTransformation                  80      0.771      0.771      0.771  13.88%
impactx::Push::Drift                                               40     0.3451     0.3451     0.3451   6.21%
impactx::spacecharge::GatherAndPush                                40      0.313      0.313      0.313   5.63%
Redistribute_partition                                             41     0.2518     0.2518     0.2518   4.53%
FFT::R2C                                                            1      0.179      0.179      0.179   3.22%
ImpactXParticleContainer::MinAndMaxPositions                       41    0.09687    0.09687    0.09687   1.74%
impactX::collect_lost_particles                                    40    0.05956    0.05956    0.05956   1.07%
impactx::particles::wakefields::HandleSpacecharge                  40    0.05742    0.05742    0.05742   1.03%
ImpactX::add_particles                                              1     0.0532     0.0532     0.0532   0.96%
...

Not an issue there.

Run w/ 1 MPI Rank

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=ON -DImpactX_COMPUTE=CUDA -DImpactX_PRECISION=SINGLE
cmake --build build -j 12
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (25.07-37-gbcd47d18ca36) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

Beam kinetic energy (MeV): 250
Bunch charge (C): 9.999999717e-10
Particle type: electron
Number of particles: 30000000

...

TinyProfiler total time across processes [min...avg...max]: 20.05 ... 20.05 ... 20.05

--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
impactx::transformation::CoordinateTransformation                  80      5.708      5.708      5.708  28.47%
ImpactXParticleContainer::DepositCharge                            40      5.465      5.465      5.465  27.26%
impactx::Push::Drift                                               40      2.595      2.595      2.595  12.94%
impactx::spacecharge::GatherAndPush                                40      2.117      2.117      2.117  10.56%
Redistribute_partition                                             41       2.04       2.04       2.04  10.18%
ImpactXParticleContainer::MinAndMaxPositions                       41     0.7351     0.7351     0.7351   3.67%
impactX::collect_lost_particles                                    40     0.4658     0.4658     0.4658   2.32%
impactx::particles::wakefields::HandleSpacecharge                  40     0.4539     0.4539     0.4539   2.26%
impactx::diagnostics::reduced_beam_characteristics(pc)              2     0.1802     0.1802     0.1802   0.90%
ImpactX::add_particles                                              1     0.1262     0.1262     0.1262   0.63%
ImpactX::AddNParticles                                              1    0.09601    0.09601    0.09601   0.48%
ImpactX::initBeamDistributionFromInputs                             1    0.01139    0.01139    0.01139   0.06%
ImpactX::ResizeMesh                                                41   0.001637   0.001637   0.001637   0.01%
ParticleContainer::RedistributeGPU()                               41   0.001079   0.001079   0.001079   0.01%
...

ax3l, Jul 30 '25 19:07

To do: check why impactx::transformation::CoordinateTransformation and Drift take so much longer on GPU when MPI is compiled in...
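
As a starting point for that check, the small Python sketch below ranks the per-function slowdown between the two GPU profiles. The exclusive times are copied from the TinyProfiler tables above; the helper name `slowdown` is made up for this example. Note that the two runs report different AMReX commits, so the comparison is indicative only.

```python
# Exclusive times in seconds, copied from the two GPU profiles above.
no_mpi = {
    "ImpactXParticleContainer::DepositCharge": 3.368,
    "impactx::transformation::CoordinateTransformation": 0.771,
    "impactx::Push::Drift": 0.3451,
    "impactx::spacecharge::GatherAndPush": 0.313,
    "Redistribute_partition": 0.2518,
    "ImpactXParticleContainer::MinAndMaxPositions": 0.09687,
    "impactX::collect_lost_particles": 0.05956,
    "impactx::particles::wakefields::HandleSpacecharge": 0.05742,
    "ImpactX::add_particles": 0.0532,
}
one_rank = {
    "ImpactXParticleContainer::DepositCharge": 5.465,
    "impactx::transformation::CoordinateTransformation": 5.708,
    "impactx::Push::Drift": 2.595,
    "impactx::spacecharge::GatherAndPush": 2.117,
    "Redistribute_partition": 2.04,
    "ImpactXParticleContainer::MinAndMaxPositions": 0.7351,
    "impactX::collect_lost_particles": 0.4658,
    "impactx::particles::wakefields::HandleSpacecharge": 0.4539,
    "ImpactX::add_particles": 0.1262,
}


def slowdown(before, after):
    """Return (name, after/before) for shared entries, largest ratio first."""
    shared = before.keys() & after.keys()
    ratios = [(name, after[name] / before[name]) for name in shared]
    return sorted(ratios, key=lambda item: item[1], reverse=True)


ranked = slowdown(no_mpi, one_rank)
for name, ratio in ranked:
    print(f"{ratio:5.2f}x  {name}")

# Total run time reported by TinyProfiler: 5.555 s without MPI, 20.05 s with.
print(f"{20.05 / 5.555:5.2f}x  total")
```

With these numbers, most of the per-particle routines slow down by a similar factor of roughly 7 to 8, not only CoordinateTransformation and Drift, while DepositCharge grows by only about 1.6x.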

ax3l, Jul 30 '25 20:07

To explain a bit: when tiling is on, RedistributeCPU still needs to sort the particles onto the right tiles in the box, so it's not a no-op. But we could try to push on optimizing this function.
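
A rough illustration of why this is real work even for a single box: the Python sketch below bins particles onto tiles with a stable counting sort. This is not AMReX's implementation or data layout; the function names, the 2D box, and the tile sizes are assumptions made for the example.

```python
def tile_index(pos, lo, tile_size, tiles_per_dim):
    """Flattened (row-major) index of the tile that contains pos."""
    idx = 0
    for k in range(len(lo)):
        i = int((pos[k] - lo[k]) // tile_size[k])
        # Clamp so particles on the upper edge stay in the last tile.
        i = min(max(i, 0), tiles_per_dim[k] - 1)
        idx = idx * tiles_per_dim[k] + i
    return idx


def sort_onto_tiles(positions, lo, tile_size, tiles_per_dim):
    """Stable counting sort of particles by tile index.

    Returns the reordered particles plus the start offset and particle
    count of every tile. Each particle is visited a fixed number of
    times, so the cost grows linearly with the particle count.
    """
    n_tiles = 1
    for n in tiles_per_dim:
        n_tiles *= n

    keys = [tile_index(p, lo, tile_size, tiles_per_dim) for p in positions]

    counts = [0] * n_tiles
    for key in keys:
        counts[key] += 1

    # Exclusive prefix sum: where each tile's particles start.
    offsets = [0] * n_tiles
    for t in range(1, n_tiles):
        offsets[t] = offsets[t - 1] + counts[t - 1]

    out = [None] * len(positions)
    cursor = list(offsets)
    for p, key in zip(positions, keys):
        out[cursor[key]] = p
        cursor[key] += 1
    return out, offsets, counts


# One box [0, 4) x [0, 4) split into 2 x 2 tiles of size 2 x 2.
lo = (0.0, 0.0)
tile_size = (2.0, 2.0)
tiles_per_dim = (2, 2)
particles = [(3.5, 0.5), (0.5, 0.5), (0.5, 3.0), (3.0, 3.0), (1.0, 1.0)]

ordered, offsets, counts = sort_onto_tiles(particles, lo, tile_size, tiles_per_dim)
print(ordered, offsets, counts)
```

Every push can move particles across tile boundaries, so this binning has to be redone even when no particle leaves the box or the process.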

atmyers, Jul 31 '25 17:07