impactx Optimization of Comms for Single Box

For simple simulations with a single process and a single box, without mesh refinement (MR), ImpactX can be optimized to spend less time in "communication" routines such as particle redistribution.

This is important because:

we have many workflows that can run with one process (e.g., 1 GPU or 1 process + OMP threads)
many people will benchmark ImpactX against legacy codes that are implemented without MPI.

Here is a reproducer in which we spend too much time in these routines.

Configure & Build

As of 25.07, all relative to the repository's root directory:

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=OFF
cmake --build build -j 12

Run w/o MPI

A typical case is a space-charge simulation w/o MR, e.g., this input: expanding_drift_fft.txt

./build/bin/impactx expanding_drift_fft.txt

Initializing AMReX (25.07-32-gaf07c6f1d7b8)...
OMP initialized with 14 OMP threads
AMReX (25.07-32-gaf07c6f1d7b8) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

...
--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
ParticleContainer::RedistributeCPU()                               41      6.391      6.391      6.391  19.70%
ImpactX::add_particles                                              1      5.006      5.006      5.006  15.43%
impactx::spacecharge::GatherAndPush                                40      4.898      4.898      4.898  15.10%
ImpactXParticleContainer::DepositCharge                            40       2.72       2.72       2.72   8.38%
ablastr::particles::deposit_charge::ChargeDeposition               84      2.442      2.442      2.442   7.53%
impactx::transformation::CoordinateTransformation                  80      1.794      1.794      1.794   5.53%
...

In this simulation, we use a mesh with a constant number of cells $N_{x,y,z}$, but resize the extent (or $d_{x,y,z}$) so that the particles always fit in it. Consequently, particles will never be removed/marked as invalid during redistribute. If helpful, we could also pass this guarantee to AMReX through the redistribute API.
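
To make the "particles always fit" guarantee concrete, here is a minimal 1D sketch of resizing the extent at a fixed cell count. The names `Mesh1D`, `fit_mesh`, `cell_index`, `all_inside` and the relative margin `pad` are assumptions for illustration only, not ImpactX or AMReX API, and the actual padding used by ImpactX may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical 1D mesh description: fixed cell count, variable extent.
struct Mesh1D {
    double lo;  // lower edge of the domain
    double dx;  // cell size
    int    n;   // number of cells (kept constant)
};

// Resize the extent so that every particle lies strictly inside the mesh.
// `pad` is an assumed relative margin, not ImpactX's actual setting.
inline Mesh1D fit_mesh(std::vector<double> const& x, int n, double pad = 0.05)
{
    assert(!x.empty() && n > 0 && pad > 0.0);
    auto const mm = std::minmax_element(x.begin(), x.end());
    double const width = *mm.second - *mm.first;
    assert(width > 0.0);  // a degenerate beam would need special handling
    double const lo = *mm.first  - pad * width;
    double const hi = *mm.second + pad * width;
    return Mesh1D{lo, (hi - lo) / n, n};
}

// Cell index of a particle; valid indices are 0 .. n-1.
inline int cell_index(Mesh1D const& m, double x)
{
    return static_cast<int>(std::floor((x - m.lo) / m.dx));
}

// True if no particle would fall outside the domain (and be invalidated).
inline bool all_inside(Mesh1D const& m, std::vector<double> const& x)
{
    return std::all_of(x.begin(), x.end(), [&m](double xi) {
        int const i = cell_index(m, xi);
        return i >= 0 && i < m.n;
    });
}
```

With a positive margin, the minimum and maximum particle positions land in the first and last cell, so the "is this particle still in the domain?" test in redistribute can never fail for a mesh built this way.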

Run w/ 1 MPI Rank

This would be a second step and is less urgent.

In quick tests, the performance drops a little further if MPI is enabled but run with only one process (RedistributeCPU() takes 7.096 s instead of 6.391 s in the profile below), which hints that we might need to add a runtime check for the size of the box array being 1:

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=ON
cmake --build build -j 12

./build/bin/impactx expanding_drift_fft.txt

MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
OMP initialized with 14 OMP threads
AMReX (25.07-32-gaf07c6f1d7b8) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

...
--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
ParticleContainer::RedistributeCPU()                               41      7.096      7.096      7.096  21.27%
ImpactX::add_particles                                              1      4.951      4.951      4.951  14.84%
impactx::spacecharge::GatherAndPush                                40      4.827      4.827      4.827  14.46%
ImpactXParticleContainer::DepositCharge                            40       2.64       2.64       2.64   7.91%
ablastr::particles::deposit_charge::ChargeDeposition               88      2.543      2.543      2.543   7.62%
impactx::Push::Drift                                               40      1.739      1.739      1.739   5.21%
impactx::transformation::CoordinateTransformation                  80      1.702      1.702      1.702   5.10%

...
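
A rough sketch of what such a runtime check could decide. Everything here (`Decomposition`, `RedistributeWork`, `required_work`) is hypothetical and only mirrors the reasoning in this issue; it is not how AMReX structures its redistribute, where this information would come from the box array, the distribution mapping and the parallel descriptor.

```cpp
#include <cassert>

// Hypothetical summary of the domain decomposition on one level.
struct Decomposition {
    int  num_boxes;  // size of the box array
    int  num_ranks;  // number of MPI processes
    bool tiling;     // are particles stored per tile inside a box?
};

// How much work a redistribute call would still have to do.
enum class RedistributeWork { None, LocalTileSortOnly, Full };

// `particles_stay_in_domain` is the guarantee discussed above: the mesh is
// resized so that no particle can leave it.
inline RedistributeWork required_work(Decomposition const& d,
                                      bool particles_stay_in_domain)
{
    assert(d.num_boxes >= 1 && d.num_ranks >= 1);
    bool const single_box = (d.num_boxes == 1 && d.num_ranks == 1);
    if (!single_box || !particles_stay_in_domain) {
        return RedistributeWork::Full;  // general path, incl. communication
    }
    // One box on one rank: nothing can change owner. With tiling, particles
    // may still have to be moved between tiles inside the box.
    return d.tiling ? RedistributeWork::LocalTileSortOnly
                    : RedistributeWork::None;
}
```

The point of the sketch is only that the check is cheap and can be made once per call, before any communication buffers or MPI calls are set up.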

Jul 30 '25 15:07 ax3l

Test on my laptop's GPU:

Run w/o MPI

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=OFF -DImpactX_COMPUTE=CUDA -DImpactX_PRECISION=SINGLE
cmake --build build -j 12

Initializing AMReX (25.07-37-g08f25e1f7ccb-dirty)...
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (25.07-37-g08f25e1f7ccb-dirty) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

Beam kinetic energy (MeV): 250
Bunch charge (C): 9.999999717e-10
Particle type: electron
Number of particles: 30000000
...

TinyProfiler total time across processes [min...avg...max]: 5.555 ... 5.555 ... 5.555

--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
ImpactXParticleContainer::DepositCharge                            40      3.368      3.368      3.368  60.64%
impactx::transformation::CoordinateTransformation                  80      0.771      0.771      0.771  13.88%
impactx::Push::Drift                                               40     0.3451     0.3451     0.3451   6.21%
impactx::spacecharge::GatherAndPush                                40      0.313      0.313      0.313   5.63%
Redistribute_partition                                             41     0.2518     0.2518     0.2518   4.53%
FFT::R2C                                                            1      0.179      0.179      0.179   3.22%
ImpactXParticleContainer::MinAndMaxPositions                       41    0.09687    0.09687    0.09687   1.74%
impactX::collect_lost_particles                                    40    0.05956    0.05956    0.05956   1.07%
impactx::particles::wakefields::HandleSpacecharge                  40    0.05742    0.05742    0.05742   1.03%
ImpactX::add_particles                                              1     0.0532     0.0532     0.0532   0.96%
...

Redistribution is not an issue there (Redistribute_partition: 4.53%).

Run w/ 1 MPI Rank

cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=ON -DImpactX_COMPUTE=CUDA -DImpactX_PRECISION=SINGLE
cmake --build build -j 12

MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (25.07-37-gbcd47d18ca36) initialized

Grids Summary:
  Level 0   1 grids  5120 cells  100 % of domain

Beam kinetic energy (MeV): 250
Bunch charge (C): 9.999999717e-10
Particle type: electron
Number of particles: 30000000

...

TinyProfiler total time across processes [min...avg...max]: 20.05 ... 20.05 ... 20.05

--------------------------------------------------------------------------------------------------------------
Name                                                           NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------------------
impactx::transformation::CoordinateTransformation                  80      5.708      5.708      5.708  28.47%
ImpactXParticleContainer::DepositCharge                            40      5.465      5.465      5.465  27.26%
impactx::Push::Drift                                               40      2.595      2.595      2.595  12.94%
impactx::spacecharge::GatherAndPush                                40      2.117      2.117      2.117  10.56%
Redistribute_partition                                             41       2.04       2.04       2.04  10.18%
ImpactXParticleContainer::MinAndMaxPositions                       41     0.7351     0.7351     0.7351   3.67%
impactX::collect_lost_particles                                    40     0.4658     0.4658     0.4658   2.32%
impactx::particles::wakefields::HandleSpacecharge                  40     0.4539     0.4539     0.4539   2.26%
impactx::diagnostics::reduced_beam_characteristics(pc)              2     0.1802     0.1802     0.1802   0.90%
ImpactX::add_particles                                              1     0.1262     0.1262     0.1262   0.63%
ImpactX::AddNParticles                                              1    0.09601    0.09601    0.09601   0.48%
ImpactX::initBeamDistributionFromInputs                             1    0.01139    0.01139    0.01139   0.06%
ImpactX::ResizeMesh                                                41   0.001637   0.001637   0.001637   0.01%
ParticleContainer::RedistributeGPU()                               41   0.001079   0.001079   0.001079   0.01%
...

Jul 30 '25 19:07 ax3l

To do: check why impactx::transformation::CoordinateTransformation and impactx::Push::Drift take so much longer on GPU when MPI is compiled in (5.708 s vs. 0.771 s and 2.595 s vs. 0.3451 s in the profiles above).

Jul 30 '25 20:07 ax3l

To explain a bit: when tiling is on, RedistributeCPU still needs to sort the particles onto the right tiles in the box, so it is not a no-op. But we could try to push on optimizing this function.

Jul 31 '25 17:07 atmyers
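
For readers unfamiliar with the tile sort mentioned in the last comment, here is a minimal 1D illustration of grouping particles by tile with a counting sort. `sort_by_tile` and its parameters are made up for this sketch; AMReX's actual RedistributeCPU is organized differently and handles much more (multiple dimensions, levels, invalid particles).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a permutation that groups particle indices by tile.
// `cell` holds each particle's cell index in [0, n_cells); a tile is a
// contiguous run of `tile_size` cells. On return, `offsets[t]` is the start
// of tile t's range in the permutation (size: number of tiles + 1).
inline std::vector<std::size_t> sort_by_tile(std::vector<int> const& cell,
                                             int n_cells, int tile_size,
                                             std::vector<std::size_t>& offsets)
{
    assert(n_cells > 0 && tile_size > 0);
    int const n_tiles = (n_cells + tile_size - 1) / tile_size;

    // Pass 1: count the particles in each tile.
    offsets.assign(static_cast<std::size_t>(n_tiles) + 1, 0);
    for (int c : cell) {
        assert(c >= 0 && c < n_cells);
        ++offsets[c / tile_size + 1];
    }
    // Prefix sum turns the counts into start offsets.
    for (int t = 0; t < n_tiles; ++t) { offsets[t + 1] += offsets[t]; }

    // Pass 2: scatter particle indices into their tile's range (stable).
    std::vector<std::size_t> perm(cell.size());
    std::vector<std::size_t> cursor(offsets.begin(), offsets.end() - 1);
    for (std::size_t p = 0; p < cell.size(); ++p) {
        perm[cursor[cell[p] / tile_size]++] = p;
    }
    return perm;
}
```

Even in this stripped-down form, every call touches every particle twice, which is consistent with the redistribute cost not vanishing for a single box when tiling is enabled.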
