Optimization of Comms for Single Box
For simple simulations with one process and one box, without MR, ImpactX can be optimized to spend less time in "communication" routines such as particle redistribution.
This is important, because:
- we have many workflows that can run with one process (e.g., 1 GPU or 1 process + OMP threads)
- many people will benchmark ImpactX against legacy codes that are not MPI-parallel.
Here is a reproducer in which we spend too much time in particle redistribution.
Configure & Build
As of 25.07, all relative to the repository's root directory:
cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=OFF
cmake --build build -j 12
Run w/o MPI
A typical case is a space-charge simulation w/o MR, e.g., this input: expanding_drift_fft.txt
./build/bin/impactx expanding_drift_fft.txt
Initializing AMReX (25.07-32-gaf07c6f1d7b8)...
OMP initialized with 14 OMP threads
AMReX (25.07-32-gaf07c6f1d7b8) initialized
Grids Summary:
Level 0 1 grids 5120 cells 100 % of domain
...
--------------------------------------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
--------------------------------------------------------------------------------------------------------------
ParticleContainer::RedistributeCPU() 41 6.391 6.391 6.391 19.70%
ImpactX::add_particles 1 5.006 5.006 5.006 15.43%
impactx::spacecharge::GatherAndPush 40 4.898 4.898 4.898 15.10%
ImpactXParticleContainer::DepositCharge 40 2.72 2.72 2.72 8.38%
ablastr::particles::deposit_charge::ChargeDeposition 84 2.442 2.442 2.442 7.53%
impactx::transformation::CoordinateTransformation 80 1.794 1.794 1.794 5.53%
...
In this simulation, we use a mesh with a constant number of cells $N_{x,y,z}$, but resize the extent (and thus the cell size $d_{x,y,z}$) so that it always contains all particles. Consequently, particles will never be removed or marked as invalid during redistribute. If helpful, we could also pass this guarantee to AMReX through the redistribute API.
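As a rough illustration of the resizing described above, the following Python sketch fits a one-dimensional extent to a set of particle positions at a fixed cell count and checks that every particle lands in a valid cell. It is not ImpactX or AMReX code: the names fit_extent and cell_index and the small relative padding rel_pad are assumptions made for this example.

```python
import math

def fit_extent(coords, n_cells, rel_pad=1e-6):
    """Fit [lo, hi] around coords for a fixed number of cells.

    rel_pad is a hypothetical safety margin so that the outermost
    particle does not sit exactly on the upper mesh boundary.
    """
    lo, hi = min(coords), max(coords)
    span = hi - lo
    pad = rel_pad * span if span > 0.0 else rel_pad
    lo, hi = lo - pad, hi + pad
    d = (hi - lo) / n_cells  # the cell size changes, the cell count does not
    return lo, hi, d

def cell_index(x, lo, d):
    """Index of the cell that contains position x."""
    return math.floor((x - lo) / d)

# Three example particle positions along one axis (made-up values).
z = [-1.0, 0.2, 3.0]
lo, hi, d = fit_extent(z, n_cells=8)
indices = [cell_index(x, lo, d) for x in z]
```

Because the extent is rebuilt from the particle positions themselves, every index falls within 0 to n_cells - 1, which is the guarantee referred to above.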
Run w/ 1 MPI Rank
This would be a second step and is less urgent.
In quick tests, the performance drops a little further if MPI is enabled (but not "used" with more than one process), which hints that we might need to add a runtime check for the size of the box array being 1:
cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=ON
cmake --build build -j 12
./build/bin/impactx expanding_drift_fft.txt
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
OMP initialized with 14 OMP threads
AMReX (25.07-32-gaf07c6f1d7b8) initialized
Grids Summary:
Level 0 1 grids 5120 cells 100 % of domain
...
--------------------------------------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
--------------------------------------------------------------------------------------------------------------
ParticleContainer::RedistributeCPU() 41 7.096 7.096 7.096 21.27%
ImpactX::add_particles 1 4.951 4.951 4.951 14.84%
impactx::spacecharge::GatherAndPush 40 4.827 4.827 4.827 14.46%
ImpactXParticleContainer::DepositCharge 40 2.64 2.64 2.64 7.91%
ablastr::particles::deposit_charge::ChargeDeposition 88 2.543 2.543 2.543 7.62%
impactx::Push::Drift 40 1.739 1.739 1.739 5.21%
impactx::transformation::CoordinateTransformation 80 1.702 1.702 1.702 5.10%
...
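The runtime check suggested above could be sketched as follows. This is plain Python for illustration only, not the AMReX API: Layout, needs_communication and redistribute are hypothetical names, and checking the number of ranks in addition to the box array size is an extra assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Layout:
    n_boxes: int  # size of the box array (hypothetical field)
    n_ranks: int  # number of MPI processes, 1 if MPI is off or unused

def needs_communication(layout):
    """True if particles could have to move to another box or rank."""
    return layout.n_boxes > 1 or layout.n_ranks > 1

def redistribute(particles, layout, full_path, local_path):
    """Dispatch at run time between the general and the single-box path.

    full_path and local_path are caller-supplied callables; the local
    path may still have to reorder particles within the box.
    """
    if needs_communication(layout):
        return full_path(particles)
    return local_path(particles)

# Example: a single box on a single rank takes the local path.
chosen = redistribute([1, 2, 3], Layout(n_boxes=1, n_ranks=1),
                      full_path=lambda p: "full",
                      local_path=lambda p: "local")
```

The point of doing this at run time rather than at compile time is that a binary built with MPI enabled would still take the cheaper path when started with one process.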
Test on my laptop's GPU:
Run w/o MPI
cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=OFF -DImpactX_COMPUTE=CUDA -DImpactX_PRECISION=SINGLE
cmake --build build -j 12
Initializing AMReX (25.07-37-g08f25e1f7ccb-dirty)...
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (25.07-37-g08f25e1f7ccb-dirty) initialized
Grids Summary:
Level 0 1 grids 5120 cells 100 % of domain
Beam kinetic energy (MeV): 250
Bunch charge (C): 9.999999717e-10
Particle type: electron
Number of particles: 30000000
...
TinyProfiler total time across processes [min...avg...max]: 5.555 ... 5.555 ... 5.555
--------------------------------------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
--------------------------------------------------------------------------------------------------------------
ImpactXParticleContainer::DepositCharge 40 3.368 3.368 3.368 60.64%
impactx::transformation::CoordinateTransformation 80 0.771 0.771 0.771 13.88%
impactx::Push::Drift 40 0.3451 0.3451 0.3451 6.21%
impactx::spacecharge::GatherAndPush 40 0.313 0.313 0.313 5.63%
Redistribute_partition 41 0.2518 0.2518 0.2518 4.53%
FFT::R2C 1 0.179 0.179 0.179 3.22%
ImpactXParticleContainer::MinAndMaxPositions 41 0.09687 0.09687 0.09687 1.74%
impactX::collect_lost_particles 40 0.05956 0.05956 0.05956 1.07%
impactx::particles::wakefields::HandleSpacecharge 40 0.05742 0.05742 0.05742 1.03%
ImpactX::add_particles 1 0.0532 0.0532 0.0532 0.96%
...
Not an issue there.
Run w/ 1 MPI Rank
cmake --fresh -S . -B build -DImpactX_FFT=ON -DImpactX_MPI=ON -DImpactX_COMPUTE=CUDA -DImpactX_PRECISION=SINGLE
cmake --build build -j 12
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
Initializing CUDA...
CUDA initialized with 1 device.
AMReX (25.07-37-gbcd47d18ca36) initialized
Grids Summary:
Level 0 1 grids 5120 cells 100 % of domain
Beam kinetic energy (MeV): 250
Bunch charge (C): 9.999999717e-10
Particle type: electron
Number of particles: 30000000
...
TinyProfiler total time across processes [min...avg...max]: 20.05 ... 20.05 ... 20.05
--------------------------------------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
--------------------------------------------------------------------------------------------------------------
impactx::transformation::CoordinateTransformation 80 5.708 5.708 5.708 28.47%
ImpactXParticleContainer::DepositCharge 40 5.465 5.465 5.465 27.26%
impactx::Push::Drift 40 2.595 2.595 2.595 12.94%
impactx::spacecharge::GatherAndPush 40 2.117 2.117 2.117 10.56%
Redistribute_partition 41 2.04 2.04 2.04 10.18%
ImpactXParticleContainer::MinAndMaxPositions 41 0.7351 0.7351 0.7351 3.67%
impactX::collect_lost_particles 40 0.4658 0.4658 0.4658 2.32%
impactx::particles::wakefields::HandleSpacecharge 40 0.4539 0.4539 0.4539 2.26%
impactx::diagnostics::reduced_beam_characteristics(pc) 2 0.1802 0.1802 0.1802 0.90%
ImpactX::add_particles 1 0.1262 0.1262 0.1262 0.63%
ImpactX::AddNParticles 1 0.09601 0.09601 0.09601 0.48%
ImpactX::initBeamDistributionFromInputs 1 0.01139 0.01139 0.01139 0.06%
ImpactX::ResizeMesh 41 0.001637 0.001637 0.001637 0.01%
ParticleContainer::RedistributeGPU() 41 0.001079 0.001079 0.001079 0.01%
...
To do: check why impactx::transformation::CoordinateTransformation and impactx::Push::Drift take so much longer on GPU when MPI is compiled in.
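To put numbers on this, the short Python snippet below divides the exclusive times of the 1-rank MPI GPU run by those of the GPU run without MPI. The values are copied from the two GPU profiles above; the snippet itself is only a convenience for the arithmetic.

```python
# Exclusive times in seconds, copied from the two GPU profiles above.
no_mpi = {
    "impactx::transformation::CoordinateTransformation": 0.771,
    "ImpactXParticleContainer::DepositCharge": 3.368,
    "impactx::Push::Drift": 0.3451,
    "impactx::spacecharge::GatherAndPush": 0.313,
    "Redistribute_partition": 0.2518,
}
one_rank = {
    "impactx::transformation::CoordinateTransformation": 5.708,
    "ImpactXParticleContainer::DepositCharge": 5.465,
    "impactx::Push::Drift": 2.595,
    "impactx::spacecharge::GatherAndPush": 2.117,
    "Redistribute_partition": 2.04,
}

# Slowdown factor per routine: time with 1 MPI rank / time without MPI.
slowdown = {name: one_rank[name] / no_mpi[name] for name in no_mpi}

# Total TinyProfiler run time of both runs.
total_slowdown = 20.05 / 5.555

for name, factor in sorted(slowdown.items(), key=lambda kv: -kv[1]):
    print(f"{name:52s} {factor:5.2f}x")
print(f"{'total':52s} {total_slowdown:5.2f}x")
```

With these numbers, CoordinateTransformation and Drift slow down by roughly a factor of 7.4 to 7.5, GatherAndPush and Redistribute_partition by a similar amount, DepositCharge by only about 1.6, and the total run time by about 3.6. This suggests the slowdown is not specific to the two routines named above.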
To explain a bit: when tiling is on, RedistributeCPU still needs to sort the particles onto the right tiles in the box, so it is not a no-op. But we could try to optimize this function further.
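To make the remaining work concrete, here is a minimal Python sketch of sorting particles onto tiles inside a single box with a counting sort. It is one-dimensional for brevity and is not how AMReX implements RedistributeCPU; bin_by_tile, tile_size and the example cell indices are assumptions made for illustration.

```python
def bin_by_tile(cells, tile_size, n_cells):
    """Group particle indices by tile using a counting sort.

    cells[i] is the cell index of particle i inside the single box.
    Returns (offsets, perm): the particles of tile t are
    perm[offsets[t]:offsets[t + 1]], in their original relative order.
    """
    n_tiles = -(-n_cells // tile_size)  # ceiling division

    # Pass 1: count the particles per tile.
    counts = [0] * n_tiles
    for c in cells:
        counts[c // tile_size] += 1

    # Exclusive prefix sum gives the start offset of each tile.
    offsets = [0] * (n_tiles + 1)
    for t in range(n_tiles):
        offsets[t + 1] = offsets[t] + counts[t]

    # Pass 2: scatter the particle indices into their tile's slot.
    cursor = offsets[:-1].copy()
    perm = [0] * len(cells)
    for i, c in enumerate(cells):
        t = c // tile_size
        perm[cursor[t]] = i
        cursor[t] += 1
    return offsets, perm

# Six particles in a box of 10 cells, split into tiles of 4 cells.
offsets, perm = bin_by_tile([5, 0, 9, 3, 4, 8], tile_size=4, n_cells=10)
```

Even with one box and no communication, this costs two passes over all particles per call, which is why the function does not reduce to a no-op when tiling is on.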