Beam Diagnostics is too Slow

Open ax3l opened this issue 3 months ago • 8 comments

The diagnostics code in reduced_beam_characteristics(pc) is too slow. In 1-MPI-rank simulations like the HTU beamline, with sim.particle_container().store_beam_moments = True set, it dominates the runtime: it takes ~1.5x as long as the next most expensive part of the actual simulation (impactx::Push::ChrQuad).

TinyProfiler total time across processes [min...avg...max]: 0.02604 ... 0.02604 ... 0.02604

-------------------------------------------------------------------------------------------------------
Name                                                    NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
-------------------------------------------------------------------------------------------------------
impactx::diagnostics::reduced_beam_characteristics(pc)      91    0.01197    0.01197    0.01197  45.96%
impactx::Push::ChrQuad                                      34   0.007997   0.007997   0.007997  30.71%
impactx::Push::ExactDrift                                   33   0.001654   0.001654   0.001654   6.35%
impactx::Push::ExactSbend                                    5  0.0004234  0.0004234  0.0004234   1.63%
impactX::collect_lost_particles                             91  0.0003877  0.0003877  0.0003877   1.49%
ImpactX::evolve::slice_step                                 91  0.0003815  0.0003815  0.0003815   1.47%
ImpactX::add_particles                                       1  0.0003395  0.0003395  0.0003395   1.30%
impactx::Push::Kicker                                        8  0.0002024  0.0002024  0.0002024   0.78%
ImpactXParticleContainer::record_beam_moments               91  0.0001794  0.0001794  0.0001794   0.69%
DistributionMapping::LeastUsedCPUs()                         1  0.0001495  0.0001495  0.0001495   0.57%
ImpactX::track_particles                                     1   3.08e-05   3.08e-05   3.08e-05   0.12%
impactx::Push                                               91  1.807e-05  1.807e-05  1.807e-05   0.07%
AmrMesh::MakeDistributionMap()                               1  7.808e-06  7.808e-06  7.808e-06   0.03%
DistributionMapping::SFCProcessorMapDoIt()                   1  2.937e-06  2.937e-06  2.937e-06   0.01%
Other                                                      357  0.0001655  0.0001655  0.0001655   0.64%
-------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------
Name                                                    NCalls  Incl. Min  Incl. Avg  Incl. Max   Max %
-------------------------------------------------------------------------------------------------------
ImpactX::track_particles                                     1    0.02335    0.02335    0.02335  89.69%
ImpactX::evolve::slice_step                                 91    0.02331    0.02331    0.02331  89.52%
ImpactXParticleContainer::record_beam_moments               91    0.01215    0.01215    0.01215  46.65%
impactx::diagnostics::reduced_beam_characteristics(pc)      91    0.01197    0.01197    0.01197  45.96%
impactx::Push                                               91     0.0103     0.0103     0.0103  39.56%
impactx::Push::ChrQuad                                      34   0.007999   0.007999   0.007999  30.72%
impactx::Push::ExactDrift                                   33   0.001656   0.001656   0.001656   6.36%
impactx::Push::ExactSbend                                    5  0.0004239  0.0004239  0.0004239   1.63%
ImpactX::add_particles                                       1  0.0003912  0.0003912  0.0003912   1.50%
impactX::collect_lost_particles                             91  0.0003877  0.0003877  0.0003877   1.49%
impactx::Push::Kicker                                        8   0.000203   0.000203   0.000203   0.78%
AmrMesh::MakeDistributionMap()                               1  0.0001608  0.0001608  0.0001608   0.62%
DistributionMapping::SFCProcessorMapDoIt()                   1   0.000153   0.000153   0.000153   0.59%
DistributionMapping::LeastUsedCPUs()                         1  0.0001495  0.0001495  0.0001495   0.57%
Other                                                      357  0.0001655  0.0001655  0.0001655   0.64%
-------------------------------------------------------------------------------------------------------

I think that amrex::ParticleReduce is OpenMP parallelized over particle tiles, but maybe it is not working or can be optimized?
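For reference, a minimal sketch of what tile-level parallelism implies, assuming the reduction loop is threaded over tiles rather than over individual particles. `Tile` and `sum_over_tiles` are hypothetical stand-ins for illustration, not AMReX types or functions.

```cpp
#include <vector>

// Hypothetical stand-in for a particle tile: just a list of positions.
using Tile = std::vector<double>;

// Sum all positions, parallelizing over tiles (not over particles).
// Each thread reduces whole tiles; OpenMP combines the per-thread totals.
double sum_over_tiles (std::vector<Tile> const& tiles)
{
    double total = 0.0;
#pragma omp parallel for reduction(+:total)
    for (int t = 0; t < static_cast<int>(tiles.size()); ++t)
    {
        double local = 0.0;
        for (double x : tiles[t]) { local += x; }
        total += local;
    }
    return total;
}
```

With a single tile, only one thread gets work no matter how many threads are available, so it may be worth checking how many tiles the 1-rank run actually has.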

Additionally, can some operations that are not auto-vectorized be vectorized on CPU?

Or do we simply calculate/reduce too many variables (currently: two reductions over all Np particles, the 2nd one on 22 variables), so that we need a more fine-grained, opt-in approach, as we already have for the (costly) eigenemittances?

ax3l avatar Aug 18 '25 04:08 ax3l

Reproducer: rbc_costly_reproducer.tar.gz

ax3l avatar Aug 18 '25 05:08 ax3l

Could you try

$ git diff Src/Particle/AMReX_ParticleReduce.H
diff --git a/Src/Particle/AMReX_ParticleReduce.H b/Src/Particle/AMReX_ParticleReduce.H
index 50002e2932..f8f16ed7b3 100644
--- a/Src/Particle/AMReX_ParticleReduce.H
+++ b/Src/Particle/AMReX_ParticleReduce.H
@@ -1248,7 +1248,7 @@ ParticleReduce (PC const& pc, int lev_min, int lev_max, F const& f, ReduceOps& r
             ptile_ptrs.push_back(&(kv.second));
         }
 #if !defined(AMREX_USE_GPU) && defined(AMREX_USE_OMP)
-#pragma omp parallel for
+#pragma omp parallel
 #endif  
         for (int pmap_it = 0; pmap_it < static_cast<int>(ptile_ptrs.size()); ++pmap_it)
         {

WeiqunZhang avatar Aug 18 '25 17:08 WeiqunZhang

Oh, don't do that.
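The thread does not say why the suggestion was withdrawn. One plausible reason, stated here as an assumption: a plain `#pragma omp parallel` in front of a loop does no worksharing, so every thread executes all iterations. A minimal sketch with hypothetical counting functions:

```cpp
// Count loop-body executions when the loop sits in a plain parallel region.
// Without a worksharing "for", every thread runs all n iterations.
int count_iterations_parallel_only (int n)
{
    int count = 0;
#pragma omp parallel
    for (int i = 0; i < n; ++i)
    {
#pragma omp atomic
        ++count;
    }
    return count;
}

// Same loop with worksharing: the n iterations are divided among the threads.
int count_iterations_parallel_for (int n)
{
    int count = 0;
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
#pragma omp atomic
        ++count;
    }
    return count;
}
```

Compiled with OpenMP and T threads, the first function returns n * T and the second returns n. Applied to a reduction loop, the first form would repeat the work on every thread instead of splitting it.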

WeiqunZhang avatar Aug 18 '25 17:08 WeiqunZhang

@atmyers Any reasons that we are not using Particle iterator?

It could be something like

#pragma omp parallel
for (int lev ..)
{
    for (ParIter...)
}

WeiqunZhang avatar Aug 18 '25 17:08 WeiqunZhang

Anyway, using ParIter will not change performance that much.

I guess it's too many reduction variables.

WeiqunZhang avatar Aug 18 '25 17:08 WeiqunZhang

I think we did not use ParIter because of our custom way to init particles in ImpactX: https://github.com/AMReX-Codes/amrex/pull/2695 . I wonder whether this can be overhauled again now that #862 is in, or whether it has to stay the same?

ax3l avatar Aug 18 '25 18:08 ax3l

> @atmyers Any reasons that we are not using Particle iterator?
>
> It could be something like
>
> #pragma omp parallel
> for (int lev ..)
> {
>     for (ParIter...)
> }

We did it this way to support a pattern used in ImpactX: https://github.com/AMReX-Codes/amrex/pull/2695

atmyers avatar Aug 19 '25 16:08 atmyers

Could we add a comment there?

WeiqunZhang avatar Aug 19 '25 16:08 WeiqunZhang