Beam Diagnostics is too Slow
The diagnostics code in `reduced_beam_characteristics(pc)` is too slow. In 1-MPI-rank simulations such as the HTU beamline, with `sim.particle_container().store_beam_moments = True` set, it dominates the runtime, costing ~1.5x as much as the next most expensive element of the actual simulation.
```
TinyProfiler total time across processes [min...avg...max]: 0.02604 ... 0.02604 ... 0.02604

-------------------------------------------------------------------------------------------------------
Name                                                     NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
-------------------------------------------------------------------------------------------------------
impactx::diagnostics::reduced_beam_characteristics(pc)      91    0.01197    0.01197    0.01197  45.96%
impactx::Push::ChrQuad                                      34   0.007997   0.007997   0.007997  30.71%
impactx::Push::ExactDrift                                   33   0.001654   0.001654   0.001654   6.35%
impactx::Push::ExactSbend                                    5  0.0004234  0.0004234  0.0004234   1.63%
impactX::collect_lost_particles                             91  0.0003877  0.0003877  0.0003877   1.49%
ImpactX::evolve::slice_step                                 91  0.0003815  0.0003815  0.0003815   1.47%
ImpactX::add_particles                                       1  0.0003395  0.0003395  0.0003395   1.30%
impactx::Push::Kicker                                        8  0.0002024  0.0002024  0.0002024   0.78%
ImpactXParticleContainer::record_beam_moments               91  0.0001794  0.0001794  0.0001794   0.69%
DistributionMapping::LeastUsedCPUs()                         1  0.0001495  0.0001495  0.0001495   0.57%
ImpactX::track_particles                                     1   3.08e-05   3.08e-05   3.08e-05   0.12%
impactx::Push                                               91  1.807e-05  1.807e-05  1.807e-05   0.07%
AmrMesh::MakeDistributionMap()                               1  7.808e-06  7.808e-06  7.808e-06   0.03%
DistributionMapping::SFCProcessorMapDoIt()                   1  2.937e-06  2.937e-06  2.937e-06   0.01%
Other                                                      357  0.0001655  0.0001655  0.0001655   0.64%
-------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------
Name                                                     NCalls  Incl. Min  Incl. Avg  Incl. Max   Max %
-------------------------------------------------------------------------------------------------------
ImpactX::track_particles                                     1    0.02335    0.02335    0.02335  89.69%
ImpactX::evolve::slice_step                                 91    0.02331    0.02331    0.02331  89.52%
ImpactXParticleContainer::record_beam_moments               91    0.01215    0.01215    0.01215  46.65%
impactx::diagnostics::reduced_beam_characteristics(pc)      91    0.01197    0.01197    0.01197  45.96%
impactx::Push                                               91     0.0103     0.0103     0.0103  39.56%
impactx::Push::ChrQuad                                      34   0.007999   0.007999   0.007999  30.72%
impactx::Push::ExactDrift                                   33   0.001656   0.001656   0.001656   6.36%
impactx::Push::ExactSbend                                    5  0.0004239  0.0004239  0.0004239   1.63%
ImpactX::add_particles                                       1  0.0003912  0.0003912  0.0003912   1.50%
impactX::collect_lost_particles                             91  0.0003877  0.0003877  0.0003877   1.49%
impactx::Push::Kicker                                        8   0.000203   0.000203   0.000203   0.78%
AmrMesh::MakeDistributionMap()                               1  0.0001608  0.0001608  0.0001608   0.62%
DistributionMapping::SFCProcessorMapDoIt()                   1   0.000153   0.000153   0.000153   0.59%
DistributionMapping::LeastUsedCPUs()                         1  0.0001495  0.0001495  0.0001495   0.57%
Other                                                      357  0.0001655  0.0001655  0.0001655   0.64%
-------------------------------------------------------------------------------------------------------
```
I think that `amrex::ParticleReduce` is OpenMP-parallelized over particle tiles, but maybe that parallelization is not taking effect here, or it can be optimized?
Additionally, can some operations that are not auto-vectorized be explicitly vectorized on CPU?
Or do we simply calculate/reduce far too many variables (currently: two full-Np reductions, the second one over 22 variables) and need to introduce a more fine-grained approach, as we already do for optionally calculating the (costly) eigenemittances?
Reproducer: rbc_costly_reproducer.tar.gz
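For orientation, the hot path is essentially one `amrex::ParticleReduce` call that returns a large `GpuTuple`. A minimal sketch of that pattern, cut down to two sums over an illustrative quantity (`PC`, the function name, and the reduced quantities are placeholders, not the actual ImpactX code, which reduces 22 quantities in one such pass):

```cpp
#include <AMReX_ParticleReduce.H>
#include <AMReX_Reduce.H>

// Sketch of a multi-variable particle reduction. PC stands for the particle
// container type; the real diagnostics code accumulates many more terms.
template <typename PC>
void example_moment_reduce (PC const& pc)
{
    using PType = typename PC::SuperParticleType;

    // one ReduceOp per reduced quantity; the diagnostics code carries ~22
    amrex::ReduceOps<amrex::ReduceOpSum, amrex::ReduceOpSum> reduce_ops;

    auto r = amrex::ParticleReduce<
        amrex::ReduceData<amrex::ParticleReal, amrex::ParticleReal>>(
        pc,
        [=] AMREX_GPU_DEVICE (PType const& p) noexcept
            -> amrex::GpuTuple<amrex::ParticleReal, amrex::ParticleReal>
        {
            amrex::ParticleReal const x = p.pos(0);
            return {x, x*x};  // e.g. sums for <x> and <x^2>
        },
        reduce_ops);

    amrex::ParticleReal const sum_x  = amrex::get<0>(r);
    amrex::ParticleReal const sum_xx = amrex::get<1>(r);
    amrex::ignore_unused(sum_x, sum_xx);
}
```

Whether one wide reduction like this is cheaper than several narrow, optionally skipped ones is exactly the open question above.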
Could you try

```diff
$ git diff Src/Particle/AMReX_ParticleReduce.H
diff --git a/Src/Particle/AMReX_ParticleReduce.H b/Src/Particle/AMReX_ParticleReduce.H
index 50002e2932..f8f16ed7b3 100644
--- a/Src/Particle/AMReX_ParticleReduce.H
+++ b/Src/Particle/AMReX_ParticleReduce.H
@@ -1248,7 +1248,7 @@ ParticleReduce (PC const& pc, int lev_min, int lev_max, F const& f, ReduceOps& r
         ptile_ptrs.push_back(&(kv.second));
     }
 #if !defined(AMREX_USE_GPU) && defined(AMREX_USE_OMP)
-#pragma omp parallel for
+#pragma omp parallel
 #endif
     for (int pmap_it = 0; pmap_it < static_cast<int>(ptile_ptrs.size()); ++pmap_it)
     {
```
Oh, don't do that. A bare `#pragma omp parallel` (without `for`) is not a worksharing construct: every thread would execute the whole loop instead of splitting the iterations.
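For context, a standalone toy program (not from the thread) showing the difference between the two pragmas; build with `-fopenmp`:

```cpp
#include <cstdio>
#include <omp.h>

int main ()
{
    // "parallel for": the iteration space is divided among the threads
#pragma omp parallel for
    for (int i = 0; i < 4; ++i) {
        std::printf("parallel for: thread %d got i=%d\n", omp_get_thread_num(), i);
    }

    // bare "parallel": the loop is not a worksharing construct, so every
    // thread executes all four iterations -- in ParticleReduce the per-tile
    // reduction work would be duplicated rather than split
#pragma omp parallel
    for (int i = 0; i < 4; ++i) {
        std::printf("bare parallel: thread %d ran i=%d\n", omp_get_thread_num(), i);
    }
    return 0;
}
```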
@atmyers Any reason that we are not using the particle iterator?
It could be something like

```cpp
#pragma omp parallel
for (int lev ..)
{
    for (ParIter...)
}
```

Anyway, using `ParIter` will not change performance that much.
I guess it's too many reduction variables.
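Fleshed out, that suggestion would follow the standard AMReX tiling idiom, where each `ParIter` instance inside an `omp parallel` region only visits the tiles assigned to the calling thread. A sketch under assumptions (a SoA container; `PC` and `MyParIter` are placeholders for the concrete container and iterator types, component index 0 is illustrative, and only one sum is reduced):

```cpp
#include <AMReX_Particles.H>

// Sketch: reduce one sum with the ParIter pattern. Call as
// sum_x_with_pariter<MyPC, MyParIter>(pc).
template <typename PC, typename MyParIter>
amrex::ParticleReal sum_x_with_pariter (PC& pc)
{
    amrex::ParticleReal sum_x = 0.;
#ifdef AMREX_USE_OMP
#pragma omp parallel reduction(+:sum_x)
#endif
    for (int lev = 0; lev <= pc.finestLevel(); ++lev) {
        // within the parallel region, each iterator instance only visits
        // the grids/tiles owned by the calling thread (MFIter semantics)
        for (MyParIter pti(pc, lev); pti.isValid(); ++pti) {
            auto& soa = pti.GetStructOfArrays();
            auto const* AMREX_RESTRICT x = soa.GetRealData(0).data(); // index 0 illustrative
            int const np = pti.numParticles();
            for (int i = 0; i < np; ++i) {
                sum_x += x[i];
            }
        }
    }
    return sum_x;
}
```

As noted above, though, the cost here is probably dominated by the number of reduction variables rather than by the iteration scheme.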
I think we did not use `ParIter` because of our custom way of initializing particles in ImpactX: https://github.com/AMReX-Codes/amrex/pull/2695. I wonder if this can be overhauled again now that #862 is in (or if it stays the same)?
> @atmyers Any reason that we are not using the particle iterator?
> It could be something like
>
> ```cpp
> #pragma omp parallel
> for (int lev ..)
> {
>     for (ParIter...)
> }
> ```

We did it this way to support a pattern used in ImpactX: https://github.com/AMReX-Codes/amrex/pull/2695
Could we add a comment there?
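Something like this, perhaps (hypothetical wording; placement next to the loop mirrors the diff above):

```cpp
#if !defined(AMREX_USE_GPU) && defined(AMREX_USE_OMP)
// Deliberately a worksharing loop over the flattened tile map rather than
// ParIter, so that containers filled outside the usual iterator pattern
// (e.g. ImpactX's custom particle initialization, AMReX-Codes/amrex#2695)
// are still covered.
#pragma omp parallel for
#endif
for (int pmap_it = 0; pmap_it < static_cast<int>(ptile_ptrs.size()); ++pmap_it)
```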