
Fix bug in modified locateParticles

Open srikrrish opened this issue 1 year ago • 2 comments

The modified version of locateParticles with nearest-neighbor search has a bug that leads to either a loss of charge conservation or segmentation faults in TestScatter and PenningTrap with load balancing, on both CPUs and GPUs.

srikrrish avatar Mar 26 '24 13:03 srikrrish

The last commit still doesn't fix the issue. Basically, I observe two issues when running srun ./PenningTrap 32 32 32 655360 400 FFT 0.01 LeapFrog -b 1.0 --info 5

  1. On OpenMP builds with 4 nodes, 16 tasks per node (64 MPI ranks) and 2 OMP threads, the code runs and produces output and timing files but doesn't terminate and has to be cancelled manually.
  2. On GPU builds with 16 nodes and 4 GPUs per node (again 64 MPI ranks/GPUs), there is a charge conservation error already on the first step:
Warning: Option '32' is not parsed by Ippl.
Warning: Option '32' is not parsed by Ippl.
Warning: Option '32' is not parsed by Ippl.
Warning: Option '655360' is not parsed by Ippl.
Warning: Option '400' is not parsed by Ippl.
Warning: Option 'FFT' is not parsed by Ippl.
Warning: Option '0.01' is not parsed by Ippl.
Warning: Option 'LeapFrog' is not parsed by Ippl.
Pre Run{0}> Discretization:
Pre Run{0}> nt 400 Np= 655360 grid = ( 32 , 32 , 32 )
Initialize Particles{0}> Starting first repartition
Initialize Particles{0}> particles created and initial conditions assigned
scatter {0}> 0
PenningTrap{0}> Starting iterations ...
Pre-step{0}> Done
scatter {0}> 0.0544169
scatter {0}> Time step: 0
scatter {0}> Total particles in the sim. 655360 after update: 655360
scatter {0}> Rel. error in charge conservation: 0.0544169
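
For reference, the "Rel. error in charge conservation" line compares the charge deposited on the grid by scatter with the total charge carried by the particles. A minimal sketch of such a check follows, assuming the per-cell charges are already gathered into one array. The name relChargeError and its signature are hypothetical and not taken from IPPL; in a distributed run the grid sum would be a global reduction over all ranks.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of IPPL: relative difference between the
// charge found on the grid after scatter and the total particle charge.
// cellCharges holds the charge deposited in each grid cell.
double relChargeError(const std::vector<double>& cellCharges, double totalCharge) {
    // The relative error is undefined for a neutral (zero-charge) system.
    assert(totalCharge != 0.0);
    const double gridCharge =
        std::accumulate(cellCharges.begin(), cellCharges.end(), 0.0);
    return std::fabs((totalCharge - gridCharge) / totalCharge);
}
```

With this definition, a value of 0.0544169 would mean that roughly 5% of the particle charge did not end up on the grid, which is what one would expect if particles were assigned to a rank that does not own their cell.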

srikrrish avatar May 08 '24 08:05 srikrrish

A few thoughts on 1: a) we get stuck in a destructor; can you add prints in Ippl::finalize()?

    void finalize() {
        Comm->deleteAllBuffers();
        Kokkos::finalize();
        // we must first delete the communicator and
        // afterwards the MPI environment
        Comm.reset(nullptr);
        Env.reset(nullptr);
    }

2: maybe the write timing has a problem.
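
One way to add such prints is to wrap each shutdown call in a small tracing helper, so that the last message without a matching "done" identifies the call that never returns. This is only a sketch: tracedStep, its parameters, and the message format are hypothetical and not part of IPPL.

```cpp
#include <functional>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper, not an IPPL function: announce a shutdown step,
// run it, then confirm completion. std::endl flushes the stream, so the
// "..." line is visible even if the step hangs. Pass a std::ostringstream
// instead of std::cerr to capture the messages in a test.
void tracedStep(std::ostream& out, int rank, const std::string& name,
                const std::function<void()>& step) {
    out << "[rank " << rank << "] finalize: " << name << " ..." << std::endl;
    step();
    out << "[rank " << rank << "] finalize: " << name << " done" << std::endl;
}
```

Inside finalize(), each of the four calls could then be wrapped, for example tracedStep(std::cerr, rank, "Kokkos::finalize", [] { Kokkos::finalize(); });, where rank is assumed to be the MPI rank obtained before shutdown starts.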

aaadelmann avatar May 08 '24 12:05 aaadelmann