WarpX
BackTransformDiagnostics: position offset does not have a consistent number of macroparticles
When running this input file (inputs.txt), it seems that the metadata of the BackTransformed diagnostics is inconsistent. For instance, running the following Python code:
from openpmd_viewer import OpenPMDTimeSeries
ts = OpenPMDTimeSeries('./diags/diag_btd/')
ts.get_particle(['x'], iteration=3)
results in the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-4c6f54023500> in <module>
----> 1 ts.get_particle(['x'], iteration=3)
~/miniconda3/lib/python3.9/site-packages/openpmd_viewer/openpmd_timeseries/main.py in get_particle(self, var_list, species, t, iteration, select, plot, nbins, plot_range, use_field_mesh, histogram_deposition, **kw)
272 data_list = []
273 for quantity in var_list:
--> 274 data_list.append( self.data_reader.read_species_data(
275 iteration, species, quantity, self.extensions))
276 # Apply selection if needed
~/miniconda3/lib/python3.9/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/data_reader.py in read_species_data(self, iteration, species, record_comp, extensions)
281 filename, iteration, species, record_comp, extensions )
282 elif self.backend == 'openpmd-api':
--> 283 return io_reader.read_species_data(
284 self.series, iteration, species, record_comp, extensions )
285
~/miniconda3/lib/python3.9/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/io_reader/particle_reader.py in read_species_data(series, iteration, species_name, component_name, extensions)
87 if component_name in ['x', 'y', 'z']:
88 offset = get_data(series, species['positionOffset'][component_name])
---> 89 data += offset
90 # - Return momentum in normalized units
91 elif component_name in ['ux', 'uy', 'uz' ]:
ValueError: operands could not be broadcast together with shapes (41886,) (41894,) (41886,)
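For reference, the same mismatch can be pinpointed directly with openpmd-api. The snippet below is a minimal sketch; the species name beam and the file-based ADIOS2 path pattern are assumptions, so adjust them to match the actual inputs:

```python
# Minimal sketch (assumptions: openpmd-api installed, species named "beam",
# file-based ADIOS2 output; adjust the pattern and names for your setup).
import openpmd_api as io

series = io.Series("./diags/diag_btd/openpmd_%T.bp", io.Access.read_only)
beam = series.iterations[3].particles["beam"]

# These two shapes are the operands of the broadcast error above:
print("position/x shape:      ", beam["position"]["x"].shape)
print("positionOffset/x shape:", beam["positionOffset"]["x"].shape)
```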
I still see this issue running WarpX on Summit, even with the new fix:
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 17:18:21) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from openpmd_viewer.addons import LpaDiagnostics
>>> ts = LpaDiagnostics('diags/diag_lab')
>>> ii = 20
>>> x, z, uz = ts.get_particle(species='beam', iteration=ts.iterations[ii], var_list = ['x', 'z','uz'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/ccs/home/rsandberg/.conda/envs/opmd_movie/lib/python3.10/site-packages/openpmd_viewer/openpmd_timeseries/main.py", line 274, in get_particle
data_list.append( self.data_reader.read_species_data(
File "/ccs/home/rsandberg/.conda/envs/opmd_movie/lib/python3.10/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/data_reader.py", line 283, in read_species_data
return io_reader.read_species_data(
File "/ccs/home/rsandberg/.conda/envs/opmd_movie/lib/python3.10/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/io_reader/particle_reader.py", line 89, in read_species_data
data += offset
ValueError: operands could not be broadcast together with shapes (49989,) (50000,) (49989,)
I rebuilt with the latest branch as of today, 9/23/2022. The input file is attached: inputs.txt
Thanks for reporting this @RTSandberg.
I just tried the above script with a recent version of WarpX (git hash 6febc63b4d58) on ~~6~~ 4 GPUs and was not able to reproduce the bug. (In other words, the code runs fine and I am able to run the above Python post-processing code without any error.)
~~Could you double check this issue by re-running it with the latest version of WarpX?~~
Nevermind, I was actually able to reproduce this bug when running on 6 GPUs.
I took a closer look at this issue (by adding lots of Print statements!), and the results are perplexing:
- It seems that openPMD does call resetDataset with size 50000 (not 49989) when it last touches iteration 20, based on adding the following Print statement
amrex::Print() << "SetupPos: size " << np << std::endl;
here: https://github.com/ECP-WarpX/WarpX/blob/development/Source/Diagnostics/WarpXOpenPMD.cpp#L985 and
amrex::Print() << "resetDataset " << comp << std::endl;
here: https://github.com/ECP-WarpX/WarpX/blob/development/Source/Diagnostics/WarpXOpenPMD.cpp#L992
- It also seems that we do call storeChunk immediately afterwards.
Let's try amrex::AllPrint() and see if all ranks reach the location of resetDataset and also call it with the same values.
Thanks @ax3l !
I just checked: all MPI ranks are indeed calling resetDataset with the right size (50000). I am still confused as to why the array does not seem to have the right size when reading it.
Note that this error still occurs even when turning off load-balancing by commenting out this line:
#algo.load_balance_intervals = 200
It seems that when I do:
diag_lab.fields_to_plot = none
then the problem disappears.
However, I am a bit confused: why would turning the fields off fix the particle metadata?
Thank you for this hint and test. I wonder whether, when we call Redistribute, the bounds set by the fields multifab are removing some particles. I will test this and report here.
@RevathiJambunathan That's a good suggestion, but it does not seem that there is any issue with the Redistribute: as described above, adding Print statements shows that we are making the right openPMD calls with the right sizes.
It seems that the problem is at a lower level...
I have another input deck that sees this issue running on 1 node, 6 GPUs on Summit:
ts = LpaDiagnostics('./diags/diag_btd/')
ts.get_particle(species='beam',var_list=['w','x'],iteration=ts.iterations[1])
Output:
[array([0.12483018, 0.12483018, 0.12483018, ..., 0.12483018, 0.12483018, 0.12483018]), array([ 6.60118937e-07, -3.75790632e-07, 6.18016430e-07, ..., 2.38995954e-07, -2.38995954e-07, -2.38995954e-07])]
ts.get_particle(species='beam',var_list=['w','x'],iteration=ts.iterations[2])
Output:
...
File ~/src/openPMD-viewer/openpmd_viewer/openpmd_timeseries/data_reader/io_reader/particle_reader.py:89, in read_species_data(series, iteration, species_name, component_name, extensions)
87 if component_name in ['x', 'y', 'z']:
88 offset = get_data(series, species['positionOffset'][component_name])
---> 89 data += offset
90 # - Return momentum in normalized units
91 elif component_name in ['ux', 'uy', 'uz' ]:
ValueError: operands could not be broadcast together with shapes (49992,) (50000,) (49992,)
On one of the production simulations, I just noticed that using:
<diag>.openpmd_backend = h5
circumvents the problem.
@RTSandberg Are you able to reproduce this observation with the above input deck?
Thanks Remi! @RTSandberg, as a work-around for ADIOS2, can you please try whether using two diagnostics - one with only particles and one with only fields - works for ADIOS2 as well?
I'll take this on in the 2nd half of the week with the set of inputs reported by Ryan: https://github.com/ECP-WarpX/WarpX/issues/3389#issuecomment-1301065526
Update: could not dig into this during the week - will pick up after my vacations.
@ax3l Yes, it seems that this work-around did work in my case. Thanks for the suggestions.
Here is the corresponding input script: inputs.txt
So in summary, there seem to be two potential work-arounds (a quick read-back check over all snapshots is sketched below this list):
- splitting the diagnostics into a fields-specific and a particle-specific diagnostic
- switching to HDF5
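As a quick read-back check, the following sketch with openPMD-viewer loops over all BTD snapshots; corrupted ones raise the broadcast error shown earlier, while healthy ones read cleanly. (The species name beam is an assumption taken from the input decks above; adjust as needed.)

```python
from openpmd_viewer import OpenPMDTimeSeries

ts = OpenPMDTimeSeries('./diags/diag_btd/')
for it in ts.iterations:
    try:
        x, = ts.get_particle(['x'], species='beam', iteration=it)
        print(f'iteration {it}: OK, {len(x)} particles')
    except ValueError as err:
        print(f'iteration {it}: inconsistent shapes ({err})')
```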
The HDF5 workaround works for me. At one time I thought I had a case where the split-diagnostic workaround didn't work, but at present it works in every case I have tried lately.
I haven't observed this issue when using HDF5. However, with ADIOS the split diagnostic workaround is not guaranteed; I still get incomplete particle metadata sometimes.
from openpmd_viewer.addons import LpaDiagnostics
btd = LpaDiagnostics('diags/diag_btd')
btd.get_particle(iteration=0, var_list=['id','x','y','z','ux','uy','uz'])
results in
ValueError: operands could not be broadcast together with shapes (49988,) (49992,) (49988,)
Here are the batch and input script for running on 1 node, 6 GPUs on Summit: summit_batch.txt inputs.txt
It is very interesting to note that this issue does not arise if the field BTD is removed, for example by changing the line
diagnostics.diags_names = diag diag_btd fld_btd
to diagnostics.diags_names = diag diag_btd.
I observed both behaviors on Perlmutter as well
That is indeed quite weird! Thanks for the input file, @RTSandberg. I can take a look at this.
I am running an even smaller version of the above inputs on my laptop: inputs.txt
cmake -S . -B build -DWarpX_PSATD=ON -DWarpX_DIMS=RZ
cmake --build build -j 3
This finishes after 512 steps and, e.g. for lab station 1, does not write all the particle meta-data (e.g., charge & mass) properly, even though it does shut down cleanly.
That means WarpXOpenPMDPlot::SetConstParticleRecordsEDPIC is not properly called, which means the control variable is_last_flush_to_step is not properly tracked, which means isLastBTDFlush is consequently not passed to the IO backend when it is time.
Ah I see, thanks. I can trace this with the input file you shared and see what's happening with isLastBTDFlush.
@ax3l I used your input file and ran this locally.
At the 512th timestep, the snapshots are not full and hence isLastBTDFlush will not be set to 1. This is also indicated by the warning at the beginning of the output, which suggests that the snapshots will only be full at the 2836th timestep.
**** WARNINGS ******************************************************************
* GLOBAL warning list after [ FIRST STEP ]
*
* --> [! ] [BTD] [raised twice]
*
* Simulation might not run long enough to fill all BTD snapshots.
* Final step: 512
* Last BTD snapshot fills around step: 2836
* @ Raised by: ALL
*
********************************************************************************
So I ran the simulation up to 2850 timesteps and I am able to visualize the data using openpmd-viewer.
I think that's an orthogonal issue to fix - I think we should close out BTD steps (finalize particle meta-data & zero out partly filled fields) on simulation end.
With our restart strategy, we would copy the partly filled lab stations into checkpoints and keep them open (un-finalized) in case we want to continue.
Will continue focusing, for now, on the issue posted above by Ryan:
https://github.com/ECP-WarpX/WarpX/issues/3389#issuecomment-1332719750
@RTSandberg I cannot reproduce the issue with the latest openPMD-viewer + openPMD-api and running on CPU. (Works w/o an error.) Will try again on GPU on Summit...
Can reproduce on Summit :+1:
bpls -l diag_btd/openpmd_000000.bp
uint64_t /data/0/particles/beam/id {49712} = 0 / 0
double /data/0/particles/beam/momentum/x {49712} = 0 / 0
double /data/0/particles/beam/momentum/y {49712} = 0 / 0
double /data/0/particles/beam/momentum/z {49712} = 0 / 0
double /data/0/particles/beam/position/x {49712} = 0 / 0
double /data/0/particles/beam/position/y {49712} = 0 / 0
double /data/0/particles/beam/position/z {49712} = 0 / 0
double /data/0/particles/beam/theta {49712} = 0 / 0
double /data/0/particles/beam/weighting {49712} = 0 / 0
bpls -A -l diag_btd/openpmd_000000.bp
string /basePath attr = "/data/%T/"
uint8_t /data/0/closed attr = 1
double /data/0/dt attr = 1
uint32_t /data/0/particles/beam/charge/macroWeighted attr = 0
uint64_t /data/0/particles/beam/charge/shape attr = 49992
float /data/0/particles/beam/charge/timeOffset attr = 0
double /data/0/particles/beam/charge/unitDimension attr = {0, 0, 1, 1, 0, 0, 0}
double /data/0/particles/beam/charge/unitSI attr = 1
double /data/0/particles/beam/charge/value attr = -1.60218e-19
double /data/0/particles/beam/charge/weightingPower attr = 1
string /data/0/particles/beam/currentDeposition attr = "directMorseNielson"
uint32_t /data/0/particles/beam/id/macroWeighted attr = 0
float /data/0/particles/beam/id/timeOffset attr = 0
double /data/0/particles/beam/id/unitDimension attr = {0, 0, 0, 0, 0, 0, 0}
double /data/0/particles/beam/id/unitSI attr = 1
double /data/0/particles/beam/id/weightingPower attr = 0
uint32_t /data/0/particles/beam/mass/macroWeighted attr = 0
uint64_t /data/0/particles/beam/mass/shape attr = 49992
float /data/0/particles/beam/mass/timeOffset attr = 0
double /data/0/particles/beam/mass/unitDimension attr = {0, 1, 0, 0, 0, 0, 0}
double /data/0/particles/beam/mass/unitSI attr = 1
double /data/0/particles/beam/mass/value attr = 9.10938e-31
double /data/0/particles/beam/mass/weightingPower attr = 1
uint32_t /data/0/particles/beam/momentum/macroWeighted attr = 0
float /data/0/particles/beam/momentum/timeOffset attr = 0
double /data/0/particles/beam/momentum/unitDimension attr = {1, 1, -1, 0, 0, 0, 0}
double /data/0/particles/beam/momentum/weightingPower attr = 1
double /data/0/particles/beam/momentum/x/unitSI attr = 1
double /data/0/particles/beam/momentum/y/unitSI attr = 1
double /data/0/particles/beam/momentum/z/unitSI attr = 1
string /data/0/particles/beam/particleInterpolation attr = "momentumConserving"
string /data/0/particles/beam/particlePush attr = "Vay"
double /data/0/particles/beam/particleShape attr = 3
double /data/0/particles/beam/particleShapes attr = {3, 3}
string /data/0/particles/beam/particleSmoothing attr = "none"
uint32_t /data/0/particles/beam/position/macroWeighted attr = 0
float /data/0/particles/beam/position/timeOffset attr = 0
double /data/0/particles/beam/position/unitDimension attr = {1, 0, 0, 0, 0, 0, 0}
double /data/0/particles/beam/position/weightingPower attr = 0
double /data/0/particles/beam/position/x/unitSI attr = 1
double /data/0/particles/beam/position/y/unitSI attr = 1
double /data/0/particles/beam/position/z/unitSI attr = 1
uint32_t /data/0/particles/beam/positionOffset/macroWeighted attr = 0
float /data/0/particles/beam/positionOffset/timeOffset attr = 0
double /data/0/particles/beam/positionOffset/unitDimension attr = {1, 0, 0, 0, 0, 0, 0}
double /data/0/particles/beam/positionOffset/weightingPower attr = 0
uint64_t /data/0/particles/beam/positionOffset/x/shape attr = 49992
double /data/0/particles/beam/positionOffset/x/unitSI attr = 1
double /data/0/particles/beam/positionOffset/x/value attr = 0
uint64_t /data/0/particles/beam/positionOffset/y/shape attr = 49992
double /data/0/particles/beam/positionOffset/y/unitSI attr = 1
double /data/0/particles/beam/positionOffset/y/value attr = 0
uint64_t /data/0/particles/beam/positionOffset/z/shape attr = 49992
double /data/0/particles/beam/positionOffset/z/unitSI attr = 1
double /data/0/particles/beam/positionOffset/z/value attr = 0
uint32_t /data/0/particles/beam/theta/macroWeighted attr = 0
float /data/0/particles/beam/theta/timeOffset attr = 0
double /data/0/particles/beam/theta/unitDimension attr = {0, 0, 0, 0, 0, 0, 0}
double /data/0/particles/beam/theta/unitSI attr = 1
double /data/0/particles/beam/theta/weightingPower attr = 0
uint32_t /data/0/particles/beam/weighting/macroWeighted attr = 1
float /data/0/particles/beam/weighting/timeOffset attr = 0
double /data/0/particles/beam/weighting/unitDimension attr = {0, 0, 0, 0, 0, 0, 0}
double /data/0/particles/beam/weighting/unitSI attr = 1
double /data/0/particles/beam/weighting/weightingPower attr = 1
double /data/0/time attr = 0
double /data/0/timeUnitSI attr = 1
string /date attr = "2022-12-19 10:15:36 -0500"
string /iterationEncoding attr = "fileBased"
string /iterationFormat attr = "openpmd_%06T"
string /meshesPath attr = "fields/"
string /openPMD attr = "1.1.0"
uint32_t /openPMDextension attr = 1
string /particlesPath attr = "particles/"
string /software attr = "WarpX"
string /softwareVersion attr = "22.12-7-gc3eb6ea1efef"
uint64_t __openPMD_internal/openPMD2_adios2_schema attr = 0
Since this shows up in the first lab station, we can simplify to:
...
diagnostics.diags_names = diag_btd fld_btd # was: diag diag_btd fld_btd
...
diag_btd.num_snapshots_lab = 1 # was: 100
fld_btd.num_snapshots_lab = 1 # was: 100
...
and run faster: inputs.txt
The problem disappears if I remove fld_btd, so there is some side-effect going on here.
Seems important to run it with 6 MPI ranks (and/or with CUDA) to trigger it with these inputs.
I am generally able to use the particle BTD in RZ with ADIOS as the openPMD backend without issue as long as I don't also get field output, i.e. set fields_to_plot = none. I just found an instance where this configuration still leads to inconsistent numbers of macroparticles. It is difficult to reproduce and this is not a great example. I record it here just as evidence that the ADIOS + fields_to_plot = none workaround is not guaranteed to work.
Met with @RemiLehe and @n01r to discuss the issue further. We realized that it is not the metadata that is necessarily inconsistent, but that the particle arrays themselves are corrupted. For example, if we run
for ii, iteration in enumerate(ts_l.iterations):
    beam_ws, = ts_l.get_particle(iteration=iteration, var_list=['w'], species='beam')
    print(f'iteration {iteration}: # particles = {len(beam_ws)}')
we get the output
iteration 0: # particles = 15239
iteration 1: # particles = 49478
iteration 2: # particles = 50000
iteration 3: # particles = 50000
iteration 4: # particles = 50000
iteration 5: # particles = 50000
iteration 6: # particles = 50000
iteration 7: # particles = 1375
input and batch scripts: batch.txt inputs.txt
Yes, that is correct. I forgot to post this here and only mentioned it on Slack:
In my debugging, for one corrupted BTD step I saw, the last resizing storeChunk calls for particle data are issued but do not end up on disk.
That means the meta-data that is set in that last append of particles is correct, but the variable (data) is too small.
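One way to check this on the reading side is to compare the declared dataset shape with the chunks that actually ended up on disk. The snippet below is only a sketch with the openpmd-api Python bindings (it assumes a version that exposes available_chunks(), a species named beam, and the file-based ADIOS2 layout used above):

```python
# Sketch: compare the declared dataset shape with the chunks actually written.
# Assumptions: openpmd-api with available_chunks(), species "beam",
# file-based ADIOS2 BTD output as above.
import openpmd_api as io

series = io.Series("diags/diag_btd/openpmd_%T.bp", io.Access.read_only)
rc = series.iterations[0].particles["beam"]["position"]["x"]

print("declared shape:", rc.shape)
total = 0
for chunk in rc.available_chunks():
    print("  chunk offset", chunk.offset, "extent", chunk.extent)
    total += chunk.extent[0]
print("total elements written:", total)
```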
> Since this shows up in the first lab station, we can simplify to:
> ...
> diagnostics.diags_names = diag_btd fld_btd # was: diag diag_btd fld_btd
> ...
> diag_btd.num_snapshots_lab = 1 # was: 100
> fld_btd.num_snapshots_lab = 1 # was: 100
> ...
> and run faster: inputs.txt
> The problem disappears if I remove fld_btd, so there is some side-effect going on here. Seems important to run it from 6 MPI ranks (and/or with CUDA) to trigger with the inputs.
I used this input file, and I see that the number of particles is consistent:
../bpls -l diag_btd/openpmd_000000.bp
uint64_t /data/0/particles/beam/id {49992} = 1 / 50000
double /data/0/particles/beam/momentum/x {49992} = -1.17892e-19 / 1.17892e-19
double /data/0/particles/beam/momentum/y {49992} = -1.10237e-19 / 1.10237e-19
double /data/0/particles/beam/momentum/z {49992} = 4.15973e-19 / 6.32712e-19
double /data/0/particles/beam/position/x {49992} = -7.29641e-05 / 7.29641e-05
double /data/0/particles/beam/position/y {49992} = -8.36205e-05 / 8.36205e-05
double /data/0/particles/beam/position/z {49992} = -0.000179486 / -1.4209e-05
double /data/0/particles/beam/theta {49992} = -3.14159 / 3.14159
double /data/0/particles/beam/weighting {49992} = 0.12483 / 0.12483
../bpls -A -l diag_btd/openpmd_000000.bp | grep shape
uint64_t /data/0/particles/beam/charge/shape attr = 49992
uint64_t /data/0/particles/beam/mass/shape attr = 49992
uint64_t /data/0/particles/beam/positionOffset/x/shape attr = 49992
uint64_t /data/0/particles/beam/positionOffset/y/shape attr = 49992
uint64_t /data/0/particles/beam/positionOffset/z/shape attr = 49992
Testing a fix in #3657
Idea for a fast test case of the issue (if we need it again):
- simulation box filled to 1/3rd with particles; the lower 2/3rds have no particles
- 2 MPI ranks, distributed along x
- BTD diagnostics for particles
Please note that ADIOS2 master now has support for Joined Arrays (in BP4 only for now), where you don't need to specify the Shape on the writing side (nor the Starts), and the reader will put all blocks into a Global Array. This is basically Local Arrays where the blocks are joined together in one dimension to form a Global Array, which is a good match for this use case. https://github.com/ornladios/ADIOS2/pull/3466