Config validation still rudimentary
One of picongpu's command line options, `--validate`, can in principle validate all the command line options that are specified in the `.cfg` files.
This validation is still pretty rudimentary, as it so far cannot check:
- if `<species>_calorimeter.period` and `<species>_calorimeter.file` appear the same number of times
- if the full box size of the sim is valid with the supercell size that has been compiled in
- if all tbg variables that have been defined are actually used (which @psychocoderHPC says is hard, since picongpu doesn't know anything about tbg and the variables also do not need to be prefixed with `TBG_`)
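For the first item on that list, a purely textual pre-check seems feasible even without involving picongpu itself. Below is a minimal, hypothetical sketch that counts how often a paired pair of options occurs on a command line; the option names mirror the `<species>_calorimeter.period` / `.file` pair above, and the parsing is intentionally naive (it does not handle quoting or tbg variable expansion):

```python
# Hypothetical sketch: check that paired plugin options appear equally
# often on a .cfg-style command line.  Naive token matching only.
import re

def count_option(cmdline: str, option: str) -> int:
    """Count how often an option like 'e_calorimeter.period' occurs."""
    return len(re.findall(r"--" + re.escape(option) + r"\b", cmdline))

def paired_options_balanced(cmdline: str, species: str) -> bool:
    """True if <species>_calorimeter.period and .file occur equally often."""
    period = count_option(cmdline, f"{species}_calorimeter.period")
    file_ = count_option(cmdline, f"{species}_calorimeter.file")
    return period == file_
```

A real check would of course have to run after tbg has expanded all variables, which is part of why this is hard to do inside picongpu.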
So, are there ways to make this better, so that as many mistakes as possible can be caught before the actual run of the job?
Not saying the following approach is any good, and of course it is not a replacement for proper validation, but it might ease the pain in some cases. So I am not necessarily suggesting to do that, just sharing my thoughts.

Imagine one has a large simulation to do, but for some reason it is 'risky' in the sense that it may realistically fail. For example, it uses a new plugin which is not well-tested yet, or it is just a new type of simulation and something may not be adjusted well. In case it crashes in the very first iterations, not only is no valuable information obtained, but the user's time and a lot of computational time is wasted; it might also spend a lot of time just sitting in the queue. To eliminate these early-iteration crashes due to wrong parameters, one could first run the same simulation for, e.g., 10 or 100 time steps, which takes much less computational time, and probably less time in the queue as well. Then, in case that went well, one would enqueue the proper large simulation. This works kind of like insurance: if everything was actually fine, one just did one small unnecessary run and wasted a little bit of one's own and computational time; but in case of an early crash, the losses are cut by a lot. Of course, this could be automated with a 'validated run' script which does this routine: queue up the small run, check it, and if everything is fine queue up the large run.
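The 'validated run' routine described above can be sketched in a few lines. Note that the actual submission mechanism (tbg invocation, how the number of time steps is overridden in the `.cfg`) is deliberately abstracted away behind a hypothetical `submit` callable, since those details vary per setup:

```python
# Sketch of the 'validated run' idea: submit a short pilot job first and
# submit the full job only if the pilot succeeds.  The submit callable is
# a hypothetical placeholder for the real tbg/batch-system interaction.
from typing import Callable

def validated_run(submit: Callable[[int], int],
                  pilot_steps: int = 100,
                  full_steps: int = 100_000) -> bool:
    """submit(steps) runs a job for `steps` time steps and returns its
    exit code.  Returns True if the full run was submitted and succeeded."""
    if submit(pilot_steps) != 0:
        print("pilot run failed, not submitting the full simulation")
        return False
    return submit(full_steps) == 0
```

In practice, `submit` would wrap a `tbg` call with a modified steps variable and then poll the batch system for the job result.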
@sbastrakov What you are suggesting is basically what I am doing most of the time. Either I go 2D with the full size to really get information on how the sim is evolving, or I go for a single/few-GPU 3D case where boundary effects will basically destroy any physical results from the sim, but then this is just a test of whether all of the plugins actually work.
I also tried the `--validate` option and got fooled once, since it didn't catch the critical parts.
So maybe we can make this validation more thorough. =)
Just as a side note, imho I would not invest much more in the command line options. As noted in https://github.com/ComputationalRadiationPhysics/picongpu/issues/2421#issuecomment-349956245, we should implement runtime input files for all but the most rudimentary options.
(Performance-wise, only rank 0 should read those input file(s) and broadcast them.)
Of course, feel free to submit re-factorings and additional checks if it helps in the current situation and with little work invested!
> if `<species>_calorimeter.period` and `<species>_calorimeter.file` appear the same number of times

Dunno, is that even valid syntax? Sounds like one would need to call `Plugin::load()` for that, which is also allowed to allocate a lot of stuff. So probably not easy.
> if the full box size of the sim is valid with the supercell size that has been compiled in

That check can most certainly be moved! The same goes for the `gridDist` check.
> if all tbg variables that have been defined are actually used

Not easy. Not all `tbg` vars end up in the program options; some control restarts, some control job script "comments", etc.
@n01r Yeah, some users of picador also did that; that's why we created the script I described, to at least automate that a little bit, so that a user does not need to manually check and then submit the full simulation. I guess the validation could be improved (hard for me to judge), but there may be no silver bullet: some things can only be properly checked once the simulation has really started.
@ax3l Right, any approach that is different from the current tbg-cfg workflow would also be nice.
At some point even nicely structured CFG files become pretty confusing if you have enough diagnostics.
Runtime input files in python ftw!
I have some idea how we can build on the plans I mentioned above and @codingS3b 's & @jkelling 's work to add support for PICMI (new repo) with PIConGPU.
(See an early LWFA example.)
Please talk to me if anyone is interested to invest work into this.
Wow, that PICMI looks cool. Not sure I could invest work into this, but definitely interested.
I think time-span wise it makes sense to look into this around the end of the year, beginning next year with the implementation work. By then we will have the majority of @codingS3b 's & @jkelling 's work mainlined to build upon. We can also make this a student's project. Isolated into #2715 for documentation and planning.
It seems this still does not work, right? (My branch was cloned from dev in April 2019.)
A small sim with the `--validate` option ran successfully yesterday, and it did not report any problems (at least I didn't see any; the report should have been in stdout or stderr, right?). But the `calorimeter.file` and `calorimeter.filter` options were missing (the real sim crashed).
Or did I do anything wrong?
@wmles the idea behind `--validate` is to only check that your command-line parameters (set in `.cfg`) are syntactically correct. Unfortunately, there is no good way of checking logical correctness without starting a simulation. On a similar note, some other PIC codes support a mode where a simulation is actually started, but memory is not allocated for the fields and particles, and the latter are not generated. Then things like missing files, or variables in `.param` files being in an invalid range, could be checked for. However, even this may not be that easy to implement, and it is bug-prone on its own.
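The 'start but do not allocate' mode mentioned above can be illustrated with a small conceptual sketch. All names here are hypothetical; the point is only that the cheap setup-time checks (paths, option consistency) run in both modes, while the expensive allocation is skipped in the dry run:

```python
# Conceptual sketch of a dry-run mode: setup-phase checks run, the
# expensive allocation and particle generation are skipped.
# All names are hypothetical, not actual PIConGPU API.
import os

def start_simulation(output_dir: str, dry_run: bool = False) -> str:
    # Checks that can run without allocating anything:
    if not os.path.isdir(output_dir):
        raise ValueError(f"output directory {output_dir!r} does not exist")
    if dry_run:
        return "validated"          # stop before touching memory
    fields = [0.0] * 1_000_000      # stand-in for the real field allocation
    return f"running with {len(fields)} cells"
```

As noted, the hard part in a real code is making sure the dry-run path exercises exactly the same checks as the real startup without diverging over time.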
@sbastrakov I think @wmles meant the run-time/command-line parameter of the calorimeter plugin (set via `.cfg`). As far as I know, a missing `filter` flag (if required) should throw a syntactical error. Or am I mistaken?
Afair, such sub-options likely have a default and it is just not properly documented in that plugin.
Update: hm, looks like it does not. Please open a plugin-specific issue if that is a bug: https://github.com/ComputationalRadiationPhysics/picongpu/blob/c520e8348ee29802aac6168a09e8e3d5fe97ae35/include/picongpu/plugins/particleCalorimeter/ParticleCalorimeter.hpp#L588-L589
@ax3l I believe this code is not called during argument parsing, but only afterwards: from `pluginLoad()` when a simulation is starting. That `validateOptions()` method is only in the hierarchy for `picongpu::plugins::multi::IHelp`, not for `ISimulationPlugin`. Edit: the next sentence was wrong ~~(Side note: in all places, the method is accidentally hidden in the derived classes, but never used polymorphically as of now)~~. So we could probably put it into the hierarchies, override it properly, and call it from the parsing, in case that is what we want.
I think this is really the only proper way to check. Simple checks on the level of Boost.Program_options do not work well for plugins, since the parameters are somewhat conditionally required, i.e. only when a plugin is enabled. Thus they cannot be made required in terms of Boost.Program_options and so have to have default values.
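The 'conditionally required' problem is not specific to Boost.Program_options; the same pattern shows up in any parser. As an analogy (in Python's argparse, with made-up option names), the option cannot be marked required at declaration time, so correctness has to be enforced by a separate post-parse validation step:

```python
# Illustration of 'conditionally required' options: calorimeter.file is
# only required when the plugin is enabled (period set), so it cannot be
# declared required and must be checked after parsing.  The option names
# are made up for the example and are not the actual PIConGPU CLI.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--calorimeter.period", dest="period", default=None)
parser.add_argument("--calorimeter.file", dest="file", default=None)

def validate(args: argparse.Namespace) -> None:
    plugin_enabled = args.period is not None
    if plugin_enabled and args.file is None:
        raise ValueError(
            "calorimeter.file is required when the plugin is enabled")
```

This is exactly the kind of check that a per-plugin `validateOptions()` hook would own, instead of relying on the option declarations themselves.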
Just to clarify: yes, I got an error like this

```cpp
std::runtime_error( name + ": parameter filter and period are not used the same number of times" );
```

even twice:
- first it crashed with a missing "file" name
- I added `calorimeter.file` and ran a small case with the `--validate` flag --> no warnings shown
- and then it crashed once more with a missing "filter"
And I may have overlooked it, but I believe the first time there was only the "missing filename" error and no indication that a filter is required too.
In my opinion, a default for filter and filename would do no harm. Users who want to use another filter will know it, and most newbie users don't need to bother...
I'll bring this issue up again, since it pertains strongly to the usability of the software:
We need a validation of plugin options, at the very least of the defaults, when executing with `--validate`.
Specific example I observed: given that openPMD, the now de-facto library for outputs in PIConGPU, was compiled only with HDF5 and ADIOS1 support (the ADIOS2 compilation was behaving weirdly), and that ADIOS1 is not supported, it should not have been possible to run PIConGPU with output/checkpointing at all.
In the specific case, providing only the outputs I'd like to see (`--openPMD.sources` etc.) and the frequency with which they are to be dumped, as well as specifying the frequency of checkpointing, passed `--validate` without complaints. But after running for ~40 min it failed when writing checkpoints, complaining that ADIOS1 is not supported as a backend (checkpointing was set to start just before the 'init phase' of the simulation ends).
The way I see it, this is not a problem of a user-supplied parameter which needs to be propagated to the proper plugin, but a problem of a plugin's own default settings.
Would it actually be possible to trigger a simple validation of the plugins' default parameters, without parsing the command line values, when running with `--validate`?
I think we can do proper parsing if there is a hook for plugins, similar to the hook for adding command-line parameters. It can either be specifically for validation (and then it can be optional), or it can be the only place to grab and validate parameters from the command-line-args class. The latter would be more work, but would give more thorough checks.
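The hook idea can be sketched as a simple validator registry: each plugin registers a callback, and `--validate` runs all of them and collects the errors instead of aborting at the first one. This is a language-agnostic sketch (names hypothetical), not the actual PIConGPU plugin API:

```python
# Sketch of a plugin validation hook: plugins register validators that
# are invoked during --validate, analogous to how plugins already
# register their command-line options.  All names are hypothetical.
from typing import Callable, Dict, List

_validators: List[Callable[[Dict[str, str]], None]] = []

def register_validator(fn: Callable[[Dict[str, str]], None]) -> None:
    """Called by each plugin at registration time."""
    _validators.append(fn)

def validate_all(options: Dict[str, str]) -> List[str]:
    """Run every registered validator; collect all error messages so
    the user sees every problem at once, not just the first."""
    errors = []
    for fn in _validators:
        try:
            fn(options)
        except ValueError as e:
            errors.append(str(e))
    return errors
```

Collecting all errors in one pass would also have helped in the `calorimeter.file` / `calorimeter.filter` case above, where the missing options surfaced one crash at a time.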