BOUT-dev
Replace DataFile/Dataformat with OptionsNetCDF
Replace all the existing `DataFile`/`DataFormat` infrastructure with `OptionsNetCDF`.

Closes #1254

Currently:
- restart files moved to `OptionsNetCDF`
- additional `BOUT.dmp.experimental.*.nc` files using `OptionsNetCDF` -- separate files for now, just for testing!
- time-evolving variables added to `Solver` are written
- build config is written
- method for checking that all time-evolving variables have the same sized time dimension
Missing:
- ~~user-added variables, using e.g. `SAVE_ONCE` or `SAVE_REPEAT`~~
- upgrade tool -- EDIT: this might not happen, as it seems quite tricky to upgrade automatically
- direct read/write method -- direct read is not yet implemented, but direct write can be done with e.g. `OptionsNetCDF{}.write({{"key", value}})`
- ~~single method for adding a time-evolving variable to `Options`~~
One possibly blocking issue at the moment is that `OptionsNetCDF` writes all the keys in lowercase. This causes problems with e.g. `collect`, which expects `BOUT_VERSION` to be in uppercase, and so thinks the files are pre-v0.2! That's a simple fix for that one variable in `collect`, but I am a little worried more generally about which tools expect certain variables in the files to be written in uppercase. @johnomotani what does xBOUT do?
We also need to decide on exactly what the interface will be for users to add variables. In the current form, there's now an `Options` member of `PhysicsModel`, `output_options`, which gets written to file in `PhysicsModel::PhysicsModelMonitor::call`.

My feeling is we probably want a `PhysicsModel::outputVars(Options&)` method which gets called by `PhysicsModelMonitor`.
> what tools expect certain variables in the files to be written in uppercase. @johnomotani what does xBOUT do?
xarray/xBOUT are case-sensitive at the moment. I guess we could work around it by renaming everything to lower case when we load. I'd prefer to keep the output case-sensitive though (otherwise, for example, variable names will be converted to lower case when printed, which I would find surprising as a user).
Actually I would vote for making Options case-sensitive (I find that behaviour less surprising myself, and it makes the code slightly simpler), although it's not backward compatible. If we make warnings about unused options more prominent (e.g. print at the beginning of the run, after the first rhs call, and also at the end of the run), I think that would mitigate the worry about getting unexpected behaviour because an option was typed with the wrong case.
I do think case-sensitive `Options` should be considered, though I would be more keen on it if we had a better way of dealing with unused/unknown/mistyped options.

If we didn't want to do that, we could probably modify `Options` to keep the original case but make reads case-insensitive, by changing the option name from `std::string` to something like:
```cpp
struct CaseInsensitiveKey {
  std::string original_case; // Store the original case as well as the lower case version
  std::string lower_case;
  CaseInsensitiveKey(const std::string& key) : original_case(key), lower_case(lowercase(key)) {}
  // Implicitly converts to std::string
  operator const std::string&() const { return original_case; }
  // Order by the original case -- needed for std::map
  bool operator<(const std::string& other) const { return original_case < other; }
  // When checking for equality, check the original case, then lowercase
  bool operator==(const std::string& other) const {
    if (other == original_case) {
      return true;
    }
    return lowercase(other) == lower_case;
  }
};
```
It would also need assignment, copy, and move operators, but that would probably allow us to keep case-insensitivity when reading the options file, while having case sensitivity when writing. But it's more code, and more chances for things to go wrong.

I agree it's tricky, and case sensitivity simplifies lots of things in the code. On the other hand, I would not want to have keys which differed only by case, or users thinking they had changed an input when they hadn't (because they got the case wrong).

I worry that ordering by the original case while comparing for equality against both strings would break the invariants of maps: e.g. a red-black tree lookup might not find a case-insensitive match, because it has followed the ordering based on the original case.
Perhaps a variation on your suggestion:
```cpp
struct CaseInsensitiveKey {
  std::string original_case; // Store the original case as well as the lower case version
  std::string lower_case;
  CaseInsensitiveKey(const std::string& key) : original_case(key), lower_case(lowercase(key)) {}
  // Implicitly converts to std::string
  operator const std::string&() const { return original_case; }
  // Order by the lower case -- needed for std::map
  bool operator<(const std::string& other) const { return lower_case < lowercase(other); }
  // When checking for equality, check lowercase
  bool operator==(const std::string& other) const {
    return lowercase(other) == lower_case;
  }
};
```
So we do all comparisons in lower case, but retain the original case.

Quick question: if the keys to the map are `CaseInsensitiveKey`, do we need comparisons to strings, or just between `CaseInsensitiveKey`s? That would eliminate the conversion to lower case, apart from in the constructor.
If we went case-sensitive, how about something like this to help avoid errors:
- Have an option like `strict` (true or false, defaulting to false) that makes it an error to run with unused options -- that should catch any typos or attempts to use the wrong case. I think `Options` already has some method to flag unused options -- we could check whether there are any after the first rhs call and raise an error if there are, printing the unused options.
- An option like `validate_input`. To start with, this could be equivalent to just `nout=0 strict`, so that a user can check that their input file is OK by running something like `./mymodel validate_input`. Over time we could maybe add some extras, like searching for similar options (e.g. different case, one-letter typos) when there is an unused one, and asking if those were intended.
I agree keys that differ by case are horrible, but I'd be inclined to leave that as just bad practice rather than make it an error. If you'd rather be certain, we could add a check like `if (lowercase(new_key) == lowercase(existing_key)) { throw BoutException(...); }`?
I think the `strict` check @johnomotani suggests would be useful whether or not keys are case-sensitive. I'd support making that `true` by default, in which case that removes one of the arguments for case insensitivity.

I quite like your idea of checking, when creating a key, whether it matches an existing key case-insensitively. But currently keys are created automatically if they don't already exist, so this check might need to be done every time `Options::operator[]` is called, with a scan and comparison against all the keys in the dictionary. Perhaps it's not needed if we're catching unused keys.

Perhaps, as you say, this should be a good-practice thing rather than implemented in the code. We'd need to decide some convention on when to use upper case (or always use lower case?).
> this might need to be done every time an Options operator[] is called
I think it'd only be needed when (possibly) creating a new key - if the requested key is already in the map, then we'd return before getting to the check.
OK, I will make a separate PR with modifications to case sensitivity:
- remove case insensitivity (breaking change)
- add a `strict` (or better: `input:strict`?) option, defaulting to `true`, which checks for unused options
  - this could be backported, with default `false`
  - where should the check be? In `Solver::setModel`, after the call to `PhysicsModel::initialise`? Or in `Solver::solve`, before the call to `Solver::run`?
- add `validate_input` (or better: `input:validate`?), which just stops after the initial `strict` check (and, crucially, doesn't write to file!)

Adding a simple Levenshtein distance would allow us to give more helpful feedback: "`tpyo_option` not found: did you mean `typo_option`?"
Thanks @ZedThree, this sounds good!
- I vote for `input:strict` and `input:validate`, to keep things namespaced and to gradually reduce the number of switches outside sections.
- It needs to be checked after the RHS has been called once. Perhaps here: https://github.com/boutproject/BOUT-dev/blob/master/src/solver/solver.cxx#L476 but we would need to move the `run_rhs` call to before the `dump_on_restart` check, so that it is always run.
- Definitely yes to making the inputs more helpful. For each unused key, we could loop through the keys which were set (presumably to default values) and suggest the closest match.
`input:strict` sounds good, and I guess `input:validate` is the same number of characters as `validate_input`, so is better. To make `validate` more discoverable, could we maybe add a command-line option that comes up when you do `./mymodel --help`?
Yes, good idea! I'd really love for `./mymodel --help` to list all the possible options, but I think that would require a massive overhaul of how we use options. It could probably be done using init-time registration, similar to how the factories work.

We could have `input:validate` write the BOUT.settings file before exiting. That should then contain all the settings, types, and documentation.
Oh yes, that's probably good enough. Maybe just print it to screen, to avoid clobbering an existing BOUT.settings file?
This is almost ready to remove the `bout::globals::dump` instance, as well as all of `Datafile` and its implementations.

There are (at least!) two remaining sticking points:
- `LaplaceNaulin` and `LaplaceIPT` have a couple of performance diagnostics that they add to `bout::globals::dump`. These are a little tricky to refactor. My current idea is to copy an idea from `LaplaceXY`: have a `savePerformance(Solver& solver)` method that, when called, adds a `Monitor` to `solver`. Then the monitor has `outputVars(Options&)` where it can add whatever it likes. The issue is that `savePerformance` would have to be a virtual method on the base class for this to be really useful.
- The `subsampling` example probably needs to be completely rewritten, as it's heavily tied to the `Datafile` interface.
I need to write some docs on all the changes, so it's not really ready to be merged, but otherwise the code is ready for review.

This is a pretty big PR, so I'd be happy to break it up into some smaller pieces. The major changes, in rough dependency order, are:
- modifications to `Options`:
  - `assignRepeat` to handle the `"time_dimension"` attribute nicely
  - always store location/direction attributes for `Field`s
  - specialisations of `Options::as<>` for `Array/Matrix/Tensor`
- modifications to `OptionsNetCDF`:
  - handle floats and shorts
  - better handling of multiple time dimensions
  - helper functions for writing the main/default output file (`BOUT.dmp.X.nc`)
- use `OptionsNetCDF` instead of `DataFormat` in `GridFile`
- add `outputVars(Options&)` to various classes
- add `savePerformance`, monitor to `Laplacian/LaplaceXY` to replace use of `bout::globals::dump`
- use `OptionsNetCDF` instead of `bout::globals::dump` in non-model based tests
- use `OptionsNetCDF` instead of `Datafile` in `PhysicsModel`
  - also involves modifications to `Solver` to support this
  - `DataFileFacade` to make `SAVE_ONCE/REPEAT` still work inside `PhysicsModel`
  - fixes for `boutcore`
- actual removal of `Datafile`/`DataFormat`

Most bullet points could be standalone PRs.
Even with all these changes, this PR is net -7k lines!
For a future PR: refactor `GridDataSource`.

The failing test is clang-tidy-review, because there's a bug in clang-tidy that makes it choke on `LaplaceIPT` for some reason.
One of the things I don't especially like about the design here is the `outputVars` method that is now in lots of places:
- the name isn't great; I chose it because there were already existing methods with that name
- the `Options& output_file` argument is an in-out parameter

Both of these can be changed. The second one might require a bit more machinery -- we can write an `Options` to an existing file, appending new variables and overwriting existing ones, but it might be nicer to be able to merge into an existing `Options` and minimise the number of separate writes to disk.
I've also not benchmarked this PR against `next`. I suspect it may be marginally slower, but perhaps the reduction in complexity is worth it if it's not too much slower.
Some timing information from blob2d, run on 16 cores, averaged over three runs (all the numbers have been averaged; I've not recalculated the "% of top" or "Mean time" columns to be consistent). I've added some extra timers in places. In the tables below, `io` is inclusive of the other `io-` rows, which are all exclusive subsets:
- `io-write`: either `OptionsNetCDF::write()` or `Datafile::write()`
- `io-outputvars`: every function named `outputVars`, plus `setupDumpFile` in `next`
- `io-verify`: `OptionsNetCDF::verifyTimesteps` -- this is called once per output timestep to check that all time dimensions in the output file are the same length. This could be made an input option.

The following are non-exclusive subsets:
- `io-add`: `Datafile::add`, so should be ~`io-outputvars`
- `io-ncxx4`: anything inside the `Ncxx4` class
Here's this branch:
Timer name | Total time (s) | % of top | Hits | Mean time/hit (s) |
---|---|---|---|---|
run | 97.6347 | 1 | 1 | 97.6347 |
rhs | 72.2145 | 0.74 | 1555 | 0.0464402 |
invert | 16.883 | 0.17 | 1555 | 0.0108572 |
io | 2.97422 | 0.03 | 138 | 0.0215523 |
io-write | 2.55597 | 0.03 | 67 | 0.0381488 |
comms | 0.549516 | 0.01 | 12616 | 4.35571e-05 |
io-findDimension | 0.00349238 | 0 | 1045 | 3.34199e-06 |
io-outputvars | 0.0329373 | 0 | 48 | 0.000686194 |
io-verify | 0.384566 | 0 | 11 | 0.0349605 |
Standard deviation for the total time was 2.2 s, and 0.15 s for `io`.
Here's `next` (with extra timers):
Timer name | Total time (s) | % of top | Hits | Mean time/hit (s) |
---|---|---|---|---|
run | 97.1441 | 1 | 1 | 97.1441 |
rhs | 74.0155 | 0.76 | 1555 | 0.0475984 |
invert | 17.2017 | 0.18 | 1555 | 0.0110622 |
io | 2.04389 | 0.02 | 33 | 0.0619361 |
io-ncxx4 | 1.64624 | 0.02 | 1960 | 0.000839917 |
comms | 0.588739 | 0.01 | 12616 | 4.66661e-05 |
io-add | 1.24521 | 0.01 | 153 | 0.00813861 |
io-outputvars | 1.18999 | 0.01 | 5 | 0.237999 |
io-write | 0.763503 | 0.01 | 22 | 0.0347047 |
io-open | 0.0287595 | 0 | 2 | 0.0143797 |
Standard deviation for the total time was 7.7 s, and 0.21 s for `io`.
So it looks like this branch is a bit slower, and the overheads are inside `OptionsNetCDF::write`, rather than in the copying of variables into the `Options` in `outputVars`.

Also, this was all done with debug builds. A quick look at optimised builds shows `io` dropping a little, but the total run time is then ~10 s, so the `io` fraction goes up a lot.
This now has #2471 merged in, as I was testing the performance with it.
I think I've now eliminated the performance hit this PR was introducing -- it seems to have been due to re-creating the `NcFile` with every call to `OptionsNetCDF::write` (i.e. every timestep). Now the `NcFile` is a member of `OptionsNetCDF`, and the file is opened on demand and synced every timestep.

`OptionsNetCDF::read` still creates a new `NcFile` object (i.e. it immediately opens the file on disk, reads the contents, and then closes it).

I think the interaction between reading and writing here should be fine, but it is a potential cause for concern. It would probably only affect restart files, and we shouldn't have those open for both reading and writing at the same time, so it's probably fine.
Fixed conflicts, and an actual bug that clang-tidy caught. This is now ready to go in.

I'm merging this now so that we can get some actual experience with it in `next` before the imminent 5.0 release.