Proposal for data validation syntax
This PR proposes a syntax for data validation as part of the scenario-processing infrastructure.
This PR is intended as a minimum viable product for scenario data validation. This feature is not yet supported by the nomenclature package, but will be added as a new class DataValidator once we reach agreement about the syntax.
The proposed syntax tries to strike a balance between readability and flexibility, using a nested yaml-style syntax to define
- filters: any of model, scenario, region, variable, unit, year
- bounds: upper_bound, lower_bound [value, rtol to be supported]
Any datapoint in an IAMC-style timeseries format matching the given filters must satisfy the bounds, otherwise an error is raised. The structure directly matches the signature of the method IamDataFrame.validate() so that the implementation can build on the existing functionality. For simplicity, alternative kwargs (value, rtol) will be added to the validate() method for more direct configuration.
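For illustration, a single validation item combining the filters and bounds above might look like this (the variable, region, year, and bound values are hypothetical); each key corresponds to a filter or bound argument of `validate()`:

```yaml
# all Final Energy datapoints for the World region in 2020 must lie between the two bounds
- variable: Final Energy
  region: World
  year: 2020
  lower_bound: 0
  upper_bound: 1000
```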
The syntax works as follows:
- yaml and csv files in a folder `validation` (or its subfolders) in a workflow repository
- a list of yaml dictionaries with (some of) the arguments specified above (see the final-energy prototype)
- (optional) a nested structure where arguments in the upper level (variable in the emissions prototype) are combined with all lower-level dictionaries (years and regions)
- a `file` attribute in the yaml dictionary to import validation attributes from a csv file (with `#` as comment)
```yaml
# simple validation item
- filter-dimension: filter-value-A
  validation-argument: validation-value-A

# named validation item
- <description of validation item B>:
    filter-dimension: filter-value-B
    validation-argument: validation-value-B

# named nested validation items
- <description of validation item C>:
    filter-dimension: filter-value-C
    validation-argument: validation-value-C
    <description of nested validation D>:
      filter-dimension: filter-value-D
      validation-argument: validation-value-D
    <description of nested validation E>:
      filter-dimension: filter-value-E
      validation-argument: validation-value-E
```
This structure will yield four validation items:
- A (name: None)
- B (name: description of validation item B)
- C & D (name: description of validation item C - description of nested validation D)
- C & E (name: description of validation item C - description of nested validation E)
The name could be used when reporting failed validation of a scenario.
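For illustration, a concrete nested item following the pattern above (with hypothetical variable, regions, years, and bound values) could look like this:

```yaml
- Plausibility range for CO2 emissions:
    variable: Emissions|CO2
    World in 2020:
      region: World
      year: 2020
      lower_bound: 30000
      upper_bound: 45000
    Asia (R5) in 2020:
      region: Asia (R5)
      year: 2020
      lower_bound: 15000
      upper_bound: 25000
```

This would yield two validation items, both filtering on `Emissions|CO2`, named "Plausibility range for CO2 emissions - World in 2020" and "Plausibility range for CO2 emissions - Asia (R5) in 2020".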
Going forward, we can also implement more features:
- a keyword argument `required`
- direct import of a csv file with all relevant attributes (risk of duplication and inconsistency)
@phackstock @gunnar-pik @Renato-Rodrigues @orichters @robertpietzcker - please let me know if this is a useful step towards automated validation of scenario submissions...
Maybe as inspiration: @pweigmann has worked on a similar approach with a config file that looks like this.
I like the following features of our approach:
- matching of variables:
  - `Price|**` means all sub-variables have to satisfy a condition such as having `min = 0`
  - `Price|*` with just one `*` means only one chain, so matches `Price|Final Energy` but not `Price|Final Energy|Industry`
- scenario-specific variables (such as net Zero 2050, or `Temperature|Global Mean < 1.5` in 2100 for a `Below 1.5°C` scenario)
- comparison between scenarios (all variables in all scenarios must be equal for period <= 2020, for example)
- comparison to reference periods (in 2030, not more than 20% reduction compared to the year 2020, for example)
- The yaml format seems very nice.
Thanks @orichters, yes, I've seen your format before and we want to develop in this direction too (and I hope that the yaml file is less heavy and more reliable for forward/backward compatibility).
- matching of variables: `Price|**` means all sub-variables have to satisfy a condition such as having `min = 0`
- `Price|*` with just one `*` means only one chain, so matches `Price|Final Energy` but not `Price|Final Energy|Industry`
This is already implemented: `*` is interpreted as a wildcard, and you can pass a `level` argument to specify how "deep" the filter works on the hierarchy, see here.
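As a sketch (assuming the wildcard is passed through to the filter as described), a validation item covering all sub-variables of `Price` could look like this:

```yaml
# every sub-variable of Price (e.g. Price|Final Energy|Industry) must be non-negative
- variable: Price|*
  lower_bound: 0
```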
- scenario-specific variables (such as net Zero 2050, or Temperature|Global Mean < 1.5 in 2100 for a Below 1.5°C scenario)
I didn't consider it yet, but you can pass a "model" or "scenario" filter argument.
- comparison between scenarios (all variables in all scenarios must be equal for period <= 2020, for example)
- comparison to reference periods (in 2030, not more than 20% reduction compared to the year 2020, for example)
Very useful suggestions, to be implemented in the future.
Thanks @danielhuppmann, very useful! Yes, great to loop in @pweigmann, who started this for COMMITTED and will also be involved in the SCI project, and also @PhilippVerpoort, who will join SCI as well. I would prefer editing the upper/lower threshold levels in a classical spreadsheet format like csv over working with yamls. Since we will end up with a large number of entries, a table format would make it easier to keep an overview. So it is good to have functionality to read in csvs.
Hello @danielhuppmann, always fascinating to see when different people come up with a similar solution to the same problem, it does inspire confidence that this type of tool can be useful! On the other hand, it also means a lot of parallel work in different languages, I suppose.
You can follow the current development efforts of our validation tool here: https://github.com/pik-piam/piamValidation
Don't hesitate to reach out in case you would like to exchange ideas or learn more about what we have done so far, I could see this being a great area for collaboration.
Based on further discussions with @phackstock, I have modified the PR and the description (see at the top) to include a way to import a csv file but minimize duplication of columns/rows.
I also switched from upper_bound/lower_bound to value/rtol (still to be implemented in pyam) for better readability.
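For illustration, the same (hypothetical) check expressed with the two sets of keywords:

```yaml
# bounds given explicitly
- variable: Emissions|CO2
  year: 2020
  lower_bound: 36000
  upper_bound: 44000

# the equivalent check as a reference value with a relative tolerance
- variable: Emissions|CO2
  year: 2020
  value: 40000
  rtol: 10%
```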
Looks very good to me, would be happy to implement it like this.
If we wanted to (which I'm not sure we do), we could try to make the syntax of the validation file more compact. In the proposal below, I've changed two things:
- Moved the variable to be a top-level value
- Put the individual validations as list items, so that they don't require a keyword anymore
```yaml
- Emissions|CO2|Energy and Industrial Processes:
    - region: World
      rtol: 5%
      file: data_emissions_global.csv
    - region: Asia (R5)
      year: 2020
      rtol: 10%
      value: 20520
```
This would save 3 lines compared to the current proposal. If it makes readability worse, we should stick to the current format, though.
Thanks @phackstock - I'm hesitant to define any dimension implicitly: first, I think it's better for readability to always write "variable: ...", and second, we may run into a use case where the variable is not the primary sorting dimension, which will then make life difficult...
@danielhuppmann fair point about the variable. Regarding your point on having a use case where the variable is not the main dimension, I'm not sure if we'd want to put everything into the same file anyway. If we're trying to make one format that fits every possible use case, I'm afraid we'd end up with something pretty unwieldy.
What do you think about my second point of moving the constraints into a list rather than having to give them names? So doing this:
```yaml
- Historical fossil CO2 emissions data:
    variable: Emissions|CO2|Energy and Industrial Processes
    constraints:
      - region: World
        rtol: 5%
        file: data_emissions_global.csv
      - region: Asia (R5)
        year: 2020
        rtol: 10%
        value: 20520
```
instead of:
```yaml
- Historical fossil CO2 emissions data:
    variable: Emissions|CO2|Energy and Industrial Processes
    World:
      region: World
      rtol: 5%
      file: data_emissions_global.csv
    Asia (R5):
      region: Asia (R5)
      year: 2020
      rtol: 10%
      value: 20520
```
To me, using `constraints` (or any other keyword that might fit better) looks a bit cleaner, and if there are a lot of constraints, you'd save a lot of lines and, I think, improve readability.
Short note: I think it is important that we can use multiple threshold levels, especially as we move to the vetting of near-term projections - higher and lower, and also soft constraints (yellow traffic light) and hard constraints (red traffic light). So would this be added as `lim_lower_yellow`, `lim_upper_red` or similar?
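Purely as an illustration of this question (the keyword names follow the suggestion above and are not part of the current proposal), such multi-level thresholds might look like this:

```yaml
- variable: Emissions|CO2
  year: 2020
  # hard constraints (red traffic light)
  lim_lower_red: 30000
  lim_upper_red: 50000
  # soft constraints (yellow traffic light)
  lim_lower_yellow: 36000
  lim_upper_yellow: 44000
```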