Proposal for data validation syntax
This PR proposes a syntax for data validation as part of the scenario-processing infrastructure.
This PR is intended as a minimum viable product for scenario data validation. This feature is not yet supported by the nomenclature package, but will be added as a new class DataValidator once we reach agreement about the syntax.
The proposed syntax tries to strike a balance between readability and flexibility, using a nested yaml-style syntax to define
- filters: any of model, scenario, region, variable, unit, year
- bounds: upper_bound, lower_bound [value, rtol to be supported]
Any datapoint in an IAMC-style timeseries format matching the given filters must satisfy the bounds, otherwise an error is raised. The structure directly matches the signature of the method IamDataFrame.validate() so that the implementation can build on the existing functionality. For simplicity, alternative kwargs (value, rtol) will be added to the validate() method for more direct configuration.
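For illustration, a single validation item combining the filters and bounds above might look like this (the variable, region, year, and bound values are hypothetical); each key corresponds to a filter or bound argument of `validate()`:

```yaml
# all Final Energy datapoints for the World region in 2020 must lie between the two bounds
- variable: Final Energy
  region: World
  year: 2020
  lower_bound: 0
  upper_bound: 1000
```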
The syntax works as follows:
- yaml and csv files in a folder `validation` (or its subfolders) in a workflow repository
- a list of yaml dictionaries with (some of) the arguments specified above (see the final-energy prototype)
- (optional) a nested structure where arguments in the upper level (variable in the emissions prototype) are combined with all lower-level dictionaries (years and regions)
- a `file` attribute in the yaml dictionary to import validation attributes from a csv file (with `#` as comment)
```yaml
# simple validation item
- filter-dimension: filter-value-A
  validation-argument: validation-value-A

# named validation item
- <description of validation item B>:
    filter-dimension: filter-value-B
    validation-argument: validation-value-B

# named nested validation items
- <description of validation item C>:
    filter-dimension: filter-value-C
    validation-argument: validation-value-C
    <description of nested validation D>:
      filter-dimension: filter-value-D
      validation-argument: validation-value-D
    <description of nested validation E>:
      filter-dimension: filter-value-E
      validation-argument: validation-value-E
```
This structure will yield four validation items:
- A (name: None)
- B (name: description of validation item B)
- C & D (name: description of validation item C - description of nested validation D)
- C & E (name: description of validation item C - description of nested validation E)
The name could be used when reporting failed validation of a scenario.
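For illustration, a concrete nested item following the pattern above (with hypothetical variable, regions, years, and bound values) could look like this:

```yaml
- Plausibility range for CO2 emissions:
    variable: Emissions|CO2
    World in 2020:
      region: World
      year: 2020
      lower_bound: 30000
      upper_bound: 45000
    Asia (R5) in 2020:
      region: Asia (R5)
      year: 2020
      lower_bound: 15000
      upper_bound: 25000
```

This would yield two validation items, both filtering on `Emissions|CO2`, named "Plausibility range for CO2 emissions - World in 2020" and "Plausibility range for CO2 emissions - Asia (R5) in 2020".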
Going forward, we can also implement more features:
- a keyword argument `required`
- direct import of a csv file with all relevant attributes (risk of duplication and inconsistency)
@phackstock @gunnar-pik @Renato-Rodrigues @orichters @robertpietzcker - please let me know if this is a useful step towards automated validation of scenario submissions...
Maybe as inspiration: @pweigmann has worked on a similar approach with a config file that looks like this.
I like the following features of our approach:
- matching of variables:
  - `Price|**` means all sub-variables have to satisfy a condition such as having `min = 0`
  - `Price|*` with just one `*` means only one chain, so matches `Price|Final Energy` but not `Price|Final Energy|Industry`
- scenario-specific variables (such as net Zero 2050, or `Temperature|Global Mean < 1.5` in 2100 for a `Below 1.5°C` scenario)
- comparison between scenarios (all variables in all scenarios must be equal for period <= 2020, for example)
- comparison to reference periods (in 2030, not more than 20% reduction compared to the year 2020, for example)
- The yaml format seems very nice.
Thanks @orichters, yes, I've seen your format before and we want to develop in this direction too (and I hope that the yaml file is less heavy and more reliable for forward/backward compatibility).
- matching of variables: `Price|**` means all sub-variables have to satisfy a condition such as having `min = 0`
- `Price|*` with just one `*` means only one chain, so matches `Price|Final Energy` but not `Price|Final Energy|Industry`
This is already implemented: `*` is interpreted as a wildcard, and you can pass a `level` argument to specify how "deep" the filter works on the hierarchy, see here.
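As a sketch (assuming the wildcard is passed through to the filter as described), a validation item covering all sub-variables of `Price` could look like this:

```yaml
# every sub-variable of Price (e.g. Price|Final Energy|Industry) must be non-negative
- variable: Price|*
  lower_bound: 0
```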
- scenario-specific variables (such as net Zero 2050, or Temperature|Global Mean < 1.5 in 2100 for a Below 1.5°C scenario)
I didn't consider it yet, but you can pass a "model" or "scenario" filter argument.
- comparison between scenarios (all variables in all scenarios must be equal for period <= 2020, for example)
- comparison to reference periods (in 2030, not more than 20% reduction compared to the year 2020, for example)
Very useful suggestions, to be implemented in the future.
Thanks @danielhuppmann, very useful! Yes, great to loop in @pweigmann, who started this for COMMITTED and will also be involved in the SCI project, and also @PhilippVerpoort, who will join SCI as well. I would prefer editing the upper/lower threshold levels in a classical spreadsheet format like csv over working with yamls. Since we will end up with a large number of entries, a table format would make it easier to keep an overview. So it is good to have functionality to read in csvs.
Hello @danielhuppmann, always fascinating to see when different people come up with a similar solution to the same problem, it does inspire confidence that this type of tool can be useful! On the other hand, it also means a lot of parallel work in different languages, I suppose.
You can follow the current development efforts of our validation tool here: https://github.com/pik-piam/piamValidation
Don't hesitate to reach out in case you would like to exchange ideas or learn more about what we have done so far, I could see this being a great area for collaboration.
Based on further discussions with @phackstock, I have modified the PR and the description (see at the top) to include a way to import a csv file but minimize duplication of columns/rows.
I also switched from upper_bound/lower_bound to value/rtol (still to be implemented in pyam) for better readability.
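For illustration, the same (hypothetical) check expressed with the two sets of keywords:

```yaml
# bounds given explicitly
- variable: Emissions|CO2
  year: 2020
  lower_bound: 36000
  upper_bound: 44000

# the equivalent check as a reference value with a relative tolerance
- variable: Emissions|CO2
  year: 2020
  value: 40000
  rtol: 10%
```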
Looks very good to me, would be happy to implement it like this.
If we wanted to (which I'm not sure we do), we could try to make the syntax of the validation file more compact. In the proposal below, I've changed two things:
- Moved the variable to be a top-level value
- Put the individual validations as list items, so that they don't require a keyword anymore
```yaml
- Emissions|CO2|Energy and Industrial Processes:
    - region: World
      rtol: 5%
      file: data_emissions_global.csv
    - region: Asia (R5)
      year: 2020
      rtol: 10%
      value: 20520
```
This would save 3 lines compared to the current proposal. If it makes readability worse, we should stick to the current format, though.
Thanks @phackstock - I'm hesitant to define any dimension implicitly: first, I think it's better for readability to always write "variable: ...", and second, we may run into a use case where the variable is not the primary sorting dimension, which will then make life difficult...
@danielhuppmann fair point about the variable. Regarding your point on having a use case where the variable is not the main dimension, I'm not sure if we'd want to put everything into the same file anyway. If we're trying to make one format that fits every possible use case, I'm afraid we'd end up with something pretty unwieldy.
What do you think about my second point of moving the constraints into a list rather than having to give them names? So doing this:
```yaml
- Historical fossil CO2 emissions data:
    variable: Emissions|CO2|Energy and Industrial Processes
    constraints:
      - region: World
        rtol: 5%
        file: data_emissions_global.csv
      - region: Asia (R5)
        year: 2020
        rtol: 10%
        value: 20520
```
instead of:
```yaml
- Historical fossil CO2 emissions data:
    variable: Emissions|CO2|Energy and Industrial Processes
    World:
      region: World
      rtol: 5%
      file: data_emissions_global.csv
    Asia (R5):
      region: Asia (R5)
      year: 2020
      rtol: 10%
      value: 20520
```
To me, using `constraints` (or any other keyword that might fit better) looks a bit cleaner, and if there are a lot of constraints, you'd save a lot of lines and, I think, improve readability.
Short note: I think it is important that we can use multiple threshold levels, especially as we move to the vetting of near-term projections - higher and lower, and also soft constraints (yellow traffic light) and hard constraints (red traffic light). So would this be added as `lim_lower_yellow`, `lim_upper_red` or similar?
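Purely as an illustration of this question (the keyword names follow the suggestion above and are not part of the current proposal), such multi-level thresholds might look like this:

```yaml
- variable: Emissions|CO2
  year: 2020
  # hard constraints (red traffic light)
  lim_lower_red: 30000
  lim_upper_red: 50000
  # soft constraints (yellow traffic light)
  lim_lower_yellow: 36000
  lim_upper_yellow: 44000
```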