common-definitions Backwards-compatibility of variable names

At the SWG meeting on 2023-12-06, Masa Sugiyama and others raised the question of how to support backward compatibility if it becomes necessary to change a variable name.

This issue is to discuss/collect ideas.

Dec 06 '23 13:12 khaeru

My suggestion:

In the NAVIGATE project, we made use of the fact that the nomenclature package (which is used with this repo) tolerates (or reads and stores?) extra entries in code lists. For example, we had:
```
- NAV_Dem-20C-all_u:
    navigate_task: T3.5
    navigate_climate_policy: 20C
    navigate_T35_policy: act+ele+tec
```
In this, navigate_T35_policy is like description, units, or other attributes.
This is analogous to/imitates the SDMX concept of an Annotation.
We should simply specify a common annotation ID that would contain one older/superseded/alias variable name or a list of them. For instance:
```
- Final Energy|Foo|Bar:
    iamc-variable-superseded: |
      Final Energy|Bar|Foo
      Final Energy|Foo Bar
```
It could be iamc-variable-synonym, iamc-variable-old, or anything—I don't have any strong preference here.
Code that needs to handle older data could then access these annotations for info on the correspondence of old and current names, for instance to construct a "mapping" or "table", perform replacement, or whatever makes sense in a particular implementation.
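
As a rough sketch of the "mapping" idea, assuming the entries have already been loaded from YAML into plain Python objects: the `CODES` list below is a hypothetical stand-in for such loaded data, `build_alias_map` and `update_names` are illustrative helpers rather than part of nomenclature, and the annotation ID is only the placeholder name used above.

```python
# Code list entries as they might look after loading the YAML above.
# Hypothetical stand-in: the YAML "|" block scalar becomes a multi-line string.
CODES = [
    {
        "Final Energy|Foo|Bar": {
            "iamc-variable-superseded": "Final Energy|Bar|Foo\nFinal Energy|Foo Bar\n",
        }
    },
    {"Final Energy|Baz": {"unit": "EJ/yr"}},
]

ANNOTATION_ID = "iamc-variable-superseded"  # assumed; the name is not yet agreed


def build_alias_map(codes, annotation_id=ANNOTATION_ID):
    """Return a dict mapping each old variable name to its current name."""
    alias_map = {}
    for entry in codes:
        for current, attrs in entry.items():
            # One old name per line; entries without the annotation yield nothing
            for old in (attrs or {}).get(annotation_id, "").splitlines():
                old = old.strip()
                if not old:
                    continue
                if old in alias_map and alias_map[old] != current:
                    raise ValueError(f"{old!r} is listed as an old name for two variables")
                alias_map[old] = current
    return alias_map


def update_names(names, alias_map):
    """Replace old names with current ones; leave other names unchanged."""
    return [alias_map.get(name, name) for name in names]


alias_map = build_alias_map(CODES)
print(update_names(["Final Energy|Bar|Foo", "Final Energy|Baz"], alias_map))
```

Raising an error when one old name points at two current names is a design choice for this sketch; an implementation could equally warn or keep the first match.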

A minimal working example (MWE) using SDMX:

```
import sdmx
import sdmx.model.v21 as m

# Create a Code whose ID is a current variable name
c = m.Code(id="Final Energy|Foo|Bar")

# Create an annotation containing old/superseded variable names
ann = m.Annotation(
    id="iamc-variable-old",
    text="\n".join(
        ["Final Energy|Bar|Foo", "Final Energy|Foo Bar"],
    )
)
c.annotations.append(ann)

# Write to file
cl = m.Codelist(id="VARIABLE", name="IAMC variable name")
cl.append(c)
msg = sdmx.message.StructureMessage()
msg.add(cl)
with open("example.xml", "wb") as f:
    f.write(sdmx.to_xml(msg, pretty_print=True))
```

This gives output like:

```
…
  <str:Code id="Final Energy|Foo|Bar">
    <com:Annotations>
      <com:Annotation id="iamc-variable-old">
        <com:AnnotationText xml:lang="en">Final Energy|Bar|Foo
Final Energy|Foo Bar</com:AnnotationText>
      </com:Annotation>
    </com:Annotations>
  </str:Code>
```

And can be read and used like:

```
# Read the file, retrieve the codelist
>>> msg = sdmx.read_sdmx("example.xml")
>>> cl = msg.codelist["VARIABLE"]

# Retrieve a specific variable name
>>> c = cl["Final Energy|Foo|Bar"]
>>> c
<Code Final Energy|Foo|Bar>

# Retrieve the list of old names from the annotation
>>> c.eval_annotation("iamc-variable-old").split("\n")
['Final Energy|Bar|Foo', 'Final Energy|Foo Bar']
```

Dec 06 '23 13:12 khaeru

Do I understand correctly that we can, in principle, add as many entries as we want? The old examples of the ENGAGE and NAVIGATE templates only seem to have the entries "description" and "unit", but you say we could also add extra entries for storing the 'old' name. Similarly, we could then create extra entries to denote maximum and minimum allowed per-capita values, and aliases with other data structures (e.g. the iTEM transport variable names or similar).

Dec 06 '23 21:12 christophbertram

@christophbertram I say we should agree on as many common annotations as we need, and that doing so is a feature of the SDMX standard (and supported by tools that implement it). What I don't know is whether the nomenclature tool that @phackstock and @danielhuppmann have developed supports access and use of such annotations: I only know we can put such entries in YAML files such as appear in this repo and they will be tolerated by nomenclature, i.e. it won't error when trying to read the files.

Per full-resolution keys: yes, exactly. I hope we can provide a proof-of-concept when linking the iTEM structure info to this repo.

Per "minimum and maximum allowed values per capita"—I think that is actually data, not structure. You can imagine an IAMC-structured table (or with fewer or more dimensions, e.g. possibly without YEAR or REGION) in which the numbers are not "actual observed historical values" nor "model-projection values" but "expected {minimum,maximum} per capita values". One could imagine having different sets of such values for different purposes, even when the same variable names are used.
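
To illustrate the "data, not structure" point, here is a minimal sketch in which the expected ranges live in their own table, keyed like IAMC data without a YEAR dimension. Everything in it is an assumption for illustration: the `BOUNDS` sets, the project labels, the numbers, and the `out_of_bounds` helper are hypothetical and not part of any existing tool.

```python
# Hypothetical bounds "data": rows keyed like an IAMC table without YEAR,
# holding expected per-capita (minimum, maximum) values. All numbers are made up.
BOUNDS = {
    "project-a": {
        ("Final Energy|Foo|Bar", "World"): (0.0, 150.0),
    },
    "project-b": {
        ("Final Energy|Foo|Bar", "World"): (10.0, 100.0),
    },
}


def out_of_bounds(rows, bounds):
    """Return the rows whose per-capita value falls outside its expected range.

    Rows without a matching (variable, region) entry in `bounds` are not checked.
    """
    flagged = []
    for variable, region, value in rows:
        limits = bounds.get((variable, region))
        if limits is None:
            continue
        lower, upper = limits
        if not lower <= value <= upper:
            flagged.append((variable, region, value))
    return flagged


rows = [("Final Energy|Foo|Bar", "World", 120.0)]
print(out_of_bounds(rows, BOUNDS["project-a"]))  # checked against the wider range
print(out_of_bounds(rows, BOUNDS["project-b"]))  # checked against the narrower range
```

The same variable name is checked against two different sets of bounds, which is why storing the ranges as annotations on the variable itself would be too rigid.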

Dec 07 '23 08:12 khaeru

Thanks for raising this issue, see a few comments below. Let's please try to keep issues and discussions narrow and start new issues where possible.

Cross-reference to legacy variables/regions or other standards: this is already implemented in a simple example, see https://github.com/IAMconsortium/common-definitions/blob/3f530e2a37a649dfc011cbf5e6696d1da2e2cdae/definitions/variable/energy/final-energy.yaml#L115. The value can be accessed from the nomenclature.DataStructureDefinition as

```
dsd.variable["Final Energy|Carbon Removal|Direct Air Capture|Electricity"].navigate
```

If you have specific suggestions for feature-support in nomenclature, e.g. as a "known" attribute with dedicated documentation, please start an issue there.

Validation of values should indeed be handled as a separate use-case and will be implemented similarly to the required-data feature in nomenclature, see here. This PR https://github.com/IAMconsortium/pyam/pull/804 is a step towards support for that feature. The main reason for keeping this separate is that different projects may want to use different reference data or validation thresholds.

Dec 11 '23 08:12 danielhuppmann

I think this is partly fixed by Daniel's commit from yesterday, #PR61.

Feb 16 '24 09:02 FlorianLeblancDr