nomenclature Add failing test cases to illustrate potentially conflicting information

Open phackstock opened this issue 1 year ago • 0 comments

Closes #290.

@danielhuppmann, I took a look at the questions you brought up in #290 and I think we should be good.

The case that you described would be as follows. Given a model mapping:

model: m_a
native_regions: [region_A, region_B]
common_regions:
  - region_C: [region_A, region_B]

with a variable code list:

- Variable A:
    definition: Test variable to be used for computing a max aggregate
    unit: EJ/yr
    region-aggregation:
        - Variable A (max):
            method: max
- Variable A (max):
    unit: EJ/yr

and input data:

IamDataFrame(
    pd.DataFrame(
        [
            ["m_a", "s_a", "region_A", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_B", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_A", "Variable A (max)", "EJ/yr", 2],
            ["m_a", "s_a", "region_B", "Variable A (max)", "EJ/yr", 1],
        ],
        columns=IAMC_IDX + [2020],
    )
)

yields a pyam error for duplicate data:

E       ValueError: Duplicate rows in `data`:
E         model scenario    region          variable   unit  year
E       0   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020

meaning that, as expected, both operations are attempted: the aggregation of Variable A (max) through the region-aggregation attribute in Variable A, as well as the "standard" aggregation from the entry Variable A (max). This case is safe, though, since pyam raises an error. We could specifically protect against it, but I'd say it's fine.
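To make the collision concrete, here is a minimal sketch in plain pandas (not the actual nomenclature or pyam code path) that mimics the two aggregation routes for the data above. The names `via_attribute` and `via_entry` are made up for illustration, and the "standard" aggregation is assumed to be a sum.

```python
import pandas as pd

IDX = ["model", "scenario", "region", "variable", "unit", "year"]

data = pd.DataFrame(
    [
        ["m_a", "s_a", "region_A", "Variable A", "EJ/yr", 2020, 1],
        ["m_a", "s_a", "region_B", "Variable A", "EJ/yr", 2020, 1],
        ["m_a", "s_a", "region_A", "Variable A (max)", "EJ/yr", 2020, 2],
        ["m_a", "s_a", "region_B", "Variable A (max)", "EJ/yr", 2020, 1],
    ],
    columns=IDX + ["value"],
)

# Aggregate over regions, i.e. group by every index column except "region"
group_cols = [c for c in IDX if c != "region"]

# Route 1: `region-aggregation` on "Variable A", max, renamed to "Variable A (max)"
via_attribute = (
    data[data["variable"] == "Variable A"]
    .groupby(group_cols, as_index=False)["value"]
    .max()
    .assign(variable="Variable A (max)", region="region_C")
)

# Route 2: the entry "Variable A (max)" itself, aggregated with the assumed default (sum)
via_entry = (
    data[data["variable"] == "Variable A (max)"]
    .groupby(group_cols, as_index=False)["value"]
    .sum()
    .assign(region="region_C")
)

# Both routes write to the same (model, scenario, region, variable, unit, year) key
combined = pd.concat([via_attribute, via_entry], ignore_index=True)
duplicates = combined[combined.duplicated(subset=IDX, keep=False)]
print(duplicates)
```

Both rows land on region_C / Variable A (max) with different values, which is the situation the duplicate-rows error guards against.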

There might be more cases to consider though.

Only `Variable A (max)`

Take the above data but remove the first two rows, the ones for Variable A. In this case we'd get the following aggregation result:

  model scenario    region          variable   unit  year  value
0   m_a      s_a  region_A  Variable A (max)  EJ/yr  2020      2
1   m_a      s_a  region_B  Variable A (max)  EJ/yr  2020      1
2   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020      3

For region_C, we now get 3, which is the sum, not the max, of region_A and region_B. This is wrong but expected, since there is no method set for the aggregation of Variable A (max). We could safeguard against that relatively easily by enforcing that the aggregation method in the region-aggregation attribute and in the "normal" variable entry must be the same. So:

- Variable A:
    definition: Test variable to be used for computing a max aggregate
    unit: EJ/yr
    region-aggregation:
        - Variable A (max):
            method: max
- Variable A (max):
    unit: EJ/yr
    method: max

in the above example. We could also make it simpler and remove the method attribute from the variable inside the region-aggregation attribute, so that the method information is taken from the main variable directly.
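As a rough idea of what the first safeguard could look like, here is a sketch that works on the parsed code list as plain dicts. `find_method_conflicts` and `DEFAULT_METHOD` are hypothetical names, not existing nomenclature API, and the default of "sum" is an assumption based on the result shown above.

```python
# Code list as it might look after parsing the YAML (plain dicts only)
codelist = {
    "Variable A": {
        "unit": "EJ/yr",
        "region-aggregation": [{"Variable A (max)": {"method": "max"}}],
    },
    "Variable A (max)": {"unit": "EJ/yr"},
}

DEFAULT_METHOD = "sum"  # assumption: method used when none is given


def find_method_conflicts(codelist):
    """Return (source, target, attribute_method, entry_method) per mismatch."""
    conflicts = []
    for source, attrs in codelist.items():
        for item in attrs.get("region-aggregation", []):
            for target, spec in item.items():
                if target not in codelist:
                    continue  # target has no entry of its own, nothing to compare
                attr_method = spec.get("method", DEFAULT_METHOD)
                entry_method = codelist[target].get("method", DEFAULT_METHOD)
                if attr_method != entry_method:
                    conflicts.append((source, target, attr_method, entry_method))
    return conflicts


print(find_method_conflicts(codelist))
```

With the first code list (no method on the Variable A (max) entry) this reports one conflict. Adding `method: max` to that entry, as in the YAML above, makes the list empty.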

Only `Variable A`

This is the straightforward version of the above case but I wanted to mention it. Taking only the first two rows of data gives:

  model scenario    region          variable   unit  year  value
0   m_a      s_a  region_A        Variable A  EJ/yr  2020      1
1   m_a      s_a  region_B        Variable A  EJ/yr  2020      1
2   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020      1

which is correct and what we expect.

`Variable A (max)` in aggregation region

The final case that I could find is this one:

IamDataFrame(
    pd.DataFrame(
        [
            ["m_a", "s_a", "region_A", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_B", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_C", "Variable A (max)", "EJ/yr", 2],
        ],
        columns=IAMC_IDX + [2020],
    )
)

where Variable A (max) exists, but for the common region region_C. In this case we also don't get an error, since provided data always takes precedence over aggregated data, and we get:

  model scenario    region          variable   unit  year  value
0   m_a      s_a  region_A        Variable A  EJ/yr  2020      1
1   m_a      s_a  region_B        Variable A  EJ/yr  2020      1
2   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020      2

with the warning that there is a difference between aggregated and provided data for region_C.
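The precedence rule can be illustrated with a small pandas sketch. This is not how nomenclature implements it; `merge_with_precedence` is a hypothetical helper, and the data is reduced to a (region, variable) index for brevity.

```python
import warnings

import pandas as pd

index_names = ["region", "variable"]

# Data as provided by the model, including a value for the common region
provided = pd.Series(
    [1, 1, 2],
    index=pd.MultiIndex.from_tuples(
        [
            ("region_A", "Variable A"),
            ("region_B", "Variable A"),
            ("region_C", "Variable A (max)"),
        ],
        names=index_names,
    ),
    name="value",
)

# Max of "Variable A" over the native regions, reported as "Variable A (max)"
aggregated = pd.Series(
    [provided.xs("Variable A", level="variable").max()],
    index=pd.MultiIndex.from_tuples(
        [("region_C", "Variable A (max)")], names=index_names
    ),
    name="value",
)


def merge_with_precedence(provided, aggregated):
    """Keep provided values and warn where the aggregate disagrees."""
    overlap = provided.index.intersection(aggregated.index)
    mismatch = (
        provided.reindex(overlap).to_numpy() != aggregated.reindex(overlap).to_numpy()
    )
    differing = overlap[mismatch]
    if len(differing) > 0:
        warnings.warn(f"Provided and aggregated data differ for: {list(differing)}")
    # combine_first takes values from `provided` wherever they exist
    return provided.combine_first(aggregated)


result = merge_with_precedence(provided, aggregated)
print(result)
```

Here the aggregate for region_C would be 1, but the provided value 2 is kept and a warning is emitted.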

Summary

The case you described in #290 would throw a pyam error, and since I've never seen it so far, I'd say we can ignore it.
The only other case we should maybe safeguard against is conflicting information between the variable mentioned in the region-aggregation attribute and the "original" variable entry. One way out of this could be to only allow mentioning the variable name in region-aggregation and to read all other information from the original entry.
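A sketch of that second option, again on plain dicts: region-aggregation lists only target names, and the method is looked up on the target's own entry. `resolve_region_aggregation` is a hypothetical helper and the fallback to "sum" is an assumption.

```python
# Hypothetical code list layout: `region-aggregation` holds names only
codelist = {
    "Variable A": {
        "unit": "EJ/yr",
        "region-aggregation": ["Variable A (max)"],
    },
    "Variable A (max)": {"unit": "EJ/yr", "method": "max"},
}


def resolve_region_aggregation(codelist, variable):
    """Map each aggregation target of `variable` to the method of its own entry."""
    resolved = {}
    for target in codelist[variable].get("region-aggregation", []):
        if target not in codelist:
            raise KeyError(f"'{target}' is referenced by '{variable}' but not defined")
        # assumption: fall back to "sum" when the entry sets no method
        resolved[target] = codelist[target].get("method", "sum")
    return resolved


print(resolve_region_aggregation(codelist, "Variable A"))
```

Since the method lives in exactly one place, the two definitions cannot disagree.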

@danielhuppmann, looking forward to your thoughts. I think I've thought through every case but please let me know if you've spotted an error.

Jan 25 '24 17:01 phackstock
