Regression in require behaviour

Open znicholls opened this issue 8 months ago • 6 comments

require_data is not a drop-in replacement for require_variable. This leads to a regression in behaviour with no easy fix for users.

See script below for demonstration.

Script
import numpy as np
import pandas as pd
import pyam


test = pd.DataFrame(
    np.ones((8, 3)),
    columns=[2010, 2015, 2020],
    index=pd.MultiIndex.from_tuples(
        [
            (
                "scenario_a",
                "model_a",
                "Emissions|CO2|Waste",
                "World",
                "GtC / yr",
            ),
            (
                "scenario_a",
                "model_a",
                "Emissions|CO2|Other",
                "World",
                "GtC / yr",
            ),
            (
                "scenario_b",
                "model_a",
                "Emissions|CO2|Waste",
                "World",
                "GtC / yr",
            ),
            (
                "scenario_b",
                "model_a",
                "Emissions|CO2|Industrial",
                "World",
                "GtC / yr",
            ),
            (
                "scenario_a",
                "model_b",
                "Emissions|CO2|Other",
                "World",
                "GtC / yr",
            ),
            (
                "scenario_a",
                "model_b",
                "Emissions|CO2|Industrial",
                "World",
                "GtC / yr",
            ),
            (
                "scenario_b",
                "model_b",
                "Emissions|CO2|AFOLU",
                "World",
                "GtC / yr",
            ),
            (
                "scenario_b",
                "model_b",
                "Emissions|CO2|Industrial",
                "World",
                "GtC / yr",
            ),
        ],
        names=[
            "scenario",
            "model",
            "variable",
            "region",
            "unit",
        ],
    ),
)
test = pyam.IamDataFrame(test)

if pyam.__version__.startswith("2"):
    matches_requirements = test.require_data(
        variable=["Emissions|CO2|Other", "Emissions|CO2|Waste"], exclude_on_fail=True
    )
    print("3 scenarios fail (the ones that don't have BOTH requirement)")
    print(test.exclude)
    assert test.exclude.sum() == 1

else:
    matches_requirements = test.require_variable(
        variable=["Emissions|CO2|Other", "Emissions|CO2|Waste"], exclude_on_fail=True
    )
    print("Only 1 scenario fails (the one that doesn't have EITHER requirement)")
    print(test.meta)
    assert test.meta["exclude"].sum() == 1
Behaviour with pyam-iamc 2.0 and require_data
% pip list | grep pyam-iamc && python scratch.py
pyam-iamc               2.0.0

3 scenarios fail (the ones that don't have BOTH requirement)
model    scenario  
model_a  scenario_a    False
         scenario_b     True
model_b  scenario_a     True
         scenario_b     True
dtype: bool
Traceback (most recent call last):
  File ".../scratch.py", line 85, in <module>
    assert test.exclude.sum() == 1
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
Behaviour with pyam-iamc 1.9 and require_variable
% pip list | grep pyam-iamc && python scratch.py
pyam-iamc          1.9.0

Only 1 scenario fails (the one that doesn't have EITHER requirement)
                    exclude
model   scenario           
model_a scenario_a    False
        scenario_b    False
model_b scenario_a    False
        scenario_b     True

I think the basic difference is that require_variable applied an OR requirement (a scenario passed if any of the listed variables was present), whereas require_data applies an AND requirement (a scenario passes only if all of the listed variables are present).
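
To make the OR/AND difference concrete, here is a minimal pure-Python sketch that reproduces both results from the script's data. It does not call pyam; the `reported` mapping and the `fails_all`/`fails_any` names are illustrative assumptions, not part of either API.

```python
# Variables reported by each (model, scenario) in the script above.
reported = {
    ("model_a", "scenario_a"): {"Emissions|CO2|Waste", "Emissions|CO2|Other"},
    ("model_a", "scenario_b"): {"Emissions|CO2|Waste", "Emissions|CO2|Industrial"},
    ("model_b", "scenario_a"): {"Emissions|CO2|Other", "Emissions|CO2|Industrial"},
    ("model_b", "scenario_b"): {"Emissions|CO2|AFOLU", "Emissions|CO2|Industrial"},
}
required = ["Emissions|CO2|Other", "Emissions|CO2|Waste"]

# AND semantics (what require_data appears to do): fail unless every variable is present.
fails_all = {
    key: not all(variable in found for variable in required)
    for key, found in reported.items()
}

# OR semantics (what require_variable appears to have done): fail only if none is present.
fails_any = {
    key: not any(variable in found for variable in required)
    for key, found in reported.items()
}

print(sum(fails_all.values()))  # 3, as in the pyam-iamc 2.0 output
print(sum(fails_any.values()))  # 1, as in the pyam-iamc 1.9 output
```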

znicholls avatar Oct 24 '23 17:10 znicholls

I assume that I did think through these changes a while back, but can't recollect my thoughts right now...

But from a first-principles point of view, I do think that checking all items in a list is more intuitive than any for a "requirement".

Question to me is what your use case is? Are you trying to ensure that at least one of "Waste" or "Other" is present?

danielhuppmann avatar Oct 25 '23 14:10 danielhuppmann

> But from a first-principles point of view, I do think that checking all items in a list is more intuitive than any for a "requirement".

Me too

> Question to me is what your use case is? Are you trying to ensure that at least one of "Waste" or "Other" is present?

Trying to get this to behave: https://github.com/iiasa/climate-assessment/pull/47. @jkikstra wrote the code that uses this and, I assume, was trying to check for one or both of them being there, but I don't actually know (see this comment for the function which calls it: https://github.com/iiasa/climate-assessment/pull/47#issuecomment-1777718277).

znicholls avatar Oct 25 '23 15:10 znicholls

cc @phackstock

znicholls avatar Oct 25 '23 15:10 znicholls

Still not sure what the actual use case is from that comment, but I guess we could add a kwarg how={"all", "any"}, default 'all', inspired by pandas.DataFrame.dropna().

As for implementation, I guess if any data is present after applying the filters (= kwargs of require_data()), then the "any"-requirement is satisfied for the filters.
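
A rough sketch of that idea on a plain long-format pandas DataFrame, not on an IamDataFrame: filter first, then treat every scenario that still has rows as satisfying the "any" requirement. The helper `failing_scenarios` and its `how` argument are hypothetical and only illustrate the proposal; this is not pyam code.

```python
import pandas as pd

# Long-format stand-in for the data in the script: one row per reported variable.
rows = [
    ("model_a", "scenario_a", "Emissions|CO2|Waste"),
    ("model_a", "scenario_a", "Emissions|CO2|Other"),
    ("model_a", "scenario_b", "Emissions|CO2|Waste"),
    ("model_a", "scenario_b", "Emissions|CO2|Industrial"),
    ("model_b", "scenario_a", "Emissions|CO2|Other"),
    ("model_b", "scenario_a", "Emissions|CO2|Industrial"),
    ("model_b", "scenario_b", "Emissions|CO2|AFOLU"),
    ("model_b", "scenario_b", "Emissions|CO2|Industrial"),
]
data = pd.DataFrame(rows, columns=["model", "scenario", "variable"])
KEYS = ["model", "scenario"]


def scenarios_with(data, variables):
    """Index of (model, scenario) pairs that have any row left after filtering."""
    subset = data[data["variable"].isin(variables)]
    return pd.MultiIndex.from_frame(subset[KEYS].drop_duplicates())


def failing_scenarios(data, variables, how="all"):
    """Hypothetical helper (not part of pyam): scenarios that fail the requirement."""
    every = pd.MultiIndex.from_frame(data[KEYS].drop_duplicates())
    if how == "any":
        # Any data left after applying the filter satisfies the requirement.
        passing = scenarios_with(data, variables)
    elif how == "all":
        # Each variable must be present on its own, so intersect per-variable results.
        passing = every
        for variable in variables:
            passing = passing.intersection(scenarios_with(data, [variable]))
    else:
        raise ValueError(f"how must be 'all' or 'any', got {how!r}")
    return every.difference(passing)


required = ["Emissions|CO2|Other", "Emissions|CO2|Waste"]
print(list(failing_scenarios(data, required, how="all")))
print(list(failing_scenarios(data, required, how="any")))
```

With the data from the script, how="all" flags three scenarios and how="any" flags only model_b / scenario_b, matching the 2.0 and 1.9 outputs respectively.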

danielhuppmann avatar Oct 25 '23 15:10 danielhuppmann

Here's the line where it's used: https://github.com/iiasa/climate-assessment/blob/485f3d24fc646ad8d77c65ac5e787a27dc79db04/src/climate_assessment/checks.py#L788

Up to @jkikstra and @phackstock whether it's easier to add the feature back into pyam or just hack a workaround into climate-assessment

znicholls avatar Oct 26 '23 06:10 znicholls

No strong feelings from my side either way. I would say in the interest of time it's better to build a workaround in climate-assessment.

phackstock avatar Oct 30 '23 16:10 phackstock