
df.aggregate_region() when inconsistent index (e.g. years)

byersiiasa opened this issue 4 years ago • 2 comments

There are instances of weighted aggregation across regions where the years of the variable and the weight variable are not aligned, or where one variable covers more years than the other. One then gets the error message:

  File "c:\github\pyam\pyam\aggregation.py", line 216, in _agg_weight
    raise ValueError("Inconsistent index between variable and weight!")

ValueError: Inconsistent index between variable and weight!

which comes from here:

    if not data.droplevel(["variable", "unit"]).index.equals(weight.index):
        raise ValueError("Inconsistent index between variable and weight!")
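
For context, here is a minimal sketch of how this situation can arise; the data, model/scenario/region names, and variable names below are made up, and it assumes a pyam version where aggregate_region() supports the weight argument:

    import numpy as np
    import pandas as pd
    import pyam

    # hypothetical IAMC-format data: the weight ("Population") is reported
    # for 2010-2030, but the variable ("Emissions|CO2") only for 2020-2030
    data = pd.DataFrame(
        [
            ["model_a", "scen_a", "reg_a", "Emissions|CO2", "Mt CO2", np.nan, 5, 6],
            ["model_a", "scen_a", "reg_b", "Emissions|CO2", "Mt CO2", np.nan, 3, 4],
            ["model_a", "scen_a", "reg_a", "Population", "million", 10, 11, 12],
            ["model_a", "scen_a", "reg_b", "Population", "million", 20, 21, 22],
        ],
        columns=["model", "scenario", "region", "variable", "unit", 2010, 2020, 2030],
    )
    df = pyam.IamDataFrame(data)  # NaN entries are dropped, so the indexes differ

    # raises ValueError: Inconsistent index between variable and weight!
    df.aggregate_region("Emissions|CO2", weight="Population")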

It would be nice if there were a way to improve the logic here - for example, the minimum requirement could be not that the indexes are equal, but that the index of the variable A being weighted is fully contained in the index of the weight variable B. I'm rusty on set notation, but basically that index A is a subset of index B?
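
For illustration, a rough sketch of what such a relaxed check might look like, using the data and weight objects from the snippet above; this is not the actual pyam implementation, just an assumption of how the subset idea could be expressed with pandas:

    # relaxed requirement (sketch): the variable index only needs to be
    # contained in the weight index, not equal to it
    data_index = data.droplevel(["variable", "unit"]).index
    if not data_index.isin(weight.index).all():
        raise ValueError("Variable index is not a subset of the weight index!")
    # align the weight to the (possibly smaller) variable index before aggregating
    weight = weight.reindex(data_index)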

Any quick suggestions on how this could be done, and/or arguments against?

byersiiasa • Jul 06 '21 15:07

I'm wondering what the reason could be for a variable being reported for a smaller set of years than the weight that is used to aggregate it...?

I don't think that this behavior is a bug, just a very conservative approach to ensure that the data makes sense before performing any operations.

Generally speaking, you would have to make an assumption about how to treat missing values, and create some logic for when/how to infer them - only for missing years, or also for missing regions? How would you distinguish a sensible "hole" from a reporting error that is being propagated through the processing workflow?

As a preferred solution, I would not change anything in the pyam function itself, but fix the data before calling the function, e.g., if the "variable" exists for a smaller set of years than the "weight":

    # years for which the (placeholder) "variable" is reported
    var_years = df.filter(variable="variable").year
    # restrict both timeseries to those years so their indexes align
    consistent_df = df.filter(variable=["variable", "weight"], year=var_years)
    # weighted aggregation is provided by aggregate_region()
    agg_data = consistent_df.aggregate_region("variable", weight="weight")
    df.append(agg_data, inplace=True)

danielhuppmann • Jul 06 '21 16:07

In the real world! haha, in AR6 the REMIND team (of course) is reporting population for the full century, and/or including some reference values from 2005, but some of their other variables only start in 2015 and only go to 2050...

Thanks for your solution though - given that the weighted aggregation needs to be looped over every variable-weight pair anyway, this should be fine.
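
For example, a hedged sketch of such a loop, following the pattern above; the variable-weight pairs are hypothetical and would need to match the actual reporting template:

    # hypothetical variable-weight pairs to aggregate
    pairs = [
        ("Price|Carbon", "Emissions|CO2"),
        ("Temperature|Global Mean", "Population"),
    ]

    for variable, weight in pairs:
        # restrict both timeseries to the years where the variable is reported
        var_years = df.filter(variable=variable).year
        consistent_df = df.filter(variable=[variable, weight], year=var_years)
        agg_data = consistent_df.aggregate_region(variable, weight=weight)
        df.append(agg_data, inplace=True)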

byersiiasa • Jul 07 '21 09:07