pyam
How to aggregate by level?
Let's assume I have some data with four variable levels:
import pandas as pd
import pyam

df = pyam.IamDataFrame(pd.DataFrame([
['IMG', 'a_scen', 'World', 'Emissions|CO2|Energy|Oil', 'Mt CO2/yr', 2, 3.2, 2.0, 1.8],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Energy|Gas', 'Mt CO2/yr', 1.3, 1.6, 1.0, 0.7],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Energy|BECCS', 'Mt CO2/yr', 0.0, 0.4, -0.4, 0.3],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Foo|Oil', 'Mt CO2/yr', 2, 3.2, 2.0, 1.8],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Foo|Gas', 'Mt CO2/yr', 1.3, 1.6, 1.0, 0.7],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Foo|BECCS', 'Mt CO2/yr', 0.0, 0.4, -0.4, 0.3],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Cars', 'Mt CO2/yr', 1.6, 3.8, 3.0, 2.5],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Tar', 'Mt CO2/yr', 0.3, 0.35, 0.35, 0.33],
['IMG', 'a_scen', 'World', 'Emissions|CO2|Agg', 'Mt CO2/yr', 0.5, -0.1, -0.5, -0.7],
['IMG', 'a_scen', 'World', 'Emissions|CO2|LUC', 'Mt CO2/yr', -0.3, -0.6, -1.2, -1.0]
],
columns=['model', 'scenario', 'region', 'variable', 'unit', 2005, 2010, 2015, 2020],
))
df.timeseries()
How can I aggregate the last level (level=3)?
The result should then contain the following variables:
'Emissions|CO2|Energy'
'Emissions|CO2|Foo'
'Emissions|CO2|Cars'
'Emissions|CO2|Tar'
'Emissions|CO2|Agg'
'Emissions|CO2|LUC'
And if I aggregated to level 1, only
'Emissions|CO2'
would remain.
The answer to the first part of the question is
df.aggregate("Emissions|CO2|Energy")
There is also a tutorial on this, see https://pyam-iamc.readthedocs.io/en/stable/tutorials/aggregating_downscaling_consistency.html
You can also specify the components explicitly, and you can use append=True to append the aggregated data directly to df. There is also a recursive argument (though this has limitations and only works with summation, see the docs).
Thank you for the quick reply. My example data was not good enough, so I adapted it to include more variables. I do not want to explicitly name the variables that should be aggregated; instead, they should be determined by their level, along the lines of:
df.aggregate(level=2)
a) If there is no direct way, a possible workaround might be to first determine the list of variables at a given level and then pass it as a list:
level_1_variables = determine_variables_for_level(1)
df.aggregate(level_1_variables)
b) Specifying components seems useful only when I want to aggregate an explicit list of variables and give the result a new name.
c) Another strategy might be to first aggregate/group the pandas dataframe before converting it to pyam.
=> Is there already a method to aggregate by level or would I have to implement it on my own?
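Strategy (c) could look roughly like the following in plain pandas. This is only a sketch: it assumes that every row is a leaf variable and that summation is the right aggregation, and the helper truncate_variable is a hypothetical name, not part of pyam or pandas.

```python
import pandas as pd


def truncate_variable(name: str, level: int) -> str:
    """Keep the first `level + 1` pipe-separated components (hypothetical helper)."""
    return "|".join(name.split("|")[: level + 1])


# Reduced version of the example data (two years only)
columns = ["model", "scenario", "region", "variable", "unit", 2005, 2010]
data = pd.DataFrame(
    [
        ["IMG", "a_scen", "World", "Emissions|CO2|Energy|Oil", "Mt CO2/yr", 2.0, 3.2],
        ["IMG", "a_scen", "World", "Emissions|CO2|Energy|Gas", "Mt CO2/yr", 1.3, 1.6],
        ["IMG", "a_scen", "World", "Emissions|CO2|Cars", "Mt CO2/yr", 1.6, 3.8],
    ],
    columns=columns,
)

level = 2
# Rename every variable to its ancestor at the target level ...
data["variable"] = data["variable"].map(lambda v: truncate_variable(v, level))
# ... and sum all rows that now share the same identifier columns
index_cols = ["model", "scenario", "region", "variable", "unit"]
aggregated = data.groupby(index_cols, as_index=False).sum()
```

The resulting frame has one row per variable at the target level and can then be passed to pyam.IamDataFrame as in the original example.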
Got it, maybe something like the following:
var_list = df.filter(level=x).variable
df.aggregate(var_list)
Or if performance is critical (or you are working with a large dataset where you don't want to create a large copy)...
var_list = [v for v in df.variable if pyam.find_depth(v, level=x)]
df.aggregate(var_list)
See the docs of the variable-string utils here.
The above code would require that the variables are already explicitly mentioned on that level. I don't have an entry 'Emissions|CO2|Energy' in my original example data.
However, it includes some other entries that are already aggregated, e.g. 'Emissions|CO2|Cars'.
The following script seems to work:
level = 2
# ancestors at the target level of every variable at that level or deeper
full_var_list = {pyam.reduce_hierarchy(v, level) for v in df.variable if pyam.find_depth(v, level=f"{level}+")}
# variables that are already reported at the target level
already_aggregated_var_list = [v for v in df.variable if pyam.find_depth(v, level=level)]
var_list_to_aggregate = list(full_var_list - set(already_aggregated_var_list))
already_aggregated_df = df.filter(variable=already_aggregated_var_list)
# aggregate the missing parents, then re-attach the variables that already existed
(
    df.filter(variable=already_aggregated_var_list, keep=False)
    .aggregate(variable=var_list_to_aggregate)
    .append(already_aggregated_df)
    .timeseries()
)
If there is a more elegant way to do this, please let me know.
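The selection step of the script can be illustrated without pyam. In this sketch, depth and truncate are hypothetical stand-ins that are only assumed to behave like pyam's find_depth and reduce_hierarchy (counting and cutting at the pipe separators); they are not the library functions.

```python
def depth(variable: str) -> int:
    """Number of pipe separators (hypothetical stand-in for a depth lookup)."""
    return variable.count("|")


def truncate(variable: str, level: int) -> str:
    """Cut a variable name down to its ancestor at `level` (hypothetical stand-in)."""
    return "|".join(variable.split("|")[: level + 1])


variables = [
    "Emissions|CO2|Energy|Oil",
    "Emissions|CO2|Energy|Gas",
    "Emissions|CO2|Foo|Oil",
    "Emissions|CO2|Cars",
    "Emissions|CO2|LUC",
]

level = 2
# Variables that are already reported at the target level
present = {v for v in variables if depth(v) == level}
# Every name at the target level that has data at that level or below
parents = {truncate(v, level) for v in variables if depth(v) >= level}
# Only the parents without a reported value still have to be computed
to_aggregate = sorted(parents - present)
```

Here to_aggregate contains only the Energy and Foo parents, while Cars and LUC are left as reported.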
Looks correct, though you may want to look at the recursive aggregation option again, which will be more performant on large datasets because it operates directly on the internal pandas.Series _data object.
In the example above, using the following would work:
df.aggregate("Emissions|CO2", recursive=True, append=True)
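As a rough picture of what a bottom-up summation over the hierarchy computes, here is a plain-Python sketch using the 2005 values from the example. It is not pyam's implementation; the function aggregate_bottom_up and its behaviour for already-reported intermediate values (they are kept unchanged) are assumptions made for illustration.

```python
from collections import defaultdict

values_2005 = {
    "Emissions|CO2|Energy|Oil": 2.0,
    "Emissions|CO2|Energy|Gas": 1.3,
    "Emissions|CO2|Energy|BECCS": 0.0,
    "Emissions|CO2|Foo|Oil": 2.0,
    "Emissions|CO2|Foo|Gas": 1.3,
    "Emissions|CO2|Foo|BECCS": 0.0,
    "Emissions|CO2|Cars": 1.6,
    "Emissions|CO2|Tar": 0.3,
    "Emissions|CO2|Agg": 0.5,
    "Emissions|CO2|LUC": -0.3,
}


def aggregate_bottom_up(values, root):
    """Sum children into their parents, deepest level first, up to `root`."""
    result = dict(values)
    root_depth = root.count("|")
    max_depth = max(name.count("|") for name in result)
    for d in range(max_depth, root_depth, -1):
        sums = defaultdict(float)
        for name, value in result.items():
            if name.count("|") == d and name.startswith(root + "|"):
                sums[name.rsplit("|", 1)[0]] += value
        for parent, total in sums.items():
            # Keep a value that was already reported for this parent
            result.setdefault(parent, total)
    return result


totals = aggregate_bottom_up(values_2005, "Emissions|CO2")
```

The intermediate Energy and Foo totals are computed first and then feed into the total for Emissions|CO2.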
In general, I would caution against too much automation of your workflow. There may be variables in your dataset where simple summation is not appropriate, e.g., efficiency rates or prices. Even if you don't report these data now, you may add them later and forget that they need to be treated differently.
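A small arithmetic sketch of why summation breaks down for a quantity like a price. All numbers, units, and names here are made up for illustration, and weighting by quantity is just one plausible alternative, not a recommendation from the pyam docs.

```python
# Hypothetical sub-sector data: name -> (price in USD/GJ, quantity in GJ)
components = {"Oil": (10.0, 30.0), "Gas": (6.0, 10.0)}

# Summing the prices gives a number without a sensible interpretation
naive_sum = sum(price for price, _ in components.values())

# A quantity-weighted mean is one way to get a meaningful aggregate price
total_quantity = sum(quantity for _, quantity in components.values())
weighted_mean = (
    sum(price * quantity for price, quantity in components.values()) / total_quantity
)
```

The naive sum exceeds every individual price, whereas the weighted mean stays between the lowest and the highest component price.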
It may be easier and safer in the long run to determine the top-level-variables via inspection "by hand". Our brand-new nomenclature-iamc package is intended to manage lists of variables, see https://nomenclature-iamc.readthedocs.io/.
Thank you for your advice. The design of the variable structure is still in progress and I hope that our colleagues force it to be strictly hierarchical and design it in a way that allows easy aggregation and validation. However, maybe that is indeed unrealistic and having some custom column "aggregation_mode" or some external kind of variable manager / variable classification would be helpful.
In the nomenclature package, we defined a syntax for a "variable manager" as a list of nested dictionaries: the key of each outer dictionary is the variable name and its value is again a dictionary, in which the attribute skip-region-aggregation: true indicates that region aggregation should be skipped as part of the processing.
See this unit-test-data here for an example.
Other attributes of the dictionary can be passed to the pyam aggregate_region method, see this unit-test-data here.
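Expressed as plain Python objects, the structure described above might look like the following sketch. Only the skip-region-aggregation attribute is taken from the description; the variable names and the unit attribute are illustrative assumptions, and the snippet does not use the nomenclature package itself.

```python
# A list of single-key dictionaries: variable name -> dictionary of attributes
variable_definitions = [
    {"Emissions|CO2": {"unit": "Mt CO2/yr"}},
    {"Price|Carbon": {"unit": "USD/t CO2", "skip-region-aggregation": True}},
]

# Flatten the list into one mapping for easier lookup
definitions = {
    name: attributes
    for entry in variable_definitions
    for name, attributes in entry.items()
}

# Variables without the flag (or with it set to False) would be region-aggregated
to_aggregate = [
    name
    for name, attributes in definitions.items()
    if not attributes.get("skip-region-aggregation", False)
]
```

With these hypothetical definitions, only Emissions|CO2 would be passed on to region aggregation.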
[We are still working on a full-fledged documentation...]