intake-esm
Add a DerivedCatalog object to deal with derived variables
Similar to the development in esds-funnel, we think it would be useful to be able to add "derived variables" to a catalog, accessible via an API similar to this:
DerivedCatalog.add_variable(intake_esm_catalog, variable='TEMP_100m', dependent_variable=['TEMP'])
The result (after also adding an SST variable) would be:

I took a stab at this. My current approach is similar to Matt's in that I'm keeping track of the derived variables' info in a registry attached to the intake_esm catalog object via a .derivedcat attribute.
Initially, this derivedcat registry is empty:
In [1]: import intake, intake_esm
In [2]: cat = intake.open_esm_datastore("./tests/sample-collections/catalog-dict-records.json")
In [4]: cat.unique()
Out[4]:
component [atm]
frequency [daily]
experiment [20C]
variable [FLNS, FLNSC, FLUT, FSNS, FSNSC]
path [s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS...
derived_variable []
dtype: object
The user can register their derivation function via a decorator:
In [5]: @intake_esm.register_derived_variable(varname="FOO", required=[{'variable': "TEMP", "component": "ocn"}])
   ...: def func(ds):
   ...:     return ds.TEMP + 1
   ...:
The user should be able to validate the derived catalog at any time via:
In [9]: cat.validate_derivedcat()
Looks good!
This validation method looks like:
for key, entry in self.derivedcat.items():
    for req in entry.required:
        for col in req:
            if col not in self.esmcat.df.columns:
                raise ValueError(
                    f"{key} requires {col} to be in the ESM catalog columns: {self.esmcat.df.columns.tolist()}"
                )
        if self.esmcat.aggregation_control.variable_column_name not in req.keys():
            raise ValueError(
                f"Variable derivation requires *{self.esmcat.aggregation_control.variable_column_name}* to be in the dictionary of requirements: {req}"
            )
else:
    print('Looks good!')
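To make the two failure modes easier to try out, here is a standalone sketch of the same checks written against plain Python objects. The function name `validate_requirements`, its arguments, and the sample column list are made up for illustration; the real method works on self.derivedcat and self.esmcat as shown above.

```python
def validate_requirements(registry, columns, variable_column_name="variable"):
    """Check every requirement dict in a toy registry against catalog columns.

    `registry` maps a derived variable name to its list of requirement dicts.
    """
    for key, required in registry.items():
        for req in required:
            # Every key used in a requirement must be a catalog column.
            for col in req:
                if col not in columns:
                    raise ValueError(
                        f"{key} requires {col} to be in the ESM catalog columns: {list(columns)}"
                    )
            # Every requirement must say which variable it depends on.
            if variable_column_name not in req:
                raise ValueError(
                    f"Variable derivation requires *{variable_column_name}* "
                    f"to be in the dictionary of requirements: {req}"
                )
    return True


columns = ["component", "frequency", "experiment", "variable", "path"]

# A well-formed requirement passes.
validate_requirements({"FOO": [{"variable": "TEMP", "component": "ocn"}]}, columns)

# A requirement that names an unknown column is rejected.
try:
    validate_requirements({"BAR": [{"variable": "TEMP", "realm": "ocn"}]}, columns)
except ValueError as err:
    print(err)
```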
Operations like nunique() and unique() merge the information from both the main (base) catalog and the derived variable registry:
In [6]: cat.unique()
Out[6]:
component [atm]
frequency [daily]
experiment [20C]
variable [FLNS, FLNSC, FLUT, FSNS, FSNSC]
path [s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS...
derived_variable [FOO]
dtype: object
In [8]: cat.derivedcat
Out[8]: {'FOO': DerivedVariable(func=<function func at 0x1072dc310>, required=[{'variable': 'TEMP', 'component': 'ocn'}])}
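Roughly, the merge behind unique() can be pictured like this. This is only a sketch on plain Python data, not the code in the branch: `unique_with_derived`, the `rows` list, and the toy registry are hypothetical names, and the real catalog works on a pandas DataFrame rather than a list of dicts.

```python
def unique_with_derived(rows, registry):
    """Collect unique values per column, then append the derived variable names."""
    uniques = {}
    for row in rows:
        for col, value in row.items():
            values = uniques.setdefault(col, [])
            if value not in values:
                values.append(value)  # keep first-seen order
    # The registry contributes one extra entry listing the derived variables.
    uniques["derived_variable"] = sorted(registry)
    return uniques


# Toy stand-ins for the base catalog and the derived variable registry.
rows = [
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "FLNS"},
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "FLUT"},
]
registry = {"FOO": [{"variable": "TEMP", "component": "ocn"}]}

summary = unique_with_derived(rows, registry)
```

With an empty registry the derived_variable entry is simply an empty list, which matches the initial state shown earlier.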
- Is this API good enough?
- How should .search() work?
  - Should we return subsets of the main (base) catalog and the derived catalog, or
  - should we keep the derived catalog intact, i.e. return the subset of the base catalog + everything in the derived catalog?
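To make the two .search() options concrete, here is a toy comparison on plain Python data. It is one possible reading of the question, not a proposal for the final behavior: both function names are hypothetical, and "subsetting the derived catalog" is interpreted here as keeping only the derived variables whose names appear in the query.

```python
def search_subset_both(rows, registry, variables):
    """Option 1: subset the base catalog and the derived registry alike."""
    base = [row for row in rows if row["variable"] in variables]
    derived = {name: req for name, req in registry.items() if name in variables}
    return base, derived


def search_keep_derived(rows, registry, variables):
    """Option 2: subset the base catalog only; the registry is returned whole."""
    base = [row for row in rows if row["variable"] in variables]
    return base, dict(registry)


# Toy stand-ins for the base catalog and the derived variable registry.
rows = [{"variable": "FLNS"}, {"variable": "FLUT"}]
registry = {
    "FOO": [{"variable": "TEMP", "component": "ocn"}],
    "BAR": [{"variable": "FLUT", "component": "atm"}],
}

query = ["FLNS", "FOO"]
option_1 = search_subset_both(rows, registry, query)
option_2 = search_keep_derived(rows, registry, query)
```

Under option 1 only FOO survives the search. Under option 2 both FOO and BAR remain registered, even though BAR's FLUT dependency is no longer in the returned base subset, which is the kind of mismatch validate_derivedcat() would need to account for.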
Cc @matt-long, @kmpaul, @mgrover1