intake-esm

Add a DerivedCatalog object to deal with derived variables

Open mgrover1 opened this issue 4 years ago • 2 comments

Similar to the development in esds-funnel, we think it would be useful to be able to add "derived variables" to a catalog, accessible via an API similar to this:

DerivedCatalog.add_variable(intake_esm_catalog, variable='TEMP_100m', dependent_variable=['TEMP'])

mgrover1, Aug 17 '21 22:08

The result (after adding an SST variable) would be: [screenshot: Screen Shot 2021-08-17 at 4 32 15 PM]

mgrover1, Aug 17 '21 22:08

I took a stab at this. My current approach is similar to Matt's in that I'm keeping track of each derived variable's info in a registry attached to the intake_esm catalog object via a .derivedcat attribute.

Initially, this derivedcat registry is empty:

In [1]: import intake, intake_esm

In [2]: cat = intake.open_esm_datastore("./tests/sample-collections/catalog-dict-records.json")

In [4]: cat.unique()
Out[4]: 
component                                                       [atm]
frequency                                                     [daily]
experiment                                                      [20C]
variable                             [FLNS, FLNSC, FLUT, FSNS, FSNSC]
path                [s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS...
derived_variable                                                   []
dtype: object

The user can register their derivation function via a decorator.

In [5]: @intake_esm.register_derived_variable(varname="FOO", required=[{'variable': "TEMP", "component": "ocn"}])
   ...: def func(ds):
   ...:     return ds.TEMP + 1
   ...: 
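For readers unfamiliar with the pattern, here is a minimal, standalone sketch of how a decorator-based registry like this could work. It is not the intake-esm implementation: the module-level `REGISTRY` dict is a hypothetical stand-in for `cat.derivedcat`, and a plain dict stands in for the dataset passed to the derivation function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class DerivedVariable:
    func: Callable[[Any], Any]        # the user's derivation function
    required: List[Dict[str, str]]    # catalog entries the function depends on


# Hypothetical module-level registry (in the proposal it lives on `cat.derivedcat`).
REGISTRY: Dict[str, DerivedVariable] = {}


def register_derived_variable(varname: str, required: List[Dict[str, str]]):
    """Return a decorator that stores the decorated function under `varname`."""
    def decorator(func):
        REGISTRY[varname] = DerivedVariable(func=func, required=required)
        return func  # hand the function back unchanged
    return decorator


@register_derived_variable(varname="FOO", required=[{"variable": "TEMP", "component": "ocn"}])
def func(ds):
    # `ds` is a plain dict here, standing in for an xarray.Dataset.
    return ds["TEMP"] + 1


print(sorted(REGISTRY))
print(func({"TEMP": 20.0}))
```

Returning `func` unchanged from the decorator means the function stays usable on its own, while the registry keeps the metadata needed to apply it later.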

The user should be able to validate the derived catalog whenever they want via:

In [9]: cat.validate_derivedcat()
Looks good!

This validation method looks like:

        # Check every requirement of every registered derived variable.
        for key, entry in self.derivedcat.items():
            for req in entry.required:
                # Every key of a requirement must be a column of the base catalog.
                for col in req:
                    if col not in self.esmcat.df.columns:
                        raise ValueError(
                            f"{key} requires {col} to be in the ESM catalog columns: {self.esmcat.df.columns.tolist()}"
                        )
                # Every requirement must name the variable it depends on.
                if self.esmcat.aggregation_control.variable_column_name not in req:
                    raise ValueError(
                        f"Variable derivation requires *{self.esmcat.aggregation_control.variable_column_name}* to be in the dictionary of requirements: {req}"
                    )
        # Reached only if no requirement raised.
        print('Looks good!')
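
To try the check outside of a catalog object, here is a rough standalone version of the same logic. It assumes the registry is a plain dict mapping each derived variable name to its list of requirement dicts, and that the catalog columns are given as a list of names; the function signature is made up for illustration. Note that, like the method above, it only checks column names, not whether the requested values (e.g. `TEMP`, `ocn`) actually occur in the catalog.

```python
def validate_derivedcat(derivedcat, columns, variable_column_name="variable"):
    """Raise ValueError if any requirement refers to an unknown column
    or does not name the variable it depends on."""
    for key, requirements in derivedcat.items():
        for req in requirements:
            for col in req:
                if col not in columns:
                    raise ValueError(
                        f"{key} requires {col} to be in the ESM catalog columns: {columns}"
                    )
            if variable_column_name not in req:
                raise ValueError(
                    f"Variable derivation requires *{variable_column_name}* "
                    f"to be in the dictionary of requirements: {req}"
                )
    return True


columns = ["component", "frequency", "experiment", "variable", "path"]
good = {"FOO": [{"variable": "TEMP", "component": "ocn"}]}
bad = {"BAR": [{"variable": "TEMP", "realm": "ocn"}]}  # "realm" is not a column

print(validate_derivedcat(good, columns))
try:
    validate_derivedcat(bad, columns)
except ValueError as err:
    print(err)
```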

Operations like nunique() and unique() are able to merge the information from both the main (base) catalog and the derived variable registry:

In [6]: cat.unique()
Out[6]: 
component                                                       [atm]
frequency                                                     [daily]
experiment                                                      [20C]
variable                             [FLNS, FLNSC, FLUT, FSNS, FSNSC]
path                [s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS...
derived_variable                                                [FOO]
dtype: object
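
One way such a merge could be done, sketched with plain Python structures rather than the pandas objects intake-esm uses: collect the unique values per column from the base catalog rows, then append a `derived_variable` entry built from the registry keys. The `unique` helper and the row format below are assumptions for illustration only.

```python
def unique(rows, derivedcat):
    """Unique values per base-catalog column, plus the registered derived variables."""
    result = {}
    for row in rows:
        for col, value in row.items():
            values = result.setdefault(col, [])
            if value not in values:
                values.append(value)
    # The registry contributes one extra "column" listing its keys.
    result["derived_variable"] = list(derivedcat)
    return result


rows = [
    {"component": "atm", "frequency": "daily", "variable": "FLNS"},
    {"component": "atm", "frequency": "daily", "variable": "FLUT"},
]

print(unique(rows, {}))
print(unique(rows, {"FOO": [{"variable": "TEMP", "component": "ocn"}]}))
```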

In [8]: cat.derivedcat
Out[8]: {'FOO': DerivedVariable(func=<function func at 0x1072dc310>, required=[{'variable': 'TEMP', 'component': 'ocn'}])}

  • Is this API good enough?
  • How should .search() work?
    1. Should we return subsets of both the main (base) catalog and the derived catalog, or
    2. should we keep the derived catalog intact, i.e. return the subset of the base catalog plus everything in the derived catalog?
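
To make the two options concrete, here is a small sketch of both behaviours on plain Python data. Everything in it is hypothetical: the function names, the list-of-dicts catalog, and in particular the rule used for option 1 (a derived variable survives only if each of its requirements is matched by at least one row of the subset), which is just one possible interpretation.

```python
def matches(row, query):
    """True if the row has the requested value in every queried column."""
    return all(row.get(col) == value for col, value in query.items())


def search_base(rows, **query):
    return [row for row in rows if matches(row, query)]


def search_subset_both(rows, derivedcat, **query):
    """Option 1: subset the base catalog AND drop derived variables
    whose requirements can no longer be met by the subset."""
    subset = search_base(rows, **query)
    kept = {
        name: reqs
        for name, reqs in derivedcat.items()
        if all(any(matches(row, req) for row in subset) for req in reqs)
    }
    return subset, kept


def search_keep_derived(rows, derivedcat, **query):
    """Option 2: subset the base catalog, keep the derived catalog intact."""
    return search_base(rows, **query), dict(derivedcat)


rows = [
    {"component": "atm", "variable": "FLNS"},
    {"component": "ocn", "variable": "TEMP"},
]
derivedcat = {"FOO": [{"variable": "TEMP", "component": "ocn"}]}

print(search_subset_both(rows, derivedcat, component="atm"))
print(search_keep_derived(rows, derivedcat, component="atm"))
```

With option 1 a search for `component="atm"` silently drops `FOO`, while option 2 keeps it registered even though its inputs are no longer in the subset, so the failure would only surface when the derived variable is requested.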

Cc @matt-long, @kmpaul, @mgrover1

andersy005, Oct 13 '21 19:10