intake-esm

User-defined merge operation

Open rabernat opened this issue 6 years ago • 3 comments

One issue that has emerged from the hackathon is the desire to apply user-defined preprocessing and/or a customized merging function during the dataset loading process.

@jbusecke has a great example of this. He wants to use xgcm to calculate derivatives, divergence, etc. This requires that the data variables of a staggered-grid model (e.g. C-grid, B-grid) be labeled with different dimensions depending on their grid position (e.g. lon_c for cell centers, lon_g for cell faces). We have found that many (if not all) of the CMIP6 datasets just use the same dimension names (e.g. lon, lat) for all variables (e.g. thetao, uo, vo), regardless of their grid position. However, the grid position is implicitly encoded in the actual coordinate values, which differ between variables. (Related to #151.)

Currently, if you use intake-esm to merge, say, thetao and uo, you will end up doubling the length of each variable and inserting missing values, due to the staggered nature of the coordinates. Instead, what we want to do is relabel these as distinct dimensions.
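The interleaving described above can be reproduced with plain xarray. The following is a minimal sketch, assuming made-up coordinate values and a single `lon` dimension; the real CMIP6 files are larger and multi-dimensional. `join="outer"` is spelled out so the example does not depend on the default of the installed xarray version.

```python
import numpy as np
import xarray as xr

# Toy staggered grid (values are invented for illustration):
# tracer points sit at cell centers, velocity points half a cell to the east.
lon_center = np.array([0.5, 1.5, 2.5])
lon_face = np.array([1.0, 2.0, 3.0])

thetao = xr.Dataset(
    {"thetao": ("lon", np.array([10.0, 11.0, 12.0]))},
    coords={"lon": lon_center},
)
uo = xr.Dataset(
    {"uo": ("lon", np.array([0.1, 0.2, 0.3]))},
    coords={"lon": lon_face},
)

# Both datasets call their dimension "lon", so an outer-join merge takes the
# union of the two coordinate arrays and pads each variable with NaN.
interleaved = xr.merge([thetao, uo], join="outer")

print(interleaved.sizes["lon"])                   # length of the merged dimension
print(int(interleaved["thetao"].isnull().sum()))  # padded values in thetao
print(int(interleaved["uo"].isnull().sum()))      # padded values in uo
```

With three points per variable, the merged `lon` has six points and each variable is half missing, which is the doubling described above.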

Julius has some code to do this here: https://github.com/jbusecke/cmip6_preprocessing/blob/master/cmip6_preprocessing/recreate_grids.py

This is similar to a function I wrote a while back for POP: https://github.com/jbusecke/cmip6_preprocessing/blob/master/cmip6_preprocessing/recreate_grids.py

How would we plug this sort of custom merge operation into intake-esm?

Would it be possible to provide a callback for a custom merge function? If so, what API would such a function have to implement? Can we define an interface for this?

cc @naomi-henderson

rabernat, Oct 16 '19 21:10

I think it may be relatively easy to implement a preprocess argument to to_dataset_dict that would accept a function that operates on the individual datasets as they are opened (could also accept kwargs). It could do something like this

def fix_coords(ds):
    if "vo" in ds.data_vars or "uo" in ds.data_vars:
        # rename coords/dims lon-->lon_u etc.
        ds = ds.rename({"lon": "lon_u"})
    return ds

and the user would do something like this

dsets_dict = col.to_dataset_dict(preprocess=fix_coords)

This doesn't get all the way to an xgcm-compatible dataset, but it would prevent xarray from interleaving the coordinates. The full xgcm-ification that @jbusecke is doing could then be applied to the returned, merged dataset (unless I've missed something). This should be easy to implement.
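To illustrate why a per-dataset preprocess step is enough to stop the interleaving, here is a small sketch that runs without a catalog. `open_and_merge` is a hypothetical stand-in for the loading step, not intake-esm's implementation, and the coordinate values are invented.

```python
import numpy as np
import xarray as xr


def fix_coords(ds):
    # Give velocity-point datasets their own dimension name.
    if "vo" in ds.data_vars or "uo" in ds.data_vars:
        ds = ds.rename({"lon": "lon_u"})
    return ds


def open_and_merge(datasets, preprocess=None):
    # Hypothetical stand-in for the loader: apply the user's function to each
    # dataset as it is "opened", then merge the results.
    if preprocess is not None:
        datasets = [preprocess(ds) for ds in datasets]
    return xr.merge(datasets, join="outer")


# Toy inputs on a staggered grid (illustrative values only).
thetao = xr.Dataset(
    {"thetao": ("lon", np.array([10.0, 11.0, 12.0]))},
    coords={"lon": np.array([0.5, 1.5, 2.5])},
)
uo = xr.Dataset(
    {"uo": ("lon", np.array([0.1, 0.2, 0.3]))},
    coords={"lon": np.array([1.0, 2.0, 3.0])},
)

merged = open_and_merge([thetao, uo], preprocess=fix_coords)
print(dict(merged.sizes))
```

Because `uo` now lives on `lon_u`, the two variables no longer share an index, so the merge leaves both at their original length with no padding.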

I think the custom merge is more complicated. We will need to think through and generalize how the merge function gets information from the collection, i.e. which dataset is which, etc.
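One possible shape for such a callback, purely as a discussion aid: the loader hands the merge function a mapping from a catalog attribute to the opened dataset, so the function knows which dataset is which. Everything here is an assumption, including the `MergeFunc` type, the `load_group` helper, and the choice of the variable id as the key; none of it is part of intake-esm.

```python
from typing import Callable, Dict

import numpy as np
import xarray as xr

# Hypothetical interface: datasets keyed by variable id in, one dataset out.
MergeFunc = Callable[[Dict[str, xr.Dataset]], xr.Dataset]


def staggered_merge(dsets: Dict[str, xr.Dataset]) -> xr.Dataset:
    """Move velocity variables onto their own dimension, then merge."""
    relabeled = []
    for variable_id, ds in dsets.items():
        if variable_id in ("uo", "vo"):
            ds = ds.rename({"lon": "lon_u"})
        relabeled.append(ds)
    return xr.merge(relabeled, join="outer")


def load_group(dsets: Dict[str, xr.Dataset], merge_func: MergeFunc) -> xr.Dataset:
    """Stand-in for the loader step that would call the user's callback."""
    return merge_func(dsets)


# Toy inputs (illustrative values only).
group = {
    "thetao": xr.Dataset(
        {"thetao": ("lon", np.array([10.0, 11.0, 12.0]))},
        coords={"lon": np.array([0.5, 1.5, 2.5])},
    ),
    "uo": xr.Dataset(
        {"uo": ("lon", np.array([0.1, 0.2, 0.3]))},
        coords={"lon": np.array([1.0, 2.0, 3.0])},
    ),
}

merged_group = load_group(group, staggered_merge)
print(dict(merged_group.sizes))
```

Keying by a single attribute is the simplest choice; a real interface would probably need to pass more of the catalog row (model, grid label, and so on), which is the generalization question raised above.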

matt-long, Oct 17 '19 14:10

@jbusecke, @rabernat, does #155 provide enough functionality to address this issue?

matt-long, Oct 17 '19 16:10

Testing now...

jbusecke, Oct 17 '19 17:10