intake-esm

ESMSource class for collections in Intake catalogs

Open charlesbluca opened this issue 4 years ago • 3 comments

Is your feature request related to a problem? Please describe.

Currently, there doesn't seem to be a source class for intake-esm collections, so any Intake catalog containing them must use intake_esm.esm_datastore as the driver (as seen in Pangeo's climate catalog):

plugins:
  source:
    - module: intake_esm

sources:
  cmip6_gcs:
    args:
      esmcol_obj: "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
    description: 'CMIP6 in Google Cloud Storage'
    driver: intake_esm.esm_datastore
    metadata: {}

This means that accessing these entries directly calls the intake_esm.esm_datastore constructor and consequently loads the Intake-esm collection's underlying DataFrame into memory:

In [1]: import intake

In [2]: cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/climate.yaml")

In [3]: cat["cmip6_gcs"]
Out[3]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>

This can be computationally expensive for larger collections, and it is unnecessary if we only wish to view the metadata of the collection's entry.

Describe the solution you'd like

An ESMSource class, similar to intake-xarray's ZarrSource, that stores the arguments needed to create an esm_datastore but doesn't initialize it until a dedicated method is called:

In [4]: cat["cmip6_gcs"]
Out[4]: <name: cmip6_gcs>

In [5]: cat["cmip6_gcs"].load()
Out[5]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>

This ESMSource could then be supplied as a driver in Intake catalogs, making it substantially faster to crawl catalogs containing ESM collections.
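To make the deferred-loading idea concrete, here is a minimal sketch in plain Python. It is not part of intake-esm and does not subclass Intake's DataSource. The class name LazyESMSource, the opener parameter, and the helper _open_esm_datastore are all hypothetical names chosen for illustration. The only real call assumed is intake_esm.esm_datastore, invoked with the collection URL as its first argument, as in the catalog above.

```python
def _open_esm_datastore(esmcol_obj, **kwargs):
    # Hypothetical default opener. intake_esm is imported lazily so that
    # merely constructing a LazyESMSource never touches the library.
    import intake_esm

    return intake_esm.esm_datastore(esmcol_obj, **kwargs)


class LazyESMSource:
    """Hypothetical lazy wrapper: keeps the arguments, defers the datastore."""

    def __init__(self, name, esmcol_obj, description="", metadata=None,
                 opener=None, **kwargs):
        self.name = name
        self.esmcol_obj = esmcol_obj
        self.description = description
        self.metadata = dict(metadata or {})
        self.kwargs = kwargs
        # 'opener' is an assumption of this sketch; it lets a stand-in
        # replace the real constructor (for example, in tests).
        self._opener = opener or _open_esm_datastore
        self._datastore = None

    def __repr__(self):
        # Cheap representation: no DataFrame is read to produce it.
        return f"<name: {self.name}>"

    def load(self):
        # Build the datastore on first use, then reuse the cached object.
        if self._datastore is None:
            self._datastore = self._opener(self.esmcol_obj, **self.kwargs)
        return self._datastore
```

With something along these lines, viewing an entry would only touch name, description, and metadata, and the collection's DataFrame would be read on the first load() call. A real driver would also need to plug into Intake's driver machinery, which this sketch leaves out.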

Describe alternatives you've considered

The current implementation of ESM collections within Intake catalogs works fine for accessing a single collection. When crawling catalogs with ESM collections, I typically use cat._entries["some_esm_collection"] to avoid loading the collections directly. This gets the metadata of an ESM collection without opening it, but it is cumbersome when crawling catalogs with mixed entry types.
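For reference, the workaround can be wrapped in a small helper. This is a rough sketch under some assumptions: summarize_entries is a hypothetical name, and each object in cat._entries is assumed to expose a describe() method returning a dict, with keys such as "driver", "description", and "args" (missing keys simply come back as None).

```python
def summarize_entries(entries):
    """Collect entry metadata without instantiating any data source.

    'entries' is any mapping of name -> entry object; each entry is
    assumed (not guaranteed) to provide describe() returning a dict.
    """
    summary = {}
    for name, entry in entries.items():
        info = entry.describe()
        summary[name] = {
            "driver": info.get("driver"),
            "description": info.get("description"),
            "args": info.get("args"),
        }
    return summary
```

It would be called as summarize_entries(cat._entries). Since _entries is a private attribute, this may break between Intake versions, which is part of why a lazy source class seems preferable.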

charlesbluca, Dec 09 '20 21:12

@charlesbluca, I think this is a great idea, would you be interested in submitting a PR? :)

andersy005, Dec 10 '20 20:12

Sure! I'll use this issue for any outstanding questions I have in working on this.

charlesbluca, Dec 11 '20 15:12

Looking through intake-esm/source.py, it seems I've spoken too soon! There are the ESMDataSource and ESMGroupDataSource classes, which can be used as drivers for Intake, although their behavior is different from something like intake-xarray.ZarrSource.

In particular, the source classes expect a pandas.Series and a pandas.DataFrame as input, respectively, and I'm not sure how to supply those from an Intake catalog. Would this be accomplished by providing something like the output of pandas.*.to_json(), but YAML formatted?

Regardless, I'm happy to conceptualize a data source class that takes the URL of an ESM collection as its primary argument (maybe called ESMCollectionDataSource?).

charlesbluca, Dec 11 '20 16:12