intake-esm
ESMSource class for collections in Intake catalogs
Is your feature request related to a problem? Please describe.
Currently, there doesn't seem to be any source class for intake-esm collections, so any Intake catalog containing them must use intake_esm.esm_datastore as the driver, as in Pangeo's climate catalog:
plugins:
  source:
    - module: intake_esm
sources:
  cmip6_gcs:
    args:
      esmcol_obj: "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
    description: 'CMIP6 in Google Cloud Storage'
    driver: intake_esm.esm_datastore
    metadata: {}
This means that accessing these entries directly calls the intake_esm.esm_datastore constructor and consequently loads the Intake-esm collection's underlying DataFrame into memory:
In [1]: import intake
In [2]: cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/climate.yaml")
In [3]: cat["cmip6_gcs"]
Out[3]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>
This can be computationally expensive for larger collections, and it is unnecessary when we only want to view the metadata of the collection's catalog entry.
Describe the solution you'd like
The implementation of an ESMSource class, similar to intake-xarray's ZarrSource, which would store the initial arguments to create an esm_datastore, but wouldn't initialize it until a dedicated method was called:
In [4]: cat["cmip6_gcs"]
Out[4]: <name: cmip6_gcs>
In [5]: cat["cmip6_gcs"].load()
Out[5]: <pangeo-cmip6 catalog with 5822 dataset(s) from 351564 asset(s)>
This ESMSource could then be supplied as a driver in Intake catalogs, making it substantially faster to crawl catalogs containing ESM collections.
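The deferred-initialization behavior described above can be sketched in plain Python. This is only an illustration of the idea, not intake-esm's or Intake's actual API: the class name LazyCollectionSource, the opener parameter, and the fake_opener helper are all hypothetical. In a real driver the opener would presumably be intake_esm.esm_datastore, called with the esmcol_obj argument from the catalog entry.

```python
# Hypothetical sketch of a lazily initialized source; none of these names
# exist in intake-esm. The opener stands in for intake_esm.esm_datastore.

class LazyCollectionSource:
    """Store the arguments for a collection, but open it only on load()."""

    def __init__(self, opener, description="", metadata=None, **kwargs):
        self._opener = opener          # callable that builds the collection
        self._kwargs = kwargs          # e.g. esmcol_obj="https://..."
        self.description = description
        self.metadata = metadata or {}
        self._collection = None        # nothing is opened yet

    def load(self):
        # Build the collection on first use, then reuse the cached object.
        if self._collection is None:
            self._collection = self._opener(**self._kwargs)
        return self._collection

    def __repr__(self):
        return f"<LazyCollectionSource args={self._kwargs!r}>"


# Stand-in opener (an assumption for this demo) that records each call,
# so we can see when the expensive step actually happens.
calls = []

def fake_opener(esmcol_obj):
    calls.append(esmcol_obj)
    return {"esmcol_obj": esmcol_obj}
```

With this shape, reading description or metadata never triggers the opener; only load() does, and repeated calls to load() reuse the same object.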
Describe alternatives you've considered
The current implementation of ESM collections within Intake catalogs works fine for accessing a single collection. When crawling catalogs that contain ESM collections, I typically use cat._entries["some_esm_collection"] to avoid loading the collections directly. This retrieves the metadata of an ESM collection without opening it, but it relies on a private attribute and becomes cumbersome when crawling catalogs with mixed entry types.
@charlesbluca, I think this is a great idea, would you be interested in submitting a PR? :)
Sure! I'll use this issue for any outstanding questions I have in working on this.
Looking through intake-esm/source.py, it seems I've spoken too soon! There are the ESMDataSource and ESMGroupDataSource classes, which can be used as drivers for Intake, although their behavior is different from something like intake-xarray.ZarrSource.
In particular, these source classes expect a pandas.Series and a pandas.DataFrame as input, respectively, and I'm not sure how to supply those from an Intake catalog. Would this be done by providing something like the output of pandas.*.to_json(), but formatted as YAML?
Regardless, I'm happy to conceptualize a data source class that takes the URL of an ESM collection as its primary argument (maybe called ESMCollectionDataSource?).