
Should diagnostics package take advantage of "Intake-esm"?

Open nusbaume opened this issue 3 years ago • 5 comments

It looks like a new Python package called "Intake-esm" has been developed at NCAR. It is designed to query and load various forms of output from Earth System Models. The documentation for this package can be found here:

https://intake-esm.readthedocs.io/en/latest/

It might be worth seeing if this package can be used in the CAM diagnostics package to improve the reading of CAM (or other model) output files, at least in situations where one can't use NCO operators. Any thoughts?

nusbaume, Apr 05 '21 20:04

Agree. I have looked at intake (and to a lesser extent at intake-esm) quite a few times. My brain doesn't quite get it, but I do think we should evaluate whether it is a better way to handle the input files.

brianpm, Apr 05 '21 21:04

I think it would be useful to use the ecgtools package here to help with this... within MDTF, they are essentially creating an intake-esm catalog from the directory hierarchy, then using that to access data throughout the different scripts. We can set up a time to chat about this if everyone would like... I think this will be key to incorporate within the different diagnostics packages.

mgrover1, Jun 16 '21 18:06

Thanks, @mgrover1. I continue to be intrigued by the idea of using intake. I have to admit that I'm still having trouble understanding how it helps for single runs or pairs of runs. I was actually just playing with intake-esm this morning (not using ecgtools); I'm able to generate a Collection and a Catalog and access it... but for a single run this seems like it definitely adds overhead on the way to getting to actually doing something with the data. I think I'm missing some important nuances about the problem intake/intake-esm/ecgtools is solving. Definitely we should chat about it.

Along these lines, one of the issues I'm struggling with is, I think, very much the kind of thing intake might be able to handle. In ADF we'd like to give the option to start from history files or time series files, generate climatology files along the way, and then allow the analysis and plotting scripts to access any of that data. That means we need a data structure that allows us to efficiently determine what data is available and get to it. I think the CamDiag class has some of this already, but it also has some shortcomings. I started (yesterday) exploring the idea of using dataclasses as a way to introduce something like a "CaseDescriptor" class that could be passed around within the diagnostics to allow ubiquitous access to all the information/files about each case (or set of obs). Anyway, I'm rambling at this point, but we should talk!
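To make the "CaseDescriptor" idea a bit more concrete, here is a minimal sketch using dataclasses. Everything in it is hypothetical (the field names, the stage labels "timeseries" and "climatology", the method names, and the file paths); it is not existing ADF code, just one way such a class could record what data is available for a case and how to get to it.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class CaseDescriptor:
    """Hypothetical container for the files known for one case (or obs set)."""

    name: str
    # Nested mapping: processing stage -> variable name -> list of files.
    files: Dict[str, Dict[str, List[Path]]] = field(default_factory=dict)

    def add(self, stage: str, variable: str, path: str) -> None:
        """Register a file under a processing stage and variable."""
        self.files.setdefault(stage, {}).setdefault(variable, []).append(Path(path))

    def available(self, variable: str) -> List[str]:
        """Return the stages that have at least one file for the variable."""
        return [stage for stage, by_var in self.files.items() if variable in by_var]

    def paths(self, stage: str, variable: str) -> List[Path]:
        """Return the files for a stage/variable pair (empty list if none)."""
        return self.files.get(stage, {}).get(variable, [])


# Usage with made-up paths: register files, then ask what exists for "TS".
case = CaseDescriptor(name="test_case")
case.add("timeseries", "TS", "/hypothetical/ts/test_case.TS.nc")
case.add("climatology", "TS", "/hypothetical/climo/test_case_TS_climo.nc")
print(case.available("TS"))
```

A plotting script handed this object would only need to ask for a stage and variable, rather than rebuilding file paths itself.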

brianpm, Jun 16 '21 19:06

The nice thing about using an intake-esm catalog is that it would be able to work with either history or time series files... it essentially parameterizes data access, where one could use the following steps:

  1. Build a catalog from some case output directory using ecgtools
  2. Read in the catalog from some script and search it, for example by variable:
     col = intake.open_esm_datastore('some_catalog').search(variable='some_variable')
     then load the matching data with
     col.to_dataset_dict()

This would avoid having to specify where to look for each set of files, since the search is flexible. It would also allow for files stored externally, and it works with either zarr or netCDF. The key here is parameterizing data access. This is what MDTF is doing within its "data manager", although to a less flexible extent, since it relies on a rigid directory structure.
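As a rough illustration of what "parameterizing data access" means, here is a standard-library-only sketch. It is not the intake-esm or ecgtools API: the catalog is a plain list of dictionaries, and the column names ("case", "kind", "variable", "path"), the helper names, and the file paths are all made up. The point is only that scripts ask for data by attributes instead of by directory layout.

```python
from typing import Dict, List

# Toy catalog: one record per file. In the workflow described above, a tool
# such as ecgtools would build this table by parsing a case output directory.
CATALOG: List[Dict[str, str]] = [
    {"case": "test_case", "kind": "timeseries", "variable": "TS",
     "path": "/hypothetical/ts/test_case.TS.nc"},
    {"case": "test_case", "kind": "timeseries", "variable": "PRECT",
     "path": "/hypothetical/ts/test_case.PRECT.nc"},
    {"case": "test_case", "kind": "climatology", "variable": "TS",
     "path": "/hypothetical/climo/test_case_TS_climo.nc"},
    {"case": "baseline_case", "kind": "timeseries", "variable": "TS",
     "path": "/hypothetical/ts/baseline_case.TS.nc"},
]


def search(catalog: List[Dict[str, str]], **query: str) -> List[Dict[str, str]]:
    """Return the records whose columns match every key=value in the query."""
    return [rec for rec in catalog
            if all(rec.get(key) == value for key, value in query.items())]


def group_paths(records: List[Dict[str, str]],
                keys=("case", "kind")) -> Dict[str, List[str]]:
    """Group file paths under a dotted key built from the given columns."""
    groups: Dict[str, List[str]] = {}
    for rec in records:
        group_key = ".".join(rec[k] for k in keys)
        groups.setdefault(group_key, []).append(rec["path"])
    return groups


# Usage: find every file holding "TS", then group by case and kind.
matches = search(CATALOG, variable="TS")
datasets = group_paths(matches)
for group_key, paths in datasets.items():
    print(group_key, paths)
```

The calling script never mentions a directory; moving or adding files only changes the catalog, not the analysis code.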

mgrover1, Jun 16 '21 20:06

Adding notes here to myself so that I don't forget them:

  1. Instructions on how to create a custom parser (needed for generating an AMWG-obs catalog):

    https://ecgtools.readthedocs.io/en/latest/how-to/use-a-custom-parser.html

  2. General ecgtools documentation (which will help implement general Intake-ESM use in the ADF):

    https://ecgtools.readthedocs.io/en/latest/index.html

  3. For calculating derived variables (or other operations like unit conversions), funnel is being replaced by "xcollection":

    https://xcollection.readthedocs.io/en/latest/

Hopefully these links will help with the final Intake-ESM implementation!
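For item 1, a custom parser is essentially a function that takes a file path and returns a dictionary of attributes for that file. Below is a rough sketch of what one might look like for obs files. The filename pattern (`<source>_<period>_climo.nc`), the function name, and the returned keys are assumptions for illustration, not the actual AMWG-obs layout, and the step of handing the function to ecgtools is left to the guide linked above.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed pattern: <source>_<period>_climo.nc, e.g. "ERAI_ANN_climo.nc".
_OBS_PATTERN = re.compile(r"^(?P<source>.+)_(?P<period>[A-Za-z0-9]+)_climo\.nc$")


def parse_obs_file(path: str) -> Dict[str, str]:
    """Extract catalog attributes from a (hypothetical) obs filename."""
    name = Path(path).name
    match = _OBS_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized obs filename: {name}")
    return {
        "source": match["source"],  # dataset name, e.g. a reanalysis product
        "period": match["period"],  # season or month label
        "path": str(path),
    }


# Usage with a made-up path.
record = parse_obs_file("/hypothetical/obs/ERAI_ANN_climo.nc")
print(record)
```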

nusbaume, Feb 07 '22 18:02