
Intake, catalogs, and datatree

Open TomNicholas opened this issue 1 year ago • 3 comments

Thanks @TomNicholas and sorry for creating issue noise. I guess I got a bit carried away with these comments in the readme:

  • Has functions for mapping user-supplied functions over every node in the tree,
  • Automatically dispatches some of xarray.Dataset's API over every node in the tree (such as .isel),
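As a rough illustration of those two readme bullets, here is a minimal stdlib-only sketch. It is a toy model, not datatree's implementation: the `Node` class, its `map_over_nodes` and `isel` methods, and the dict-of-lists "dataset" are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy stand-in for a dataset (assumption): variable name -> list of values.
Dataset = Dict[str, List[float]]


@dataclass
class Node:
    """Hypothetical tree node for illustration; NOT the datatree class."""

    data: Dataset = field(default_factory=dict)
    children: Dict[str, "Node"] = field(default_factory=dict)

    def map_over_nodes(self, func: Callable[[Dataset], Dataset]) -> "Node":
        """Apply func to this node's data, recurse into children, return a new tree."""
        return Node(
            data=func(self.data),
            children={
                name: child.map_over_nodes(func)
                for name, child in self.children.items()
            },
        )

    def isel(self, index: int) -> "Node":
        """Dispatch a dataset-level positional selection to every node."""
        return self.map_over_nodes(
            lambda ds: {name: [values[index]] for name, values in ds.items()}
        )


tree = Node(
    data={"t": [1.0, 2.0, 3.0]},
    children={"ocean": Node(data={"sst": [10.0, 11.0, 12.0]})},
)

# A user-supplied function mapped over every node in the tree.
doubled = tree.map_over_nodes(
    lambda ds: {name: [2 * x for x in values] for name, values in ds.items()}
)

# A dataset-style method dispatched over every node in the tree.
first = tree.isel(0)
```

Both operations return a new tree and leave the original untouched, which is the design choice made here so that a mapped function never mutates shared data.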

I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds. Leaves in the tree could be compositions over netCDF files, which may be aggregated JSON indexes. I guess I was thinking that some sort of formalism over a nested data structure could help in composing dask computational graphs. I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.

I wonder if @andersy005, @mdurant or @rsignell have any experience or thoughts on whether an interface between this library and intake makes sense?

Originally posted by @pbranson in https://github.com/xarray-contrib/datatree/issues/97#issuecomment-1200292141

TomNicholas · Jul 30 '22 21:07

@pbranson thanks for your ideas about integration of datatree with the intake ecosystem, this is definitely something I'm really interested in, and a potential use case I had in mind when originally creating this package.

I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds.

I think this makes sense. Datatree is almost like an in-memory catalog of datasets.
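To make the catalog analogy concrete, here is a hedged sketch of how a flat catalog keyed by path-like names could be folded into a nested tree. Everything in it is an assumption for illustration: the `catalog_to_tree` helper, the key layout, and the file names are hypothetical, and plain dicts stand in for tree nodes rather than any intake or datatree object.

```python
from typing import Any, Dict


def catalog_to_tree(entries: Dict[str, Any]) -> Dict[str, Any]:
    """Fold a flat {"a/b/c": source} mapping into nested dicts.

    Assumes no key is both a leaf and a group (e.g. "a/b" and "a/b/c").
    """
    tree: Dict[str, Any] = {}
    for path, source in entries.items():
        parts = path.strip("/").split("/")
        node = tree
        # Walk (or create) the intermediate groups.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # The final path component is the leaf holding the data source.
        node[parts[-1]] = source
    return tree


# Hypothetical catalog entries; the file names are placeholders.
catalog = {
    "cmip6/historical/tas": "tas_hist.nc",
    "cmip6/historical/pr": "pr_hist.nc",
    "cmip6/ssp585/tas": "tas_ssp585.nc",
}

tree = catalog_to_tree(catalog)
```

In this picture each leaf would hold a dataset (or a lazy reference to one) rather than a file name, and the intermediate dicts play the role of groups in the tree.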

Leaves in the tree could be compositions over netCDF files, which may be aggregated JSON indexes.

Yep. There are probably lots of cool possibilities. My priority would be to build datatree in such a way that other packages can easily understand the model and experiment with interfacing in ways they think are sensible.

I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.

I think this poor performance could be a bunch of different problems, and I'm not sure if datatree actually solves any of the dask-side issues. Datatree just makes it easier to express the complex operation which behaves poorly when run via dask.

cc @rabernat who has also pointed out the correspondence between datatree and intake catalogs to me before.

TomNicholas · Jul 30 '22 21:07

@TomNicholas Thanks for breaking this out of #97!

I should have guessed that this would have been part of your discussions! I just scanned back over the issues that prompted the creation of datatree.

I think this poor performance could be a bunch of different problems, and I'm not sure if datatree actually solves any of the dask-side issues.

The dask-side challenges could be entirely due to the details of my naïve usage! :-)

pbranson · Jul 30 '22 22:07

The ability to open a set of intake catalogs as a DataTree was actually added to intake-esm in https://github.com/intake/intake-esm/pull/512 by @mgrover!

Separately it's also been suggested that we might want to write a plugin for intake proper.

TomNicholas · Aug 31 '22 21:08