
Support nested collections of datasets (datatree)

norlandrhagen opened this issue 4 years ago • 9 comments

Hi there,

We want to use Datatree (a new package for working with hierarchies of xarray Datasets) together with Xpublish. A single datatree.DataTree can be written to a zarr dataset where subgroups typically contain an xarray.Dataset and optional subgroups.

Our specific application is looking to serve data from a multi-dimensional data pyramid (see ndpyramid for more details) that looks something like:

/
 ├── .zmetadata
 ├── .zgroup
 ├── .zattrs
 ├── 0
 │   ├── .zgroup
 │   └── tavg
 │       ├── .zarray
 │       └── 0.0
 ├── 1
 │   ├── .zgroup
 │   └── tavg
 │       ├── .zarray
 │       ├── 0.0
 │       ├── 0.1
 │       ├── 1.0
 │       └── 1.1
 ├── 2
…

We could serve each subgroup independently, but that is less desirable since the top-level group metadata (stored in .zattrs and in the consolidated .zmetadata) is needed to describe the relationships among groups.

Proposed feature addition

My assumption is that to serve a dataset like the one described above, we need to build a custom router for DataTrees. This new router (we'll call it the ZarrDataTreeRouter) would be able to reuse many of the existing zarr endpoints, but would support a more deeply nested data model.

In https://github.com/carbonplan/maps/issues/15, @benbovy suggested that this sort of support would make sense here so, perhaps we can simply ask for some pointers on how to architect the ZarrDataTreeRouter?

One specific question we have is how an implementation of this should interface with #88 and #89, both of which seem to be reshaping how complex, custom routers are developed.
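To make the idea concrete, here is a minimal sketch of one piece of logic a hypothetical ZarrDataTreeRouter would need: splitting an incoming zarr store key into the DataTree group it addresses and the zarr key to serve from that group. The function name and behavior are assumptions for illustration, not existing Xpublish API.

```python
def split_tree_key(path: str) -> tuple[str, str]:
    """Split a zarr store key into (datatree group path, zarr key).

    The final path component is either zarr metadata (.zarray, .zgroup, ...)
    or a chunk key like "0.0"; everything before it addresses a group or
    variable node in the DataTree.
    """
    parts = path.strip("/").split("/")
    group, key = "/".join(parts[:-1]), parts[-1]
    return group or "/", key


print(split_tree_key("0/tavg/.zarray"))  # ('0/tavg', '.zarray')
print(split_tree_key("1/tavg/0.1"))      # ('1/tavg', '0.1')
print(split_tree_key(".zmetadata"))      # ('/', '.zmetadata')
```

The router's path functions could then look up the node in the tree and fall back to the existing per-dataset zarr handlers for the final key.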

cc @jhamman

norlandrhagen avatar Sep 30 '21 17:09 norlandrhagen

In carbonplan/maps#15, @benbovy suggested that this sort of support would make sense here so, perhaps we can simply ask for some pointers on how to architect the ZarrDataTreeRouter?

For responsive front-end applications, it's probably better to pre-compute all pyramid levels before serving them. So it would be best if xpublish could serve pre-computed DataTree objects.

This is something that xpublish could support, but then we would probably need something to distinguish between API routers which have special support for DataTree objects and those which don't (I assume that a DataTree could easily be reduced to a Dataset -- either flattened or just use the top-level group -- so all routers would have basic support for it?). #89 might certainly help here.
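To illustrate the two reductions mentioned above, here is a toy sketch in which a dict of dicts stands in for a DataTree; the real datatree API differs, this only shows the shape of "just the top-level group" versus "flattened with path-prefixed names".

```python
# Stand-in for a DataTree: group path -> mapping of variables/attrs.
tree = {
    "/": {"title": "pyramid"},
    "0": {"tavg": "level-0 array"},
    "1": {"tavg": "level-1 array"},
}


def top_level(tree):
    """Reduce to a Dataset-like mapping by keeping only the root group."""
    return tree["/"]


def flatten(tree):
    """Reduce by merging every group's variables under path-prefixed names."""
    flat = {}
    for group, variables in tree.items():
        for name, value in variables.items():
            key = name if group == "/" else f"{group}/{name}"
            flat[key] = value
    return flat


print(top_level(tree))  # {'title': 'pyramid'}
print(flatten(tree))
# {'title': 'pyramid', '0/tavg': 'level-0 array', '1/tavg': 'level-1 array'}
```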

The alternative approach using a custom xarray "index" has the advantage that we could still use xpublish as usual with Dataset objects and hide all the pyramid logic and data-structures within the "index" that we can build before serving the datasets. However, Xarray custom indexes are not ready for use yet.

benbovy avatar Oct 01 '21 16:10 benbovy

A 3rd approach that may work now without much effort is to encapsulate the DataTree object in a custom dataset accessor, e.g.,

ds = xr.open_dataset(...)

# create the data pyramids using ndpyramid
# and store the resulting datatree in an internal attribute
# of the `pyramids` dataset accessor 
ds.pyramids.build(...)

# property that returns the datatree
ds.pyramids.datatree

Then, you could write a custom xpublish API router with path functions in which you can access the datatree via the pyramids accessor. I think you could also easily reuse the helper functions already available in Xpublish to provide a Zarr-like REST API.
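A rough sketch of that accessor pattern, with a plain class standing in for the dataset (the real version would be registered with xarray's @xr.register_dataset_accessor("pyramids") and would call ndpyramid inside build(); the names here mirror the example above but the internals are assumptions):

```python
class PyramidsAccessor:
    """Hypothetical 'pyramids' accessor holding a datatree internally."""

    def __init__(self, dataset):
        self._ds = dataset
        self._datatree = None

    def build(self, levels=2):
        # The real version would call ndpyramid here and store the
        # resulting DataTree; a dict of placeholder levels stands in.
        self._datatree = {
            str(lvl): f"coarsened copy of {self._ds!r}" for lvl in range(levels)
        }
        return self

    @property
    def datatree(self):
        if self._datatree is None:
            raise ValueError("call .build() first")
        return self._datatree


acc = PyramidsAccessor("ds").build(levels=3)
print(sorted(acc.datatree))  # ['0', '1', '2']
```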

benbovy avatar Oct 07 '21 10:10 benbovy

cc @TomNicholas

cisaacstern avatar Apr 26 '22 19:04 cisaacstern

We have a group of folks taking a look at this at the IOOS Code Sprint this week. We'd love to be able to bring a dynamic carbonplan/maps-type experience to our various regions' forecast data.

From our discussions so far, I think the things we need to focus on to make this happen are:

  • Can we generate the various levels of metadata on the fly without loading data?
  • Can we generate a single zarr array chunk on request, without coarsening the entire dataset?

Full disclosure: I've only gushed over carbonplan/maps, datatree, and ndpyramid rather than having used them in anger, though others in our group have.

abkfenris avatar Apr 26 '22 20:04 abkfenris

Can we generate the various levels of metadata on the fly without loading data?

If the reprojection/tiling step is lazy, this should be possible. We've experimented with a few ways to do this and, so far, the xESMF method is the most promising (see ndpyramid.regrid.pyramid_regrid).

Can we generate a single zarr array chunk on request, without coarsening the entire dataset?

This also seems to be easier with xESMF, but if you end up using rasterio to reproject/tile, I think you'll need to bring in some custom logic to generate each chunk.
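One piece of that custom logic, sketched under the assumption that chunk keys follow zarr's default dot-separated convention (e.g. "1.0"): mapping a requested chunk key to the array slices it covers, so only that window needs to be reprojected rather than the whole dataset.

```python
def chunk_slices(chunk_key: str, chunk_shape: tuple[int, ...]) -> tuple[slice, ...]:
    """Map a zarr chunk key to the index window it covers.

    E.g. "1.0" with chunks (128, 128) -> (slice(128, 256), slice(0, 128)).
    """
    idx = [int(i) for i in chunk_key.split(".")]
    return tuple(slice(i * c, (i + 1) * c) for i, c in zip(idx, chunk_shape))


print(chunk_slices("1.0", (128, 128)))  # (slice(128, 256), slice(0, 128))
print(chunk_slices("0.1", (64, 64)))    # (slice(0, 64), slice(64, 128))
```

With that window in hand, a handler could reproject just the matching source region and encode the result as the requested chunk.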

jhamman avatar Apr 26 '22 20:04 jhamman

I am working with @abkfenris on this, and to expand, our specific reason for wanting this is that model data is non-static in the time dimension AND our datasets are often not global. So precomputing is inefficient for us, but I understand it can be a niche use case.

So far I am working with rasterio to tile, and am just now starting to think about how this applies to zarr chunking.

mpiannucci avatar Apr 26 '22 20:04 mpiannucci

Can we generate a single zarr array chunk on request, without coarsening the entire dataset?

If I understand the problem well, I think it should be possible to create a custom API endpoint where, e.g., an input bounding box (and/or an input time value or slice) is first used to index the dataset (using xarray's .sel()), then on-the-fly coarsen / regrid / reproject operations are performed, and finally the generated data is sent in the zarr format (or another format) as the response.

Those operations may take time, though, probably too much for interactive visualization applications, but if the input bounding boxes are on a static grid (tiles) it may be worth caching the intermediate results (using xpublish's cachey cache) to further speed up query/response times.
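A toy version of that caching idea, using functools.lru_cache as a stand-in for xpublish's cachey cache, with fraction-of-world tile bounds as the static grid. The tile scheme and function names are assumptions for illustration only.

```python
from functools import lru_cache


def tile_bounds(z: int, x: int, y: int) -> tuple[float, float, float, float]:
    """Bounds of tile (z, x, y) in fraction-of-world units on a static grid."""
    n = 2 ** z
    return (x / n, y / n, (x + 1) / n, (y + 1) / n)


@lru_cache(maxsize=256)
def render_tile(z: int, x: int, y: int) -> str:
    # Placeholder for the expensive step: ds.sel(<bbox>) followed by
    # coarsen/regrid/reproject and encoding the result as a zarr chunk.
    return f"tile {z}/{x}/{y} over bounds {tile_bounds(z, x, y)}"


print(tile_bounds(1, 0, 0))            # (0.0, 0.0, 0.5, 0.5)
render_tile(2, 1, 1)                   # computed on first request
render_tile(2, 1, 1)                   # served from the cache
print(render_tile.cache_info().hits)   # 1
```

Because the tile grid is static, identical bounding boxes hash to the same cache key, so repeated pan/zoom requests skip the expensive step entirely.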

I'm not sure to what extent it is possible to simply reuse the logic currently implemented in xpublish for serving dataset chunks through its zarr API endpoints.

benbovy avatar Apr 26 '22 21:04 benbovy

Thanks for the input that folks had on this. While we didn't solve it during our code sprint, we did make some headway.

Right now folks are traveling back from the event, but we are going to try to compile what we found and wrote down (rather than just the mid-event muttering back and forth over Zoom). Hopefully we can have that ready to share in the next few days.

abkfenris avatar Apr 28 '22 20:04 abkfenris

With the new plugin system, this should now be possible without changes to Xpublish itself.

It would involve writing an app router plugin to serve datatrees under a new path. The app router can then mount existing dataset routers under its prefix, providing them a modified Dependency whose .dataset instead parses the request path to figure out which datatree is being addressed, and the path into the tree (using these FastAPI methods).

To make it nicely adaptable, plugins can also define new hooks that other plugins can implement to further extend Xpublish. A datatree router could thus include get_datatree and get_datatree_keys hooks, similar to what Xpublish provides now, that other plugins could implement for loading datatrees, keeping loading separate from routing. Then the router can have a set of methods like these to query other plugins for datatrees.
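The hook pattern described above might look roughly like this. This is a pure-Python toy: the real Xpublish plugin system is built on pluggy hookspecs, and the class names here (DataTreeHooks, DemoLoader) are assumptions; only the get_datatree / get_datatree_keys hook names come from the comment above.

```python
class DataTreeHooks:
    """Toy hook registry: the router queries loader plugins for datatrees."""

    def __init__(self):
        self._loaders = []

    def register(self, loader):
        self._loaders.append(loader)

    def get_datatree_keys(self):
        # Collect keys from every registered loader plugin.
        keys = []
        for loader in self._loaders:
            keys.extend(loader.get_datatree_keys())
        return keys

    def get_datatree(self, key):
        # First loader that recognizes the key wins.
        for loader in self._loaders:
            tree = loader.get_datatree(key)
            if tree is not None:
                return tree
        raise KeyError(key)


class DemoLoader:
    """Hypothetical loader plugin exposing one in-memory tree."""

    def get_datatree_keys(self):
        return ["pyramid"]

    def get_datatree(self, key):
        return {"0": "level 0", "1": "level 1"} if key == "pyramid" else None


hooks = DataTreeHooks()
hooks.register(DemoLoader())
print(hooks.get_datatree_keys())       # ['pyramid']
print(hooks.get_datatree("pyramid"))   # {'0': 'level 0', '1': 'level 1'}
```

The separation means a routing plugin never needs to know where trees come from; loaders for files, object stores, or on-the-fly pyramids all plug into the same hooks.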

abkfenris avatar Mar 30 '23 20:03 abkfenris