
open_mfdatatree

TomNicholas opened this issue 3 years ago • 6 comments

Currently we have an open_datatree function which opens a single netcdf file (or zarr store). We could imagine an open_mfdatatree function which is analogous to open_mfdataset, which can open multiple files at once.

As DataTree has a structure essentially the same as that of a filesystem, I'm imagining a use case where the user has a bunch of data files stored in nested directories, e.g.

project
    /experimental
        data.nc
    /simulation
        /highres
            output.nc
        /lowres
            output.nc

We could look through all of these folders recursively, open any files found of the correct format, and store them in a single tree.

We could even allow for multiple data files in each folder if we called open_mfdataset on all the files found in each folder.
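The recursive-discovery step could be sketched as follows. This is a minimal illustration, not datatree's API: `discover_groups` is a hypothetical helper that maps each directory to a DataTree-style group path; a real `open_mfdatatree` would then feed each file list to `open_mfdataset` and attach the result at that node (those calls are omitted here).

```python
from pathlib import Path

def discover_groups(root):
    """Map each directory under `root` to the data files it contains.

    Keys are DataTree-style group paths ("/" for the root); values are
    sorted lists of file paths. Only .nc files are matched in this sketch;
    a real implementation would also recognise zarr stores, engines, etc.
    """
    root = Path(root)
    groups = {}
    for d in [root, *sorted(p for p in root.rglob("*") if p.is_dir())]:
        files = sorted(f for f in d.iterdir() if f.is_file() and f.suffix == ".nc")
        if files:
            groups["/" + "/".join(d.relative_to(root).parts)] = files
    return groups
```

For the example layout above, this would yield entries for `/experimental`, `/simulation/highres`, and `/simulation/lowres`, each holding one file.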

EDIT: We could also save a tree out to multiple folders like this using a save_mfdatatree method.

This might be particularly useful for users who want the benefit of a tree-like structure but are using a file format that doesn't support groups.
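The save direction is the inverse mapping. As a sketch (again hypothetical, with the actual `Dataset.to_netcdf` calls omitted), a `save_mfdatatree` could plan one file per group, mirroring the tree as nested directories:

```python
from pathlib import Path

def plan_save(group_paths, root, filename="output.nc"):
    """Map each DataTree group path to the file a save_mfdatatree
    might write, mirroring the tree path as directories under `root`.
    """
    root = Path(root)
    return {g: root / Path(g.strip("/")) / filename for g in group_paths}
```

For example, `plan_save(["/simulation/highres"], "project")` would place that group's data at `project/simulation/highres/output.nc`.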

TomNicholas avatar Dec 16 '21 22:12 TomNicholas

In the case of save_mfdatatree, where would it save the global and group level attributes? I see two paths:

  • Each file preserves the upper levels. For instance, in your example, data.nc would still contain internal groups such as /experimental/data, omitting the other branches while preserving the global attributes as well as the attributes for experimental.
  • Since attributes are relevant for all levels underneath them, the global attributes from the project would be carried down to experimental (combined, giving precedence to experimental when duplicated) and then on to data, giving precedence to data attributes when duplicated. For instance, data.nc would inherit the attribute Conventions from the top-level project. That way, data.nc would be complete and self-contained, without losing relevant information.
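The second option amounts to a root-to-leaf merge where deeper groups win on duplicated keys. A minimal sketch (the attribute names and values below are made up for illustration):

```python
def inherited_attrs(*levels):
    """Merge attribute dicts ordered root -> leaf, letting deeper
    levels override duplicated keys, so a saved leaf file carries
    everything relevant from its ancestors.
    """
    merged = {}
    for attrs in levels:
        merged.update(attrs)
    return merged

# Hypothetical values: "Conventions" set only at the project level is
# inherited, while "title" is overridden by each deeper group.
project = {"Conventions": "CF-1.8", "title": "project"}
experimental = {"title": "experimental"}
data = {"title": "lab run"}
merged = inherited_attrs(project, experimental, data)
# -> {'Conventions': 'CF-1.8', 'title': 'lab run'}
```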

Extending the second option a little, it could be nice functionality to be able to extract any level of the tree without losing information. It could be a layer before actually exporting to a netCDF. If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?

castelao avatar Sep 01 '22 01:09 castelao

In the case of save_mfdatatree, where would it save the global and group level attributes?

Each file preserves the upper levels.

I think this is what I was imagining. That's the most direct and simple mapping between an in-memory datatree and a set of folders and .nc files.

the global attributes from the project would be carried to experimental ...

I'm hesitant to do anything that introduces "inheritance" from nodes above like this. The problem is that different group-supporting formats have different hierarchical behaviours, so something that follows netCDF might be weird with Zarr. Ultimately the in-memory DataTree should only work one way, so a choice has to be made there (and so far I've gone for the simplest choice: independence between nodes). That said, you could imagine having a kwarg to save_mfdatatree that changes the behaviour like this when saving.

If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?

There is no specific method for flattening parts of the tree, but we can make one! (xref #79) I'm not quite sure what you want it to do though - what type would you want project.flatten("/simulation/highres") to return?

TomNicholas avatar Sep 01 '22 18:09 TomNicholas

Preserving the structure sounds wise. I have two suggestions on that:

  • Keep track of where it came from. Maybe use the global attribute source in data.nc and output.nc, possibly pointing to the id attribute of the original project. id should be unique if it follows ACDD-1.3.
  • Some relevant information might be left at higher levels. In your example, let's assume that depth is common between highres and lowres, so it is stored at the simulation group level. In that case highres/output.nc and lowres/output.nc are incomplete, not self-contained. One option to avoid redundancy in higher-level variables would be using an External Variables attribute.

On the flattening, I was thinking of something like project.subset(["/project/hires/temperature", "/project/hires/doxy"]).squeeze(). This would extract a subset of variables along with all other variables and attributes from upper levels. Then, in some cases it might make sense to flatten "unnecessary" layers. For instance, if the outcome is time, lat, lon, and sea surface height, I might just use a flat dataset. I envision cases where it makes sense to distribute large, consistent, and complete datasets, such as a full simulation and its products, all variables measured by different sensors aboard the same satellite, or all products from a single glider mission. But from the user's perspective, it is common for someone to be interested in a single branch of that hierarchical tree, and in that case, information spread across multiple levels adds unnecessary complexity.
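The "extract a branch without losing upper-level information" idea can be sketched with a toy nested-dict tree (this is not datatree's data model, just an illustration of the attribute-folding behaviour such a subset/flatten method might have):

```python
def extract_branch(tree, path):
    """Pull one branch out of a nested {"attrs": ..., "children": ...}
    tree, folding every ancestor's attrs into the result (deepest wins)
    so the extracted branch is self-contained.
    """
    node = tree
    attrs = dict(tree.get("attrs", {}))
    for name in path.strip("/").split("/"):
        node = node["children"][name]
        attrs.update(node.get("attrs", {}))
    return {"attrs": attrs, "children": node.get("children", {})}
```

With a project-level "Conventions" attribute and a "depth"-style attribute at the simulation level, extracting "/simulation/highres" would return a node carrying both, plus its own attributes.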

I have no strong opinion on any of those; they're just ideas.

castelao avatar Sep 05 '22 19:09 castelao

There is no specific method for flattening parts of the tree

Just found this: https://gitlab.eumetsat.int/open-source/netcdf-flattener/

dcherian avatar Feb 07 '23 17:02 dcherian

@dcherian , thanks for pointing that out! @erget is a major contributor to the CF-Conventions and a great person to work with. Maybe there is a common interest here.

castelao avatar Feb 08 '23 04:02 castelao

Our team is interested in open_mfdatatree and save_mfdatatree as well, but for the purpose of avoiding large files. Our total dataset size is hundreds of GB, so it would be nice to have a set of smaller netCDF files for each week of data.

Evidlo avatar Oct 11 '23 05:10 Evidlo