datatree
open_mfdatatree
Currently we have an open_datatree function which opens a single netCDF file (or Zarr store). We could imagine an open_mfdatatree function, analogous to open_mfdataset, which can open multiple files at once.
As DataTree has a structure essentially the same as that of a filesystem, I'm imagining a use case where the user has a bunch of data files stored in nested directories, e.g.
project
    /experimental
        data.nc
    /simulation
        /highres
            output.nc
        /lowres
            output.nc
We could look through all of these folders recursively, open any files found of the correct format, and store them in a single tree.
We could even allow for multiple data files in each folder if we called open_mfdataset on all the files found in each folder.
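The recursive lookup described above could be sketched roughly as follows. This is only an illustration, not an existing API: the helper group_files_by_folder and this open_mfdatatree are hypothetical, the choice of one tree node per folder is an assumption, and the tree construction assumes a recent xarray that ships xarray.DataTree.from_dict (with dask installed for open_mfdataset).

```python
from pathlib import Path


def group_files_by_folder(root, pattern="*.nc"):
    """Map each folder under ``root`` to the data files it directly contains.

    Keys are tree-style paths relative to ``root`` ("/" for ``root`` itself).
    Hypothetical helper, not part of xarray or datatree.
    """
    root = Path(root)
    groups = {}
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        rel = path.parent.relative_to(root).as_posix()
        key = "/" if rel == "." else "/" + rel
        groups.setdefault(key, []).append(path)
    return groups


def open_mfdatatree(root, pattern="*.nc", **kwargs):
    """Open every folder's files with open_mfdataset and combine them in one tree.

    Assumption: one tree node per folder, named after the folder.
    """
    import xarray as xr  # imported lazily so the helper above has no dependencies

    datasets = {
        key: xr.open_mfdataset([str(p) for p in paths], **kwargs)
        for key, paths in group_files_by_folder(root, pattern).items()
    }
    # Intermediate folders without files become empty nodes in the tree.
    return xr.DataTree.from_dict(datasets)
```

With the example layout, group_files_by_folder("project") would yield the keys /experimental, /simulation/highres, and /simulation/lowres, each listing the files found in that folder.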
EDIT: We could also save a tree out to multiple folders like this using a save_mfdatatree method.
This might be particularly useful for users who want the benefit of a tree-like structure but are using a file format that doesn't support groups.
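As a rough sketch of what saving could look like, assuming one folder per node and a fixed file name inside each folder (both are assumptions, and node_file_path and this save_mfdatatree are hypothetical names). It relies on the subtree, path, has_data, and to_dataset members of xarray's DataTree; attribute handling is deliberately left as simple as possible, with each node written on its own.

```python
from pathlib import Path


def node_file_path(root, node_path, filename="data.nc"):
    """Return the file that would hold the node at ``node_path``.

    "/simulation/highres" maps to <root>/simulation/highres/<filename>,
    and the root node "/" maps to <root>/<filename>.
    """
    parts = [part for part in node_path.split("/") if part]
    return Path(root).joinpath(*parts, filename)


def save_mfdatatree(tree, root, filename="data.nc"):
    """Write every node that holds data to its own file under ``root``.

    Assumption: nodes are saved independently, so nothing is inherited
    from parent nodes and nodes without data variables are skipped.
    """
    for node in tree.subtree:
        if not node.has_data:
            continue
        target = node_file_path(root, node.path, filename)
        target.parent.mkdir(parents=True, exist_ok=True)
        node.to_dataset(inherit=False).to_netcdf(target)
```

Because each output file is a flat dataset, this layout would also work for formats that have no notion of groups.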
In the case of save_mfdatatree, where would it save the global and group level attributes? I see two paths:
- Each file preserves the upper levels. For instance, in your example, data.nc would still use groups inside it such as /experimental/data, but without the other branches, while preserving the global attributes as well as the attributes for experimental.
- Since attributes are relevant for all levels underneath them, the global attributes from the project would be carried to experimental (with experimental taking precedence for duplicates) and then carried to data (with data taking precedence for duplicates). For instance, data.nc would inherit the attribute Conventions from the top-level project. That way, data.nc would be complete and self-contained, without losing relevant information.
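The precedence rule in the second option can be illustrated with plain dictionaries. This is only a sketch of the merging logic; the function name inherit_attrs and the attribute values are made up for illustration.

```python
def inherit_attrs(levels):
    """Merge attribute dicts ordered from the root down to the target level.

    Deeper levels override shallower ones when a key is duplicated.
    """
    merged = {}
    for attrs in levels:
        merged.update(attrs)
    return merged


# Illustrative attributes for project -> experimental -> data.
project_attrs = {"Conventions": "CF-1.8", "title": "project"}
experimental_attrs = {"title": "experimental"}
data_attrs = {"comment": "raw measurements"}

data_file_attrs = inherit_attrs([project_attrs, experimental_attrs, data_attrs])
```

Here data_file_attrs keeps Conventions from the project, takes title from experimental, and adds the comment defined on data itself.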
Extending a little on the second option, it would be nice to be able to extract any level of the tree without losing information. This could be a layer before actually exporting to netCDF. If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?
In the case of save_mfdatatree, where would it save the global and group level attributes?
Each file preserves the upper levels.
I think this is what I was imagining. That's the most direct and simple mapping between an in-memory datatree and a set of folders and .nc files.
the global attributes from the project would be carried to experimental ...
I'm hesitant to do anything that introduces "inheritance" from nodes above like this. The problem is that different group-supporting formats have different hierarchical behaviours, so something that follows netCDF might be weird with Zarr. Ultimately the in-memory DataTree should only work one way, so a choice has to be made there (and so far I've gone for the simplest choice: independence between nodes). That said, you could imagine having a kwarg to save_mfdatatree that changes behaviour like this when saving.
If I have a DataTree object project like in your example and I'm only interested in the high-resolution output, is there already functionality for something like project.flatten("/simulation/highres") that preserves the upper levels, attributes, and variables?
There is no specific method for flattening parts of the tree, but we can make one! (xref #79) I'm not quite sure what you want it to do though: what type would you want project.flatten("/simulation/highres") to return?
Preserving the structure sounds wise. I have two suggestions on that:
- Keep track of where it came from. Maybe use the global attribute source in data.nc and output.nc, possibly pointing to the id attribute of the original project. The id should be unique if following ACDD-1.3.
- Some relevant information might be left on higher levels. In your example, let's assume that depth is common between highres and lowres, so it was stored on the simulation group level. In that case highres/output.nc and lowres/output.nc are incomplete, not self-contained. One option to avoid redundancy in higher-level variables would be to use the External Variables attribute.
On the flattening, I was thinking of something like project.subset(["/project/hires/temperature", "/project/hires/doxy"]).squeeze(). This would extract a subset of variables together with all other variables and attributes from the upper levels. Then, in some cases it might make sense to flatten "unnecessary layers". For instance, if the outcome is time, lat, lon, and sea surface height, I might just use a flat dataset. I envision cases where it makes sense to distribute a large, consistent, and complete dataset, such as a full simulation and its products, or all variables measured by different sensors aboard the same satellite, or all products from a single glider mission. But from the user's perspective, it is common for someone to be interested in a single branch of that hierarchical tree, and in that case, information spread over multiple levels adds unnecessary complexity.
I have no strong opinion on any of those; these are just ideas.
There is no specific method for flattening parts of the tree
Just found this: https://gitlab.eumetsat.int/open-source/netcdf-flattener/
@dcherian , thanks for pointing that out! @erget is a major contributor to the CF-Conventions and a great person to work with. Maybe there is a common interest here.
Our team is interested in open_mfdatatree and save_mfdatatree as well, but for the purpose of avoiding large files. Our total dataset size is hundreds of GB, so it would be nice to have a set of smaller netCDF files, one for each week of data.