
Feature Request: Hierarchical storage and processing in xarray

Open emilbiju opened this issue 5 years ago • 60 comments

I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:

  • Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.

  • When two data arrays having a shared dimension but different coordinate values along the dimension are merged into a Dataset, the union of coordinate values from the 2 data arrays becomes the new coordinate set corresponding to that dimension. Consequently, when the value of a variable in the dataset corresponding to a coordinate value is unknown, nan is used as a substitute which results in memory wastage.
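The padding described in the second point can be reproduced with a small sketch. The variable names and values below are made up for illustration; `xr.merge` with an outer join is the standard xarray call.

```python
import xarray as xr

# Two arrays share the dimension name "time" but cover different coordinate values.
temperature = xr.DataArray(
    [280.0, 281.0, 282.0], dims="time", coords={"time": [0, 1, 2]}, name="temperature"
)
population = xr.DataArray(
    [10.0, 11.0, 12.0], dims="time", coords={"time": [2, 3, 4]}, name="population"
)

# An outer join aligns both variables on the union of their coordinate values.
merged = xr.merge([temperature, population], join="outer")

time_values = merged["time"].values.tolist()
# Positions where "temperature" had no value are filled with NaN.
nan_count = int(merged["temperature"].isnull().sum())
print(time_values, nan_count)
```

Each variable now spans all five time steps although it only holds three real values, which is the memory overhead the issue describes.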

I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array.

To meet these requirements, I have implemented a data structure that also supports the below capabilities:

  • Standard xarray methods can be applied to the tree at all hierarchical levels, i.e., when a function is called at a hierarchical level, it is mapped over all data arrays that occur at the leaves under the corresponding node. For example, say I have a tree object (let's call it dt) with child nodes: weather, satellite image and population. Each of these nodes has data arrays/subtrees under it.


The mean over time of all data variables associated with weather can be obtained using dt.weather.mean('time') which applies the function to sea_surface_temperature, dew_point_temperature, wind_speed and pressure.

  • It can be encoded into the netCDF format, like xarray Datasets.
  • It supports item assignment at all hierarchical levels.
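The mapping behaviour behind dt.weather.mean('time') can be sketched without the actual implementation, which is not shown in this issue. In this sketch, nested dicts stand in for tree nodes and plain lists of floats stand in for data arrays; `map_over_leaves` is a hypothetical helper name.

```python
from statistics import mean

# Stand-in tree: dicts are nodes, lists of floats are the leaf "data arrays".
tree = {
    "weather": {
        "sea_surface_temperature": [290.0, 292.0],
        "wind_speed": [3.0, 5.0],
    },
    "population": {"count": [100.0, 300.0]},
}


def map_over_leaves(node, func):
    """Apply func to every leaf below node, preserving the hierarchy."""
    if isinstance(node, dict):
        return {name: map_over_leaves(child, func) for name, child in node.items()}
    return func(node)


# Calling the function at the "weather" level only touches leaves under that node.
weather_means = map_over_leaves(tree["weather"], mean)
print(weather_means)
```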

I would like to know of the possibility of introducing such a data structure in xarray and the challenges involved in the same.

emilbiju avatar Jun 01 '20 20:06 emilbiju

@emilbiju - thanks for opening an issue here. You may want to take a look at the conversation in #1092.

jhamman avatar Jun 01 '20 22:06 jhamman

Thanks @jhamman for sharing the link. Here are my thoughts on the same:

For use-cases similar to the one I have mentioned, I think it would be more meaningful to allow the tree structure (calling it Datatree further) to exist as a separate data structure instead of residing within the Dataset. From what I understand, the xarray Dataset would force all its component variables to share the same coordinate set for a given dimension name. This would again result in memory wastage with nan values when the value corresponding to a coordinate is unknown.

Besides, xarray only allows attribute access for getting (and not setting) values, but a separate data structure can allow attribute access for setting values as well. For example, the data structure that I have implemented would allow something like dt.weather = dt.weather.mean('time') to alter all the data arrays under the weather node.
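As a rough illustration of attribute access for both getting and setting, here is a minimal sketch. The `Node` class is hypothetical and is not the implementation referred to above; plain floats and lists stand in for data arrays.

```python
class Node:
    """Hypothetical tree node whose children are reachable as attributes."""

    def __init__(self, **children):
        # Bypass our own __setattr__ so "_children" lands in the instance dict.
        object.__setattr__(self, "_children", dict(children))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for child names.
        try:
            return self._children[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Assignment replaces (or adds) a child rather than an instance attribute.
        self._children[name] = value


dt = Node(weather=Node(wind_speed=[3.0, 5.0], pressure=[1000.0, 1010.0]))

# Replace the whole "weather" subtree in one assignment.
dt.weather = Node(wind_speed=4.0, pressure=1005.0)
print(dt.weather.wind_speed)
```

One consequence of this design is that child names share a namespace with the class's own attributes, so some names have to be reserved.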

I am currently using attribute-based access for accessing child nodes/data arrays in the Datatree as it appears to reflect the tree structure better, but as @shoyer has pointed out, tuple-based access might be easier to use programmatically.

Instead of using netCDF4 groups for encoding the Datatree, I am currently following a simple 3-step process:

  • Combine all the data arrays at the leaves of a Datatree object into a dataset.
  • Add an additional data array to the dataset that would contain an ancestor matrix (or any other array-like representation) that can encode the hierarchical structure with a coordinate set containing names of the tree nodes.
  • Use the xarray.Dataset.to_netcdf method to store it in a netCDF file.

Therefore, within the netCDF file, it would exist just as a Dataset. A specially implemented Datatree.open_datatree method can open the dataset, detect this additional array and recreate the tree structure to instantiate the object. I would like to know whether using netCDF4 groups instead provides any advantages over this approach?
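To make the ancestor-matrix idea concrete, here is one possible encoding and its inverse. The exact layout is an assumption (entry [a][d] is 1 when node a is an ancestor of node d), and the node names and helper functions are hypothetical.

```python
# Hypothetical tree given as child -> parent (None marks the root).
parents = {
    "root": None,
    "weather": "root",
    "wind_speed": "weather",
    "pressure": "weather",
    "population": "root",
}
names = list(parents)


def ancestor_matrix(parents, names):
    """Return matrix[a][d] == 1 when names[a] is an ancestor of names[d]."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for name in names:
        ancestor = parents[name]
        while ancestor is not None:
            matrix[index[ancestor]][index[name]] = 1
            ancestor = parents[ancestor]
    return matrix


def parents_from_matrix(matrix, names):
    """Recover child -> parent: the parent is the deepest ancestor of a node."""
    n = len(names)
    # Depth of a node equals the number of its ancestors (column sum).
    depth = [sum(matrix[a][d] for a in range(n)) for d in range(n)]
    recovered = {}
    for d, name in enumerate(names):
        ancestors = [a for a in range(n) if matrix[a][d]]
        recovered[name] = names[max(ancestors, key=depth.__getitem__)] if ancestors else None
    return recovered


matrix = ancestor_matrix(parents, names)
print(parents_from_matrix(matrix, names) == parents)
```

Stored as a 2-D variable with the node names as coordinates, such a matrix would travel inside an ordinary Dataset, at the cost of being a convention that other netCDF readers do not know about.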

emilbiju avatar Jun 02 '20 08:06 emilbiju

Thanks for writing this up, @emilbiju. These are very interesting ideas.

  1. The nice thing about using NetCDF groups (or HDF5?) is that it is a standard and your data files are readable using other software.

  2. So far, xarray has been reluctant to add "groups" or this kind of hierarchical organization because of all the additional complexity involved (#1092)

  3. That said, there is definitely interest in a package that provides a high-level object composed of multiple xarray datasets (again #1092). So I encourage you to post your code online so others can try it out and iterate.

    a. For example, our friends over at Arviz have an InferenceData structure composed of multiple Datasets that is represented on-disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html


dcherian avatar Jun 02 '20 16:06 dcherian

I would be open to exploring adding a hierarchical data structure into xarray (on an experimental basis, to start), but it would need someone with serious interest and time to make it happen. Certainly there are plenty of use cases across various fields.

shoyer avatar Jun 03 '20 21:06 shoyer

The data model you sketch out here looks very similar to what we discussed in #1092. I agree that the semantics are well defined.

The main question in my mind is whether it would make more sense to make an entirely new data structure (e.g., xarray.TreeDataset) or add in a new feature like groups to the existing xarray.Dataset.

Probably a new data structure would be easier at this point, because it would keep Dataset simpler and wouldn't break existing code that works on xarray.Dataset.

shoyer avatar Jun 03 '20 21:06 shoyer

@joshmoore - based on https://github.com/pangeo-forge/pangeo-forge/pull/27#issuecomment-755397835, you may be interested in this issue. One way to do multiscale datasets in Xarray would be to use hierarchical groups (one group per scale).

jhamman avatar Jan 06 '21 18:01 jhamman

a. For example, our friends over at Arviz have an InferenceData structure composed of multiple Datasets that is represented on-disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

Just a note that this link has moved to: https://arviz-devs.github.io/arviz/getting_started/XarrayforArviZ.html

davidbrochart avatar Jan 07 '21 09:01 davidbrochart

Thanks for the link, @jhamman. The most immediate issue I ran into when trying to use xarray with OME-Zarr data does seem similar. A rough representation of one multiscale image is:

image_pyramid:
  |_ zyx_array_high_res
  |_ zyx_array_mid_res
  |_ zyx_array_low_res

but of course the x, y and z dimensions are of different sizes in each volume.

joshmoore avatar Jan 07 '21 15:01 joshmoore

@jhamman @joshmoore a prototype to bring together XArray and OME-Zarr/NGFF with multiple groups: https://github.com/OpenImaging/miqa/blob/master/server/scripts/compress_encode.py

thewtex avatar Feb 10 '21 15:02 thewtex

On today's Xarray dev call, we discussed pursuing another CZI grant to support this feature in Xarray. The image pyramid use case would provide a strong link to the bioimaging community. @alexamici and the B-open folks seem enthusiastic.

I had to leave the meeting early, so I didn't hear the end of the conversation. But did we decide who might serve as PI for such a proposal?

rabernat avatar Mar 17 '21 16:03 rabernat

But did we decide who might serve as PI for such a proposal?

No.

@emilbiju are you interested in open-sourcing your work?

dcherian avatar Mar 17 '21 17:03 dcherian

FWIW, a while ago I wrote a mock-up (and probably outdated) DatasetNode class:

https://gist.github.com/benbovy/92e7c76220af1aaa4b3a0b65374e233a (nbviewer link)

benbovy avatar Mar 18 '21 09:03 benbovy

This is related to some very recent work we have been doing at NSLS-II, primarily led by @danielballan.

tacaswell avatar Mar 19 '21 14:03 tacaswell

Not really sure if there is anything we can do from ArviZ to help with that; if there is, let us know and we'll do our best. cc @percygautam

OriolAbril avatar Mar 23 '21 07:03 OriolAbril

@alexamici and I can write the technical part of the proposal.

aurghs avatar Mar 25 '21 06:03 aurghs

Happy to provide assistance on the image pyramid (i.e. "multiscale") use case.

joshmoore avatar Mar 25 '21 09:03 joshmoore

So we have:

  • Numerous promising prototypes to draw from
  • A technical team who can write the proposal and execute the proposed work (@aurghs & @alexamici of B-open)
  • Numerous supporting use cases from the bioimaging (@joshmoore), condensed matter (@tacaswell), and Bayesian modeling (ArviZ; @OriolAbril) domains

We are just missing a PI, someone who is willing to put their name on top of the proposal and click submit. I have gone on record as committed to not leading any new proposals this year. And in any case, this is a good opportunity for someone else from the @pydata/xarray core dev team to try on a leadership role.

rabernat avatar Mar 25 '21 13:03 rabernat

I volunteer to contribute writing to this from the condensed matter / synchrotron user facility perspective.

danielballan avatar Mar 25 '21 13:03 danielballan

I can shoulder part of the load and help is definitely needed. LOI is due on Tuesday. I'll take a stab this evening and post a link.

dcherian avatar Mar 25 '21 15:03 dcherian

Here are some biomedical papers that use ArviZ, and therefore xarray, even though most don't cite xarray and some don't cite ArviZ either. The topics are quite diverse: COVID, psychology, biomolecules, oncology...

Some ArviZ recent biomedical citations
  • Arroyuelo, A., Vila, J., & Martin, O. A. (2020). Exploring the quality of protein structural models from a Bayesian perspective. bioRxiv.
  • Axen, S. D. (2020). Representing Ensembles of Molecules (Doctoral dissertation, UCSF).
  • Brauner, J. M., Mindermann, S., Sharma, M., Johnston, D., Salvatier, J., Gavenčiak, T., ... & Kulveit, J. (2021). Inferring the effectiveness of government interventions against COVID-19. Science, 371(6531).
  • Busch-Moreno, S., Tuomainen, J., & Vinson, D. (2020). Trait Anxiety Effects on Late Phase Threatening Speech Processing: Evidence from EEG. bioRxiv.
  • Busch-Moreno, S., Tuomainen, J., & Vinson, D. (2021). Semantic and prosodic threat processing in trait anxiety: is repetitive thinking influencing responses?. Cognition and Emotion, 35(1), 50-70.
  • Dehning, J., Zierenberg, J., Spitzner, F. P., Wibral, M., Neto, J. P., Wilczek, M., & Priesemann, V. (2020). Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions. Science, 369(6500).
  • Heilbron, E., Martìn, O., & Fumagalli, E. (2020). Efectos protectores de los alimentos andinos contra el daño producido por el alcohol a nivel del epitelio intestinal, una aproximación estadística. Ciencia, Docencia y Tecnología, 31(61 nov-mar).
  • Legrand, N., Nikolova, N., Correa, C., Brændholt, M., Stuckert, A., Kildahl, N., ... & Allen, M. (2021). The heart rate discrimination task: a psychophysical method to estimate the accuracy and precision of interoceptive beliefs. bioRxiv.
  • Wang, Y. (2020, September). Data Analysis of Psychological Measurement of Intelligent Internet-assisted Sports Training based on Bio-Sensors. In 2020 International Conference on Smart Electronics and Communication (ICOSEC) (pp. 474-477). IEEE.
  • Wasserman, A., Shrager, J., & Shapiro, M. A Multilevel Bayesian Model for Precision Oncology.
  • Weindel, G., Anders, R., Alario, F. X., & Burle, B. (2020). Assessing model-based inferences in decision making with single-trial response time decomposition. Journal of Experimental Psychology: General.
  • Yamagata, Y. (2020). Simultaneous estimation of the effective reproducing number and the detection rate of COVID-19. arXiv e-prints, arXiv-2005.

OriolAbril avatar Mar 26 '21 02:03 OriolAbril

I'm excited to see this coming together! I would be happy to advise as well...

Side note: at some point, this would probably be worth adding to Xarray's official roadmap.

shoyer avatar Mar 26 '21 03:03 shoyer

We could also provide a use-case in remote sensing: it would be really useful in interferometric processing for managing Sentinel-1 IW and EW SLC data, which has multiple tiles (bursts) partially overlapping in one direction (azimuth).

aurghs avatar Mar 26 '21 09:03 aurghs

This sounds like an interesting project - I'm also about to be able to work on xarray much more directly (thanks @rabernat ).

Should I add this as another xarray project board alongside explicit indexes and so on?

I wonder if this could find another domain use case in plasmapy as part of the overall plasma object @StanczakDominik? At the very least this would allow you to store all the various equilibrium and diagnostics information that goes in an EFIT file.

TomNicholas avatar Mar 26 '21 16:03 TomNicholas

Whoa, that sounds awesome! Thanks for the heads up :) Definitely could be quite handy, looking forward to seeing how this develops. @rocco8773 this should be interesting for you as well :)

StanczakDominik avatar Mar 27 '21 08:03 StanczakDominik

For scientific imaging, i.e. biomicroscopy, medical imaging, where xarray compatibility is being considered in the NGFF, it would be helpful to avoid unnecessary divergence by ensuring the proposed hierarchical storage is compatible. This would mean:

  1. Each scale / group can be independently treated as an xarray.Dataset.
  2. They are organized in such a way that the collection of scales can be referenced as it is now, i.e. as a collection of paths,
  {
    "multiscales": [
      {
        "datasets": [
          {"path": "0"},
          {"path": "1"},
          {"path": "2"},
          {"path": "3"},
          {"path": "4"}
        ],
        "version": "0.1"
      }
    ]
  }
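As a small sketch of how a reader could resolve the scales from this metadata, the snippet below parses the structure above with the standard json module and collects the group paths. How each path is then opened as an xarray.Dataset is left out, since that depends on the storage backend.

```python
import json

# The "multiscales" metadata from above, as a JSON string.
metadata_text = """
{
  "multiscales": [
    {
      "datasets": [
        {"path": "0"}, {"path": "1"}, {"path": "2"}, {"path": "3"}, {"path": "4"}
      ],
      "version": "0.1"
    }
  ]
}
"""

metadata = json.loads(metadata_text)

# One entry per scale; each path names a group that can be treated independently.
scale_paths = [entry["path"] for entry in metadata["multiscales"][0]["datasets"]]
print(scale_paths)
```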

thewtex avatar May 06 '21 13:05 thewtex

Picking up on @dcherian's https://github.com/pydata/xarray/issues/4118#issuecomment-806954634 and @rabernat's https://github.com/ome/ngff/issues/48#issuecomment-833456889, Zarr was also accepted to the second round and certainly references this issue in case we want to sync up. (Apologies if I missed where that discussion moved.)

joshmoore avatar May 06 '21 14:05 joshmoore

A simple comment/question:

In xarray.Dataset, why not just use the Unix-path notation into a "flat" dict model?

Actually, netCDF4 implements this Unix-like path access to groups and variables: /path/to/group/variable.

All of the hierarchical stuff (e.g., getting a sub-Dataset from a random group) and conventions (e.g., the dimension scoping rule) would then be driven by the parsing of strings only. It's all about symbolic names (like in a file system, right?) and there would no longer be any hierarchical data in memory.
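A minimal sketch of what I mean, using a plain dict rather than xarray.Dataset. The paths, values and the `subgroup` helper are hypothetical.

```python
# Flat store: every variable is keyed by its full Unix-like path.
flat = {
    "/weather/wind_speed": [3.0, 5.0],
    "/weather/pressure": [1000.0, 1010.0],
    "/satellite/band1/reflectance": [0.1, 0.2],
}


def subgroup(store, group):
    """Return the variables below group, keyed by their path relative to it."""
    prefix = group.rstrip("/") + "/"
    return {
        path[len(prefix):]: value
        for path, value in store.items()
        if path.startswith(prefix)
    }


# Getting a "sub-Dataset" is only string matching on the keys.
weather = subgroup(flat, "/weather")
print(sorted(weather))
```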

My question is then: Are there some tricky points for xarray.Dataset not to go this simple way?

Some related remarks:

  • About the attribute access to variables: I don't really know why this exists at all, since it mixes unrelated namespaces: (1) the class internals and (2) the user's variables. Mixing namespaces seems very bad to me: it makes some variable names forbidden in order to avoid any collision between the two namespaces, and it usually implies unnecessarily complex code with corner cases to deal with.
  • About netCDF4 being a self-described format: xarray API has open_dataset(filepath), but this function is unable to read the whole file in memory without getting help from a priori file content description, i.e., the names of the groups if you follow me. Considering xarray for simple tasks like geographical-selection-cropping, it seems to ignore the self-describing nature of netCDF4 format. As far as I can understand the situation, a "flat" model could be a good way to go.

nbercher avatar May 21 '21 10:05 nbercher

cc @d-v-b and https://github.com/JaneliaSciComp/xarray-multiscale

dcherian avatar May 21 '21 17:05 dcherian

Flagging another possible use case, this time in Magnetic Confinement Fusion: representing the IMAS data model.

IMAS is currently closed-source (being part of the ITER project), but there is a big push to make it open-source and the standard data model for tokamak plasma data.

I'm not very familiar with IMAS (@smithsp and @orso82 are more so), but it is hierarchical. There is some more information in appendix A3 of this paper, which talks about "taking advantage of the homogeneity of grid sizes that is commonly found across arrays of structures", which sounds very closely related to the DataTree proposal.

This might allow the xarray.DataTree to do more of the heavy-lifting within OMAS (which already uses xarray, and is intended to be compatible with IMAS).

TomNicholas avatar Jul 02 '21 18:07 TomNicholas

@martinitus raises a really interesting point about tags vs hierarchical structures over in https://github.com/pydata/xarray/issues/1092#issuecomment-868324949

However, one point I didn't see in the discussion is the following:

Hierarchical structures often force a user to come up with some arbitrary order of hierarchy levels. The classical example is document filing: do you put your health insurance documents under /insurance/health/2021, 2021/health/insurance,....?

One solution to that is a tagging of documents instead of putting them into a hierarchy. This would give the full flexibility to retrieve any flat DataSet out of a TaggedDataSet by specifying the set of tags that the individual DataArrays must be listed under.

I think using tags is a really interesting alternative to hierarchies. I don't have a clear sense of the overall tradeoffs, though.
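To make the comparison a bit more concrete, here is a toy sketch of tag-based selection. The names, tags and the `select` helper are hypothetical, and variable names stand in for the DataArrays themselves.

```python
# Hypothetical tagged collection: variable name -> set of tags.
tagged = {
    "health_2021": {"insurance", "health", "2021"},
    "health_2020": {"insurance", "health", "2020"},
    "car_2021": {"insurance", "car", "2021"},
}


def select(collection, *tags):
    """Return the names of all variables that carry every requested tag."""
    wanted = set(tags)
    return sorted(name for name, item_tags in collection.items() if wanted <= item_tags)


# The order of the tags does not matter, unlike the levels of a hierarchy.
print(select(tagged, "health", "2021"))
print(select(tagged, "2021", "insurance"))
```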

shoyer avatar Jul 02 '21 19:07 shoyer