DOC: zarr spec v3: adds optional dimensions and the "netZDF" format

Open shoyer opened this issue 6 years ago • 59 comments

xref #167

For ease of review, this is currently written by modifying docs/spec/v2.rst, but this would of course actually be submitted as a separate v3 file.

This does not yet include any changes to the zarr reference implementation, which would need to grow at least:

  • Array.dimensions
  • Group.dimensions
  • Group.resize for simultaneously resizing multiple arrays in a group which share the same dimension (conflicting dimension sizes are forbidden by the spec)
  • Group.netzdf for indicating whether a group satisfies the netzdf spec or not.

Note: I do like "netzdf" but I'm open to less punny alternatives :).
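To make the proposed additions concrete, here is a rough sketch of the `Group.resize`/`dimensions` semantics in plain Python. The `Array` and `Group` classes and their dict layout are illustrative stand-ins, not the zarr implementation:

```python
# Hypothetical sketch: resizing a named dimension resizes every array in the
# group that shares it, and conflicting sizes are rejected up front.

class Array:
    def __init__(self, dims, shape):
        assert len(dims) == len(shape)
        self.dims = list(dims)    # e.g. ["time", "lat"]
        self.shape = list(shape)

class Group:
    def __init__(self, arrays):
        self.arrays = arrays      # name -> Array
        self._check_consistent()

    @property
    def dimensions(self):
        # Map each named dimension to its (consistent) size.
        sizes = {}
        for arr in self.arrays.values():
            for d, s in zip(arr.dims, arr.shape):
                sizes[d] = s
        return sizes

    def _check_consistent(self):
        sizes = {}
        for name, arr in self.arrays.items():
            for d, s in zip(arr.dims, arr.shape):
                if sizes.setdefault(d, s) != s:
                    raise ValueError(f"conflicting sizes for dimension {d!r}")

    def resize(self, **new_sizes):
        # Resize every array sharing each named dimension simultaneously.
        for arr in self.arrays.values():
            for i, d in enumerate(arr.dims):
                if d in new_sizes:
                    arr.shape[i] = new_sizes[d]

g = Group({
    "temperature": Array(["time", "lat"], [10, 5]),
    "time": Array(["time"], [10]),
})
g.resize(time=20)
print(g.dimensions)  # {'time': 20, 'lat': 5}
```

The point of the sketch is the invariant: a shared dimension has exactly one size across the whole group, so a resize is necessarily a group-level operation.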

shoyer avatar Jul 15 '18 22:07 shoyer

Thanks @shoyer for writing this up. I had been using ZCDF as the acronym for this feature set in zarr but also don't have strong feelings about the name at this point. FWIW, @alimanfoo, @rabernat, @mrocklin, and I have had a few offline exchanges on the subject (see https://github.com/jhamman/zcdf_spec for a Zarr subspec that describes what xarray has currently implemented). Without speaking for anyone else, I think there is growing excitement about the concept of a Zarr+NetCDF data model.

jhamman avatar Jul 16 '18 02:07 jhamman

Great to see this. I like the design, it's simple and intuitive. Couple of questions...

  1. Do you need a way to handle coordinate variables, i.e., being able to express the fact that an array contains the coordinates for some dimension?

  2. These features could be implemented within .zattrs without requiring any changes to the spec. I'm open to considering a spec change, and there may be other reasons for wanting to update the spec in the not-too-distant future (xref #244). But a spec change will mean some disruption for existing users and some minor complexities about supporting migration etc. Ultimately I'll be happy to follow the consensus but was just wondering what the rationale was for including these changes within the core metadata and not .zattrs.
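For concreteness, here is roughly what the .zattrs route looks like: the dimension names live entirely in user attributes, and the core .zarray metadata is unchanged. The key name `_ARRAY_DIMENSIONS` is the one xarray's de-facto convention uses (per later comments in this thread); the values shown are illustrative:

```python
import json

# User attributes carry the dimension names; an implementation that doesn't
# understand them can ignore .zattrs entirely and remain conformant.
zattrs = {"_ARRAY_DIMENSIONS": ["time", "lat", "lon"]}

# The core v2 array metadata stays exactly as the spec defines it today.
zarray = {
    "zarr_format": 2,
    "shape": [100, 180, 360],
    "chunks": [10, 180, 360],
    "dtype": "<f8",
    "compressor": None,
    "fill_value": "NaN",
    "order": "C",
    "filters": None,
}

print(json.dumps(zattrs))
```

This illustrates the trade-off under discussion: no spec change is needed, but the dimension names sit in user space rather than in metadata that every implementation is required to understand.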

alimanfoo avatar Jul 16 '18 08:07 alimanfoo

These features could be implemented within .zattrs without requiring any changes to the spec

I find myself agreeing with this. I think that ideally Zarr would remain low-level and that we would provide extra conventions/subspecs on top of it.

My understanding is that one reason for HDF's current inertia is that it had a bunch of special features glommed onto it by various user communities. If we can separate these out that might be nice for long term maintenance.

mrocklin avatar Jul 16 '18 12:07 mrocklin

Do you need a way to handle coordinate variables, i.e., being able to express the fact that an array contains the coordinates for some dimension?

No, for more sophisticated metadata needs we can simply use a subset of CF Conventions. These are pretty standard for applications that handle netCDF files, like xarray.

Ultimately I'll be happy to follow the consensus but was just wondering what the rationale was for including these changes within the core metadata and not .zattrs.

This is a good question. Mostly it comes down to having the specs all in one place, so it's obvious where to find this convention for everyone implementing the zarr spec. Dimensions are broadly useful enough for self-described data that I think people in many fields would find them useful. In particular, I would hate to see separate communities develop their own specs for dimensions, just because they didn't think to search for zarr netcdf.

I also think there are probably use cases for defining named dimensions on some but not all arrays and/or axes. This wouldn't make sense as part of the "netzdf" spec which xarray would require.

Finally, incorporating dimensions directly in the data model follows precedent from the netCDF format itself, which is actually pretty simple. I agree that we don't want to make Zarr as complex as HDF5 (which is part of why these aren't full-fledged dimension scales), but adding a couple of optional metadata keys is only a small step in that direction.

shoyer avatar Jul 16 '18 16:07 shoyer

Have not thought about this too deeply yet. So this is just a very rough idea that we can discard if it doesn't fit, but what if we added ZCDF as a library within the org that built off Zarr? This would address some of the discoverability, and feature creep concerns raised thus far. It would also eliminate the need for things like checks as to whether the NetCDF spec is implemented by specific objects.

jakirkham avatar Jul 16 '18 16:07 jakirkham

If dimensions are applicable enough across other domains then I'm happy to relax my objections. I think that it would be useful to hear from people like @jakirkham (who comes from imaging) if this sort of change would be more useful or burdensome for his domain.

mrocklin avatar Jul 16 '18 16:07 mrocklin

This is a good question. Mostly it comes down to having the specs all in one place, so it's obvious where to find this convention for everyone implementing the zarr spec. Dimensions are broadly useful enough for self-described data that I think people in many fields would find them useful. In particular, I would hate to see separate communities develop their own specs for dimensions, just because they didn't think to search for zarr netcdf.

FWIW we could add this as a "NetZDF spec" (or whatever name) alongside the existing storage specs in the specs section of the Zarr docs (http://zarr.readthedocs.io/en/stable/spec.html). It should be pretty visible there (in fact it might be more visible, as it would get its own heading in the toc tree).

I would be keen to minimise disruption for existing users and implementers if possible. A spec version change would imply some inconvenience, even if relatively small, as existing data would need to be migrated.

alimanfoo avatar Jul 16 '18 16:07 alimanfoo

My understanding is that this proposal is entirely compatible (both backwards and forwards) with existing data

mrocklin avatar Jul 16 '18 16:07 mrocklin

My understanding is that this proposal is entirely compatible (both backwards and forwards) with existing data

Indeed. I considered naming this "v2.1" based on semantic versioning until I saw that the zarr spec only uses integer versions.

The only backwards incompatibility it introduces is the addition of new optional metadata fields. I would hope that any existing software would simply ignore these, rather than assume that no fields could ever be introduced in the future.

shoyer avatar Jul 16 '18 16:07 shoyer

Have not thought about this too deeply yet. So this is just a very rough idea that we can discard if it doesn't fit, but what if we added ZCDF as a library within the org that built off Zarr? This would address some of the discoverability, and feature creep concerns raised thus far. It would also eliminate the need for things like checks as to whether the NetCDF spec is implemented by specific objects.

Yes, this makes some amount of sense.

The main downside of incorporating these changes into zarr proper is that for netCDF compatibility we really want the guarantee of consistent dimension sizes between arrays. This would require a small amount of refactoring and additional complexity to achieve within the Zarr library.

shoyer avatar Jul 16 '18 16:07 shoyer

From a neuroscience data perspective, this gets pretty complicated pretty fast if one wants to be general. Please see NWB as an example. Personally wouldn't want Zarr to take any of this on. It would be better handled in a library on top of Zarr. Note NWB currently is built on top of HDF5, but it would be reasonable to consider an NWB spec on top of Zarr.

Can't speak to the Earth Sciences or what people in this field want out of Zarr. If dimensionality is the one thing desired, maybe this is ok. If there are 5 or 6 more things needed in the pipe, maybe having a library built on top of Zarr would be better. Would be good if some people could answer these sorts of questions.

jakirkham avatar Jul 16 '18 17:07 jakirkham

Sorry for the multiple posts. GitHub is having some sort of issue that is affecting me.

jakirkham avatar Jul 16 '18 17:07 jakirkham

Some miscellaneous thoughts about dimensionality in our field since Matt asked.

Naming dimensions has certainly come up before. Here is one example (https://ukoethe.github.io/vigra/doc-release/vigranumpy/index.html#axistag-reference) and another (https://ukoethe.github.io/vigra/doc-release/vigranumpy/index.html#vigra.VigraArray.withAxes). Also some discussion about axes names in this comment and below (https://github.com/imageio/imageio/issues/263#issuecomment-306718628). Scientists definitely like having this sort of feature, as it helps them keep track of what something means and is useful if the order ever needs to change for an operation. So this sort of use case benefits from the proposal.

The other thing that typically comes to mind when discussing dimensions, which I don't think has come up thus far, is units. It's pretty useful to know something is in ms, mV, or other relevant units. Libraries like quantities (http://python-quantities.readthedocs.io/en/latest/user/tutorial.html) or pint (http://pint.readthedocs.io/en/latest/) are useful for tracking units and combining them sensibly. This could be an addition to the proposal or perhaps something to add to a downstream data format library.

For tracking time, in some cases we have timestamps. This supplants the need for dimensions or units and often parallels other information (e.g. snapshots of other data at a particular time). This could use existing things like structured arrays.

However, when applying some basic machine learning, dimensions pretty quickly become a mess, especially if various different kinds of dimensions get mixed together. For example, PCA is a pretty common technique to perform in a variety of cases to find the biggest contributors to some distribution. The units of this sort of thing are frequently strange and difficult to think about. This case probably either needs a different proposal, or users have to work with existing metadata information to make this work for their use case.

jakirkham avatar Jul 16 '18 18:07 jakirkham

Also cc'ing @ambrosejcarr and @cadair to add some domain breadth to this discussion

mrocklin avatar Jul 16 '18 18:07 mrocklin

From the point of view of the Unidata netCDF group, named dimensions (shared dimensions in netCDF parlance) are essential for managing coordinate variables. So the netCDF extension to Zarr (or possibly TileDB) will include only named dimensions, and anonymous dimensions will probably be suppressed. We went around about this with the HDF5 group long ago. One of the sticking points was multi-dimensional coordinate variables.

DennisHeimbigner avatar Jul 16 '18 20:07 DennisHeimbigner

Speaking as a user in the genomics domain, I certainly would find this feature useful; it is common to have multiple arrays sharing dimensions. I don't have broad experience in other domains but expect this feature to be generally very useful. So I am very supportive and would like to give this as much prominence as possible.

My reasons for leaning towards use of .zattrs are not meant in any way to diminish the importance or broad applicability of this feature; they are based purely on technical considerations, basically on what is easiest to implement and provides the least disruption for existing users and implementers.

My understanding is that this proposal is entirely compatible (both backwards and forwards) with existing data

Yes in theory, although unfortunately it's not quite that simple in practice. I'll try to unpack some details about versioning and change management in Zarr. Btw, I'm not suggesting this is ideal or the best solution; thinking ahead about possible changes and managing compatibility is quite hard.

This proposal adds a new feature (dimensions) to the Zarr storage spec. This feature is optional in two senses. First it is optional in that it specifies elements that do not need to be present in the array or group metadata. Second it is optional for the implementation, i.e., an implementation can ignore these elements if present in the metadata and still be conformant with the storage spec.

When I wrote the v2 storage spec and was thinking about mechanisms for managing change, for better or worse, I did not allow any mechanisms for adding optional features to the spec. There is no concept of minor spec versions, only major versions (single version number). The only way the spec can change is via a major version change, which implies a break in compatibility. If the current implementation finds anything other than “2” as the spec version number in array metadata, it raises an exception. The spec does not define any concept of optional features or leave open the possibility of introducing them (other than via a major version change).

If I had been farsighted, I might have seen this coming, and I might have defined a notion of optional features, which could be introduced via a minor version increment to the spec, with implementations including some flexibility in matching the format minor version number when deciding if they can read some data or not. To be fair, I did give this some thought, although I couldn't have articulated it very well at the time. In the end I decided on a simple versioning policy, I think partly because it was simple to articulate and implement, and also because I thought that the user attributes (.zattrs) always provided a means for optional functionality to be layered on. Also, the separation between .zattrs and core metadata (.zarray, .zgroup) is nice in a way because it makes it very clear where the line is between optional and required features. I.e., to be conformant, a minimal implementation has to understand everything in .zarray, and can ignore everything in .zattrs.

So given all this, there are at least three options for how to introduce this feature. In the options below, by "old code" I mean the current version of the zarr package (which does not implement this feature), by "old data" I mean data created using old code, by "new code" I mean the next version of the zarr package (which does implement this feature), and by "new data" I mean data created using new code.

Option 1: Use .zattrs, write this as a separate spec. Full compatibility, old code will be able to read new data, and new code will be able to read old data.

Option 2: Use .zarray/.zgroup, incorporate into the storage spec, major version bump (v3). Old code will not be able to read new data. New code can read old data if data is migrated (which just requires replacing the version number in metadata files) or if new code is allowed to read both v2 and v3.

Option 3: Use .zarray/.zgroup, incorporate into the storage spec but leave spec version unchanged (v2). Full compatibility, old code will be able to read new data, and new code will be able to read old data. However, this is potentially confusing because the spec has changed but the spec version number hasn’t.

Hence I lean towards option 1 because it has maximum compatibility and seems simplest/least disruptive. But very happy to discuss. And I’m sure there are other options too that I haven’t thought of.
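The migration step mentioned under option 2 (replacing the version number in metadata files) is small enough to sketch. The in-memory store dict, the key names, and the target version number here are illustrative, not a real migration tool:

```python
import json

# Toy zarr v2 store: keys are paths, values are JSON metadata documents.
store = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "foo/.zarray": json.dumps({
        "zarr_format": 2, "shape": [10], "chunks": [10], "dtype": "<i4",
        "compressor": None, "fill_value": 0, "order": "C", "filters": None,
    }),
}

def migrate_v2_to_v3(store):
    # Rewrite the version number in every .zarray/.zgroup document in place.
    for key, raw in store.items():
        if key.endswith((".zarray", ".zgroup")):
            meta = json.loads(raw)
            if meta.get("zarr_format") == 2:
                meta["zarr_format"] = 3
                store[key] = json.dumps(meta)

migrate_v2_to_v3(store)
print(json.loads(store[".zgroup"])["zarr_format"])  # 3
```

Even though the rewrite itself is trivial, it still has to touch every metadata document in every existing store, which is the inconvenience being weighed here.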

alimanfoo avatar Jul 16 '18 21:07 alimanfoo

The other thing that typically comes to mind when discussing dimensions, which I don't think has come up thus far is units.

For tracking time in some cases we have timestamps.

@jakirkham - both of these issues arise in geoscience use cases. We handle them by providing metadata that follows CF conventions and then using xarray to decode the metadata into appropriate types (like a `numpy.datetime64`). This works today with zarr + xarray and doesn't require any changes to the spec.

rabernat avatar Jul 16 '18 22:07 rabernat

From the point of view of the Unidata netCDF group, named dimensions (shared dimensions in netCDF parlance) are essential for managing coordinate variables. So the netCDF extension to Zarr (or possibly TileDB) will include only named dimensions, and anonymous dimensions will probably be suppressed. We went around about this with the HDF5 group long ago. One of the sticking points was multi-dimensional coordinate variables.

Yes, this is why I want named dimensions.

I don't think we need explicit support for multi-dimensional coordinate variables in Zarr. NetCDF doesn't have explicit support for coordinates at all, and we get along just fine using CF conventions.

HDF5's dimension scales include coordinate values as well as dimension names. But in my opinion this is unnecessarily complex. Simple conventions like treating variables with the same name as a dimension as supplying coordinate values are sufficient.
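The convention described above fits in a few lines of code. The `dims` mapping is hypothetical data (array name to its dimension names), not a real store:

```python
# An array is treated as a coordinate variable when its name matches one of
# its own dimensions -- no extra metadata or bidirectional references needed.
dims = {
    "time": ["time"],
    "lat": ["lat"],
    "temperature": ["time", "lat"],
}

def coordinate_variables(dims):
    return {name for name, d in dims.items() if name in d}

print(sorted(coordinate_variables(dims)))  # ['lat', 'time']
```

Contrast this with HDF5 dimension scales, where the coordinate relationship is stored explicitly as references that must be kept consistent.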

shoyer avatar Jul 16 '18 22:07 shoyer

It might be useful for the discussion if I explain what xarray currently does to add dimension support to zarr stores. This might help clarify some of the tradeoffs between option 1 (just use .zattrs) vs. options 2/3.

When xarray creates a zarr store from an xarray dataset, it always creates a group. On each array in the group, it creates an attribute called _ARRAY_DIMENSIONS. The contents of the attribute are a list whose length equals the array's ndim; the items give the dimension name of each axis.

When the group is loaded, xarray checks for the presence of this key in the attributes of each array. If it is missing, it raises an error: xarray can't read arbitrary zarr stores, only those that match its de-facto spec. If it finds _ARRAY_DIMENSIONS, it uses it to populate the variable dimensions. (Xarray's internal consistency checks would raise an error if there were a conflict in sizes or if the dimension coordinate variables were not present in the group.) Xarray also has to hide this attribute from the user so that it can't be directly read or modified at the xarray level. This last step is a price we pay for the fact that the dimensions property is part of the "user space" metadata (.zattrs) rather than the core zarr metadata (.zarray).
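A minimal sketch of that read-side check, with plain dicts standing in for zarr groups and arrays (this is an illustration of the logic described, not xarray's actual code):

```python
# Each array must carry _ARRAY_DIMENSIONS, its length must match the array's
# ndim, and shared dimensions must agree in size across the whole group.
group = {
    "temperature": {"shape": (100, 180),
                    "attrs": {"_ARRAY_DIMENSIONS": ["time", "lat"]}},
    "time": {"shape": (100,),
             "attrs": {"_ARRAY_DIMENSIONS": ["time"]}},
}

def read_dimensions(group):
    sizes = {}
    for name, arr in group.items():
        try:
            dims = arr["attrs"]["_ARRAY_DIMENSIONS"]
        except KeyError:
            raise ValueError(f"array {name!r} has no _ARRAY_DIMENSIONS; "
                             "not an xarray-compatible store")
        if len(dims) != len(arr["shape"]):
            raise ValueError(f"{name!r}: dimension names do not match ndim")
        for d, s in zip(dims, arr["shape"]):
            if sizes.setdefault(d, s) != s:
                raise ValueError(f"conflicting sizes for dimension {d!r}")
    return sizes

print(read_dimensions(group))  # {'time': 100, 'lat': 180}
```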

rabernat avatar Jul 16 '18 22:07 rabernat

NetCDF doesn't have explicit support for coordinates at all

I do not believe this is completely correct; there is no syntactic support, but if you look at the netcdf 3 and 4 specifications, it is part of the netcdf semantics.

DennisHeimbigner avatar Jul 16 '18 22:07 DennisHeimbigner

WRT things like units, you need to be very careful about embedding domain-specific semantics into the data model. Our experience is that this is best left to metadata conventions.

DennisHeimbigner avatar Jul 16 '18 22:07 DennisHeimbigner

Remember that the same dimension may be used in multiple variables, so it is probably not a good idea to attach dimension information (other than the name) to a variable.

DennisHeimbigner avatar Jul 16 '18 22:07 DennisHeimbigner

Just wanted to briefly chime in that I'm very happy to see NetCDF folks active in this discussion.

mrocklin avatar Jul 16 '18 22:07 mrocklin

BTW, one common example of multidimensional coordinate variables is when defining a trajectory data set.

DennisHeimbigner avatar Jul 16 '18 22:07 DennisHeimbigner

@alimanfoo I suspect this will not be the last change we will want in the zarr spec (e.g., to support complex numbers), so it might make sense to "bite the bullet" now with a major version number increase, and at the same time establish a clear policy on forwards compatibility for Zarr. I am confident that Zarr will only become more popular in the future!

I would suggest looking at the forwards compatibility policies from HDF5 and protocol buffers for inspiration:

  • HDF5 files write the minimum possible version number that supplies all needed features, to ensure that old clients can read files written with newer versions of the HDF5 library.
  • Protocol buffers are a domain-specific language for writing custom file formats with automatically generated interfaces. They are widely used at Google and elsewhere (e.g., Apache Arrow uses a protocol buffer successor called flatbuffers). Hard experience has taught us that the right way to handle forward compatibility concerns is to ensure that protocol buffer implementations ignore but preserve unknown fields. Protocol buffers are designed to evolve by adding new fields, but changing the meaning of existing fields is strongly discouraged (this would correspond to a major version bump).

Going forward, I would suggest the following forward and backwards compatibility policies, which we can add to the spec:

  • Backwards compatibility: As much as practical, new versions of the Zarr library should support reading files generated with old versions of the spec (e.g., the zarr library version 3 should still support reading version 2 stores).
  • Forwards compatibility: New versions of the Zarr library should write the minimum possible version number.
  • Minor version numbers: The Zarr spec should specify versions with a string (e.g., "2.1") instead of a number. Minor version numbers indicate forwards compatible changes (e.g., the use of new optional features, such as dimension names). Older versions of the Zarr library should support reading newer files and simply ignore/preserve unknown fields. Issuing a warning would be appropriate.

Doing a little more searching, it appears that such a convention is actually widely used. E.g., see "Versioning Conventions" for ASDF and this page on "Designing File Formats" (it calls what I've described "major/minor" versioning).
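A sketch of how a reader could apply this major/minor policy; the version strings and supported range here are illustrative, not part of any spec:

```python
import warnings

# What this library understands: major version 2, up through minor version 1.
SUPPORTED_MAJOR = 2
SUPPORTED_MINOR = 1

def can_read(version: str) -> bool:
    major_s, _, minor_s = version.partition(".")
    major, minor = int(major_s), int(minor_s or 0)
    if major != SUPPORTED_MAJOR:
        return False  # major bump = breaking change: refuse to read
    if minor > SUPPORTED_MINOR:
        # Minor bump = forwards-compatible: read anyway, warn, and
        # ignore/preserve any unknown optional fields.
        warnings.warn(f"format {version} is newer than this library; "
                      "unknown optional features will be ignored")
    return True

print(can_read("2.1"), can_read("2.5"), can_read("3.0"))  # True True False
```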

shoyer avatar Jul 16 '18 22:07 shoyer

@DennisHeimbigner

NetCDF doesn't have explicit support for coordinates at all

I do not believe this is completely correct; there is no syntactic support, but if you look at the netcdf 3 and 4 specifications, it is part of the netcdf semantics.

In the netCDF spec I find "coordinates" only mentioned for netCDF4, specifically for the _Netcdf4Coordinates. That said, I don't really understand why this exists: netCDF's public APIs (e.g., nc_def_var) don't reference coordinates at all.

I see that internally, netCDF4 maintains its own notion of "dimension scales" that support more than 1 dimension (beyond what HDF5 supports), which it appears to use for variables if their first dimension matches the name of the variable: https://github.com/Unidata/netcdf-c/blob/7196dfd6064d778a9973797200d8e64c999d63c5/libsrc4/nc4var.c#L594-L598

Note that this definition of a multi-dimensional coordinate does not even match the typical interpretation of "coordinates" by software that reads netCDF files. Per CF Conventions, "coordinates" are defined merely by being referenced by a "coordinates" attribute on another variable, without any requirements on their name matching a dimension.

I'm getting a little off track here, but I think the conclusions we can draw from the netCDF4 experience for Zarr are:

  1. Dimension scales as implemented in HDF5 aren't even a particularly good fit for the netCDF data model, given the lengths to which netCDF4 had to go to adapt its data model to HDF5.
  2. We don't need to expose an explicit notion of coordinates (as understood by CF Conventions and xarray) for a netCDF-like API. This can be handled by downstream conventions.

shoyer avatar Jul 16 '18 23:07 shoyer

I stand corrected. One discussion of coordinate variables, as a convention, is here: https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_set_components.html#coordinate_variables. That reference says it is a convention, but the source code does take them into account. You are correct. We have long recognized that using dimension scales in the netcdf-4 code was probably a mistake, and it contorts the code in a number of places.

The multi-dimensional issue is complex because it is most often used with what amounts to point data (like trajectories), and others have noted that indexed array storage is not very efficient at handling point data: relational systems work much better. So the multi-dim coordinate variable may be a red herring with respect to this discussion.

DennisHeimbigner avatar Jul 17 '18 00:07 DennisHeimbigner

A couple of other things.

  1. I think Stephan's two points are correct as guiding principles.
  2. I think the simplest solution is to just implement simple named dimensions and allow variable definitions to reference them. Anonymous dimensions would continue to exist.

p.s. There is one other thing about named dimensions: if they can be defined in any group and referenced from a variable in some other group, then some kind of fully qualified name probably needs to be defined so that it is possible to unambiguously reference a dimension.
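The fully qualified naming mentioned here could look something like this sketch; the path syntax and the `groups` mapping are purely hypothetical:

```python
# A dimension is addressed by group path plus name, e.g. "/model/time",
# so a variable in any group can reference it unambiguously.
groups = {
    "/": {"dims": {}},
    "/model": {"dims": {"time": 100}},
}

def resolve_dimension(fqn):
    group_path, _, name = fqn.rpartition("/")
    group = groups[group_path or "/"]
    return group["dims"][name]

print(resolve_dimension("/model/time"))  # 100
```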

DennisHeimbigner avatar Jul 17 '18 01:07 DennisHeimbigner

To add my perspective: we're dealing with NetCDF files a lot, and having Zarr support NetCDF could simplify things. We already use .zattrs keys named after the NetCDF format.

I do like the simplicity in Zarr, and there is value in preserving that. As it looks like the functionality can easily be implemented as a wrapper on top of Zarr, I'd opt for that. This way only people interested in NetCDF get exposed to the added complexity.

sbalmer avatar Jul 17 '18 08:07 sbalmer

@shoyer thanks for the links and comments re compatibility, very helpful, I'll do some reading. Yes if there are other changes needed to the spec then it may be worth biting the bullet. For reference here are all the other issues I'm aware of that would require some kind of spec change: #267, #244, #216, #111. I'd like to mull this over for a bit. Allowing some mechanism of forwards compatibility seems very sensible in principle, but I'd like to work that all through to understand all the consequences for implementation. And this is not a counterargument, but just noting that .zattrs effectively provides a "preserve but ignore" mechanism already. And regardless of how dimensions ends up getting implemented, we may want some mechanism for standardising metadata conventions like units etc., i.e., things that are effectively optional modules that people may want to mix-and-match.

If we do end up with consensus for a spec change, it would be good to consult the z5py developers, I think they are close to a 1.0 release which implements the zarr v2 spec. And anyone else who is working on or considering an implementation.

On a side point, I would be happy for zarr not to implement HDF5 dimension scales. The complexities associated with keeping the bidirectional references in a consistent state seem hard to manage.

alimanfoo avatar Jul 17 '18 09:07 alimanfoo