NCZarr - Netcdf Support for Zarr
I am moving the conversation about NCZarr to its own issue. See Issue https://github.com/zarr-developers/zarr/issues/317 for the initial part of this discussion.
Naming issue: I have about convinced myself that rather than creating KVP-level objects like .zdimensions, I should just use the existing Zarr attribute mechanism. In order to do this, it is necessary to set up some naming conventions for such attributes. Basically, we need to identify that an attribute is special (and probably hidden) and to which extension(s) it applies. For NCZarr, let me propose this:
- All such attributes start with two underscores
- Next is a 2-4 character tag specific to the extension: "NCZ" for NCZarr.
- Another underscore.
- The rest of the attribute name.
So, we might have "__NCZ_dimensions" instead of .zdimensions.
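For concreteness, here is a rough sketch of how such an attribute might be written and then filtered out with zarr-python; the attribute name follows the proposal above, but the value layout is purely illustrative:

```python
import zarr

# Rough sketch only: store NCZarr dimension metadata as a specially named
# attribute following the proposed "__NCZ_" prefix. The value layout
# (dimension name -> size) is illustrative, not a settled convention.
root = zarr.group()  # in-memory group
root.attrs["__NCZ_dimensions"] = {"time": 0, "lat": 180, "lon": 360}

# A netCDF-aware reader would recognize the "__NCZ_" prefix and hide such
# attributes from ordinary attribute listings.
ncz = {k: v for k, v in root.attrs.asdict().items() if k.startswith("__NCZ_")}
print(ncz)
```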
Thanks for opening this @DennisHeimbigner.
I encountered issue ( https://github.com/zarr-developers/zarr/issues/280 ) again recently, so I figured it might interest you given some of this discussion about how to manage specs, though each issue has its own place I think.
If we do go down the attribute road, agree that having some non-conflicting naming convention is important. The other option might be to widen the spec of things like .zarray to allow specs subclassing Zarr's spec to add additional relevant content here, as others have mentioned. A third option, similar to what you have done, would be to add something like .zsubspec, which users can fill as needed. We might require certain keys in there, like subspec name, subspec version, etc., but otherwise leave the contents to users.
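A purely hypothetical sketch of that third option, just to make it concrete (all key names here are made up, not an agreed spec):

```python
import json

# Hypothetical sketch of a ".zsubspec" object stored alongside .zarray/.zgroup.
# None of these key names are standardized; they only illustrate the idea of a
# named, versioned subspec with room for extension-specific content.
zsubspec = {
    "subspec_name": "NCZarr",
    "subspec_version": "1.0.0",
    "content": {
        "dimensions": {"time": 0, "lat": 180, "lon": 360},
    },
}
print(json.dumps(zsubspec, indent=2))
```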
Thanks @DennisHeimbigner.
Just to add that, on the question of whether to pack everything into attributes (.zattrs) or whether to store metadata separately under other store-level keys (.zdims, .ztypdefs, etc.), I think both are reasonable and personally I have no objection to either.
I lean slightly towards using attributes (.zattrs) because it plays nicely with some existing API features. E.g., the NCZ metadata can be accessed directly via the attributes API. And, e.g., the NCZ metadata would get included if using consolidated metadata, which is an experimental approach to optimising cloud access, available in the next release of Zarr Python. But neither of these are blockers to the alternative approach, because it is straightforward to read and decode JSON objects directly from a store, and it would also be straightforward to modify the consolidated metadata code to include other objects.
We have learned from the existing netcdf-4 that datasets exist with very large (~14gb) metadata. I was looking at the Amazon S3 query capabilities and they are extremely limited. So the idea of consolidated metadata seems like a very good idea. This reference: https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata does not provide any details of the form of the proposed consolidated metadata. Note that there may not be any point in storing all of the metadata, especially if lazy reading of metadata is being used (as it is in the netcdf-4 over hdf5 implementation). Rather, I think that what is needed is just a skeleton so that querying is never needed: we would consolidate the names and kinds (group, variable, dimension, etc.) and leave out e.g. attributes and variable types and shapes.
Here is a proposed consolidated metadata structure for NCZarr. It would be overkill for standard Zarr, which is simpler. Sorry if it is a bit opaque since it is a partial Antlr grammar. nczmetadata.txt
We have learned from the existing netcdf-4 that datasets exist with very large (~14gb) metadata.
Wow, that's big. I think anything near that size will be very sub-optimal in zarr, because of metadata being stored as uncompressed JSON documents. I wonder if in cases like that, it might be necessary to examine what is being stored as metadata, and if any largish arrays are included then consider storing them as arrays rather than as attributes.
I was looking at the Amazon S3 query capabilities and they are extremely limited. So the idea of consolidated metadata seems like a very good idea. This reference: https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata does not provide any details of the form of the proposed consolidated metadata.
Apologies, the format is not documented as yet. There's an example here:
https://github.com/zarr-developers/zarr/pull/268#issuecomment-435621394
That was a typo. The correct size is 14 mb.
That was a typo. The correct size is 14 mb.
Ah, OK! Although 14MB is still pretty big, it's probably not unmanageable.
Depends on what manageable means, I suppose. We have situations where projects are trying to load a small part of the metadata from thousands of files, each of which has that amount of metadata. Needless to say, this is currently very slow. We are trying various kinds of optimizations around lazy loading of metadata, but the limiting factor will be HDF5. A similar situation is eventually going to occur here, so thinking about various optimizations is important.
Depends on what manageable means, I suppose. We have situations where projects are trying to load a small part of the metadata from thousands of files, each of which has that amount of metadata. Needless to say, this is currently very slow. We are trying various kinds of optimizations around lazy loading of metadata, but the limiting factor will be HDF5. A similar situation is eventually going to occur here, so thinking about various optimizations is important.
That's helpful to know.
FWIW the consolidated metadata feature currently in zarr python was developed for the xarray use case, where the need (as I understand it) is to load all metadata up front. So that feature combines the content from all .zarray, .zgroup and .zattrs objects from the entire group and dataset hierarchy into a single object, which can then be read from object storage in a single HTTP request.
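For illustration, a minimal sketch of that feature with zarr-python, assuming a version that ships zarr.consolidate_metadata and zarr.open_consolidated (a plain dict stands in for an object store here, and the array name is arbitrary):

```python
import json
import zarr

# Minimal sketch of consolidated metadata. A plain dict stands in for an
# object store; the array name and shape are arbitrary.
store = {}
root = zarr.group(store=store, overwrite=True)
root.create_dataset("temperature", shape=(100, 100), chunks=(10, 10), dtype="f4")

# Combine every .zgroup/.zarray/.zattrs object into a single ".zmetadata" key.
zarr.consolidate_metadata(store)
print(json.loads(store[".zmetadata"])["zarr_consolidated_format"])

# Readers can then discover the whole hierarchy from a single request.
root2 = zarr.open_consolidated(store)
print(root2["temperature"].shape)
```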
If you have use cases where you have a large amount of metadata but only need to read parts of it at a time, that obviously might not be optimal. However, 14MB is not an unreasonable amount to load from object storage, would probably be fine to do interactively (IIRC bandwidth to object storage from compute nodes within the same cloud is usually ~100MB/s).
I'm sure there would be other approaches that could be taken too that could support partial/lazy loading of metadata. Happy to discuss at any point.
Are you able to provide data on where most of the time is being spent, @DennisHeimbigner?
Issue: Attribute Typing. I forgot to address one important difference between the netcdf-4 model and Zarr: attribute typing. In netcdf-4, attributes have a defined type. In Zarr, attributes are technically untyped, although in some cases it is possible to infer a type from the value of the attribute.
This is most important with respect to the _FillValue attribute for a variable. There is an implied constraint (in netcdf-4 anyway) that the type of the attribute must be the same as the type of the corresponding variable. There is no way to guarantee this for Zarr except by doing inference.
Additionally, if the variable is of a structured type, there is currently no standardized way to define the fill value for such a type nor is there a way to use structured types with other, non-fillvalue, attributes.
Sadly, this means that NCZarr must add yet another attribute that specifies the types of other attributes associated with a group or variable.
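To make that concrete, a hypothetical sketch; the attribute name "__NCZ_attr_types" and the type spellings below are made up for illustration, not an actual NCZarr convention:

```python
import zarr

# Hypothetical sketch only: record the netCDF type of each attribute in a
# companion attribute on the same variable.
root = zarr.group()
var = root.create_dataset("temperature", shape=(10,), chunks=(10,), dtype="f4")
var.attrs["valid_min"] = -50.0
var.attrs["valid_max"] = 60.0
var.attrs["__NCZ_attr_types"] = {"valid_min": "float", "valid_max": "float"}
```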
Hi @DennisHeimbigner,
Regarding the fill value specifically, the standard metadata for a zarr array includes a fill_value key. There are also rules about how to encode fill values to deal with values that do not have a natural representation in JSON. This includes fill values for arrays with a structured dtype. If possible, I would suggest to use this feature of standard array metadata, rather than adding a separate _FillValue attribute. If not, please do let us know what's missing, that would be an important piece of information to carry forward when considering spec changes.
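For example, something along these lines with zarr-python (a plain dict stands in for a store here):

```python
import json
import zarr

# Minimal sketch: the fill value lives in the standard array metadata
# (the "fill_value" key of .zarray), not in a separate attribute.
store = {}
z = zarr.open(store=store, mode="w", shape=(10,), chunks=(5,),
              dtype="f8", fill_value=-9999.0)
print(json.loads(store[".zarray"])["fill_value"])  # -9999.0
```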
Regarding attributes in general, we haven't tried to standardise any method to encode values that do not have a natural JSON representation. Currently it is left to the application developer to decide their own method for encoding and decoding values as JSON, e.g., I believe xarray has some logic for encoding values in zarr attributes. There has also been some discussion of this at #354 and #156.
Ultimately it would be good to standardise some conventions (or at least define some best practices) for representing various common value types in JSON, such as typed arrays. I'm more than happy for the community to lead on that.
This reference -- https://zarr.readthedocs.io/en/stable/spec/v2.html#fill-value-encoding -- does not appear to address fill values for structured types. Did you get the reference wrong?
If an array has a fixed length byte string data type (e.g., "|S12"), or a structured data type, and if the fill value is not null, then the fill value MUST be encoded as an ASCII string using the standard Base64 alphabet.
I.e., use base 64 encoding.
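For what it's worth, a small sketch of what that rule implies for a structured dtype, using numpy to produce the raw bytes:

```python
import base64
import numpy as np

# Sketch of the v2 rule quoted above: for a structured dtype, the fill value
# is the raw bytes of the scalar, base64-encoded as an ASCII string.
dt = np.dtype([("x", "<i4"), ("y", "<f8")])
fill = np.zeros((), dtype=dt)  # e.g. (0, 0.0)
encoded = base64.standard_b64encode(fill.tobytes()).decode("ascii")
print(encoded)  # this string would go into the "fill_value" key of .zarray
```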
So it would be nice if we had a defined language-independent algorithm that defines how to construct the fill value for all possible struct types (including recursion for nested structs). This should be pretty straightforward. Also, why force a string (base64) encoding? Why not make the fill value be just another JSON structure? It worries me how Python-specific much of the spec around types is.
So it would be nice if we had a defined language-independent algorithm that defines how to construct the fill value for all possible struct types (including recursion for nested structs). This should be pretty straightforward
That would be good. I believe numpy mimics C structs, further info here.
Looking again at the numpy docs, there is support for an align keyword when constructing a structured dtype, which changes the itemsize and memory layout. This hasn't been accounted for in the zarr spec, I suspect that things are currently broken if someone specifies align=True (default is False).
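A quick way to see the effect (the itemsize changes, so the byte layout implied by the dtype differs):

```python
import numpy as np

# The align flag changes padding and hence the itemsize implied by the dtype.
dt_packed = np.dtype([("a", "i1"), ("b", "<f8")])
dt_aligned = np.dtype([("a", "i1"), ("b", "<f8")], align=True)
print(dt_packed.itemsize, dt_aligned.itemsize)  # typically 9 vs 16
```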
Also, why force a string (base64) encoding? Why not make the fill value be just another JSON structure?
That's a nice idea, would fit with the design principle that metadata are human-readable/editable.
It worries me how Python-specific much of the spec around types is.
The zarr spec does currently defer to numpy as much as possible, assuming that much of the hard thinking around things like types has been done there already.
If there are clarifications that we could make to the v2 spec that would help people develop compatible implementations in other languages then I'd welcome suggestions.
Thinking further ahead to the next iteration on the spec, it would obviously be good to be as platform-agnostic as possible, however it would also be good to build on existing work rather than do any reinvention. The work on ndtypes may be relevant/helpful there.
Surfacing here notes on the NetCDF NCZarr implementation, thanks @DennisHeimbigner for sharing.
Also relevant here, documentation of xarray zarr encoding conventions, thanks @rabernat.
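For readers not familiar with those conventions, the core of it is a per-array "_ARRAY_DIMENSIONS" attribute naming the dimensions, roughly like this (array name and sizes are arbitrary):

```python
import zarr

# Rough sketch of the xarray encoding convention linked above: each array
# carries an "_ARRAY_DIMENSIONS" attribute listing its dimension names.
root = zarr.group()
t = root.create_dataset("temperature", shape=(12, 180, 360),
                        chunks=(1, 180, 360), dtype="f4")
t.attrs["_ARRAY_DIMENSIONS"] = ["time", "lat", "lon"]
```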
@DennisHeimbigner: It looks like Unidata's Netcdf C library can now read data with the xarray zarr encoding conventions, right?
@rabernat, should I raise an issue for xarray to also support the Unidata NcZarr conventions?
The ability to read the xarray conventions is in the main branch, and will be in the upcoming 4.8.1 release. I am shaving the yak to get our automated regression and integration test infrastructure back up and running, but we hope to have 4.8.1 out shortly.
@rabernat, should I raise an issue for xarray to also support the Unidata NcZarr conventions?
I see this as very difficult. The reason is that the ncZarr conventions use files outside of the zarr hierarchy. We would probably need to implement a compatibility layer as a third-party package, similar to h5netcdf.
p.s. but yes, please open an xarray issue to keep track of it.
One thing I'll note on Xarray's convention for Zarr is that we will likely change things in the near future to always write and expect "consolidated metadata" (see https://github.com/pydata/xarray/issues/5251). This is almost completely backwards compatible, but if NcZarr writes these consolidated metadata fields in Xarray compat mode, we could load these Zarr stores much more quickly in Xarray.
Consolidated metadata would probably be a nice feature for NcZarr, too, because it reduces the number of files that need to be queried for metadata down to only one. I think there was a similar intent behind the .nczgroup JSON field. Consolidated metadata is sort of a super-charged version of that.
NCZarr gets a similar improvement by doing lazy reads of metadata objects. That is one problem with _ARRAY_DIMENSIONS -- it requires us to read all attributes even if they are otherwise unneeded. NCZarr avoids this by keeping the dimension names separate. As for consolidated metadata, I assume you are NOT saying that any pure zarr container that does not contain the consolidated metadata will be unreadable by Xarray.
NCZarr gets a similar improvement by doing lazy reads of metadata objects. That is one problem with _ARRAY_DIMENSIONS -- it requires us to read all attributes even if they are otherwise unneeded. NCZarr avoids this by keeping the dimension names separate.
In Xarray, we have to read nearly all the metadata eagerly to instantiate xarray.Dataset objects.
As for consolidated metadata, I assume you are NOT saying that any pure zarr container that does not contain the consolidated metadata will be unreadable by Xarray.
This is correct, you don't need to write consolidated metadata. But if you do, Xarray will be able to read the data much faster.
As for whether netCDF users would notice a difference with consolidated metadata, I guess it would depend on their use-cases. Lazy metadata reads are great, but for sure it is faster to download a single small file than to download multiple files in a way that cannot be fully parallelized, even if they add up to the same total size.
faster to download a single small file than to download multiple files
true, but we have use cases where the client code is walking a large set of netcdf files and reading a few pieces of information out of each of them and where the total metadata is large (14 megabytes). This can occur when one has a large collection of netcdf files covering some time period and each netcdf file is a time slice (or slices). Perhaps Rich Signell would like to comment with his experience.
https://github.com/zarr-developers/zarr-specs/issues/41#issuecomment-833024978 I see this as very difficult. The reason is that the ncZarr conventions use files outside of the zarr hierarchy. We would probably need to implement a compatibility layer as a third-party package, similar to h5netcdf.
For what it's worth, I could see making some movement (June-ish?) on https://github.com/zarr-developers/zarr-specs/issues/112#issuecomment-825690209 to permit the additional files. But either way, certainly https://github.com/ome/ngff/pull/46#pullrequestreview-652899174 (related issue) would suggest hammering out a plan for this difference before another package introduces a convention.
https://github.com/zarr-developers/zarr-specs/issues/41#issuecomment-833036094 One thing I'll note on Xarray's convention for Zarr is that we will likely change things in the near future to always write and expect "consolidated metadata" (see pydata/xarray#5251). This is almost completely backwards compatible, but if NcZarr writes these consolidated metadata fields in Xarray compat mode, we could load these Zarr stores much more quickly in Xarray.
Having gone through https://github.com/pydata/xarray/issues/5251 I'm slightly less worried about this than when I first read it (I had assumed it meant Xarray would only support consolidated metadata), but having just spent close to 2 months trying to get dimension_separator "standardized", I'd like to raise a flag that consolidated metadata is a similar gray area. It'd be nice to get it nailed down.
@DennisHeimbigner, just a quick comment that I too always use consolidated metadata when writing Zarr. Here's a recent example with coastal ocean model output we are publishing, where consolidated metadata is an order of magnitude faster to open.
Note that the issue for me is: for what use-cases is lazy metadata download better than consolidated metadata? The latter is better in the cases where you know that you need to access almost all of the metadata or where the total size of the metadata is below some (currently unknown) size threshold. My speculation is that the access patterns vary all over the place and are highly domain dependent. I infer that Rich's use case is one where all the metadata is going to be accessed.
In any case, once .zmetadata is well-defined (see Josh's previous comment) I will be adding it to nczarr. However, we will probably give the user the choice to use it or not if lazy download makes more sense for their use-case.
On the other side, it seems to me that zarr-python might profitably explore lazy download of the metadata.
Note that the issue for me is: for what use-cases is lazy metadata download better than consolidated metadata? The latter is better in the cases where you know that you need to access almost all of the metadata or where the total size of the metadata is below some (currently unknown) size threshold. My speculation is that the access patterns vary all over the place and are highly domain dependent. I infer that Rich's use case is one where all the metadata is going to be accessed.
Agree! I'm sure there are cases where using consolidated metadata is not a great idea, though my guess is that they are relatively rare.
In any case, once .zmetadata is well-defined (see Josh's previous comment) I will be adding it to nczarr. However, we will probably give the user the choice to use it or not if lazy download makes more sense for their use-case.
Sounds great, thanks!
On the other side, it seems to me that zarr-python might profitably explore lazy download of the metadata.
As I understand it, this is already the case in Zarr-Python (if not using consolidated metadata). It's just that lazy metadata does not work for Xarray.
In particular, I think there is definitely a place for including an explicit "index" of arrays in a group that doesn't require potentially expensive directory listing. Hopefully this is already in the draft v3 spec (I haven't checked).