zarr-specs
Use Arrow C data interface format strings?
Reading the Arrow C data interface, I'm wondering if we should consider following any of the approach described there for the Zarr v3.0 core protocol spec.
In particular, the format strings for core data types may be easier to handle than the currently used numpy-style format strings. Unfortunately, the Arrow format strings have no concept of endianness.
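To make the comparison concrete, here is a minimal sketch that maps the single-character Arrow format strings for fixed-width numeric types to numpy-style type strings. The helper name `arrow_to_numpy_typestr` and its `byteorder` parameter are made up for illustration; the byte order has to be passed separately because the Arrow format string does not carry it.

```python
# Arrow C data interface format characters for fixed-width numeric types,
# mapped to (numpy kind character, item size in bytes).
ARROW_NUMERIC = {
    "c": ("i", 1), "C": ("u", 1),  # int8, uint8
    "s": ("i", 2), "S": ("u", 2),  # int16, uint16
    "i": ("i", 4), "I": ("u", 4),  # int32, uint32
    "l": ("i", 8), "L": ("u", 8),  # int64, uint64
    "e": ("f", 2), "f": ("f", 4), "g": ("f", 8),  # float16/32/64
}


def arrow_to_numpy_typestr(fmt, byteorder="<"):
    """Hypothetical helper: build a numpy-style type string such as '<u4'.

    The byte order is an explicit argument because the Arrow format
    string itself says nothing about endianness.
    """
    if byteorder not in ("<", ">"):
        raise ValueError("byteorder must be '<' or '>'")
    kind, size = ARROW_NUMERIC[fmt]
    # Single-byte types have no byte order; numpy marks them with '|'.
    prefix = "|" if size == 1 else byteorder
    return f"{prefix}{kind}{size}"


print(arrow_to_numpy_typestr("I"))       # uint32, little-endian
print(arrow_to_numpy_typestr("g", ">"))  # float64, big-endian
```

The Arrow side is a short, fixed alphabet, while the numpy-style string bundles byte order, kind, and size into one token that a reader has to parse.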
cc: @DennisHeimbigner
I'm wondering if we should consider following any of the approach described there
I don't see any mention of endianness in the Arrow document, though it seems like we have both in the current spec. Do you know how common it is to have different endianness in the same Zarr?
Endianness is important given that data may be produced on one system and consumed on another. I don't know of any cases where endianness differs between arrays in the same hierarchy; that would probably be rare. But it would be nice to include this information within the array metadata.
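As a small illustration of why the byte order needs to be recorded somewhere, the sketch below uses only Python's standard `struct` module to write a 32-bit unsigned integer as little-endian and then read the same bytes back under both byte orders. It is only meant to show the failure mode, not how any Zarr implementation handles it.

```python
import struct

# A producer on a little-endian system writes the uint32 value 1.
raw = struct.pack("<I", 1)

# A consumer that knows the byte order recovers the value.
as_little = struct.unpack("<I", raw)[0]

# A consumer that assumes big-endian silently gets a different number.
as_big = struct.unpack(">I", raw)[0]

print(raw, as_little, as_big)
```

Nothing in the raw bytes says which reading is correct, so the metadata has to.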
As you know, both HDF5 and netcdf support setting endianness on a per-array basis. Practically speaking, I have never seen a netcdf file in which the endianness differed across arrays. Since I tend to favor less complexity, I would think that specifying a single endianness for the whole file would be the way to go.
WRT format strings: the distinction between binary and large binary seems odd to me. I assume it is trying to provide information about how to store the binary string. I would have said that the distinction is arbitrary and should be left to the implementation to decide. I also note that fixed-width binary is presumably the same as the HDF5 opaque type. We have found that this type has very little use as such, and users tend to use arrays of uint8 for this.
WRT the time types, that has been an issue with netcdf for a while. Currently, times are stored as integers or strings, and an attribute is used to specify their semantics.
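A rough sketch of that pattern, assuming a CF-style `units` attribute of the form "<unit> since <epoch>" (the helper name `decode_times` and the sample values are invented for illustration, and only a few units are handled):

```python
from datetime import datetime, timedelta


def decode_times(values, units):
    """Hypothetical helper: decode integer offsets using a units attribute.

    Assumes `units` looks like 'days since 1970-01-01' and that the unit
    is one of days, hours, minutes, or seconds.
    """
    unit, _, epoch = units.partition(" since ")
    if unit not in ("days", "hours", "minutes", "seconds"):
        raise ValueError(f"unsupported unit: {unit!r}")
    start = datetime.fromisoformat(epoch)
    return [start + timedelta(**{unit: v}) for v in values]


# The stored array is plain integers; the attribute carries the meaning.
decoded = decode_times([0, 1, 365], "days since 1970-01-01")
print(decoded)
```

The stored data type stays a plain integer, which is why the semantics live outside the type system.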
Just as a note: this was also discussed in issue #131, and the data type names of the v3 spec were updated in PR #155 to be uint32, float64, etc. Date- and time-related data types are not part of the v3 core at the moment but are added as an extension.
Also related to issue ( https://github.com/zarr-developers/numcodecs/issues/227 )