zarr-specs

Use Arrow C data interface format strings?

Open alimanfoo opened this issue 4 years ago • 8 comments

Reading the Arrow C data interface, I'm wondering if we should consider following any of the approach described there for the zarr v3.0 core protocol spec.

In particular, the format strings for core data types may be easier to handle than the currently used numpy-style format strings, although unfortunately they have no concept of endianness.
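
To make the comparison concrete, here is a minimal sketch that maps the Arrow C data interface format strings for fixed-width integers and floats onto numpy-style type strings. The table name `ARROW_PRIMITIVES` and the helper `arrow_to_numpy_typestr` are hypothetical, made up for illustration; neither Arrow nor Zarr defines them. Because the Arrow strings carry no byte order, the caller has to supply one.

```python
# Arrow C data interface format strings for fixed-width numeric types,
# mapped to (numpy-style kind character, item size in bytes).
# The table and helper below are illustrative, not part of any library.
ARROW_PRIMITIVES = {
    "c": ("i", 1), "C": ("u", 1),   # int8 / uint8
    "s": ("i", 2), "S": ("u", 2),   # int16 / uint16
    "i": ("i", 4), "I": ("u", 4),   # int32 / uint32
    "l": ("i", 8), "L": ("u", 8),   # int64 / uint64
    "e": ("f", 2),                  # float16
    "f": ("f", 4),                  # float32
    "g": ("f", 8),                  # float64
}


def arrow_to_numpy_typestr(fmt, byteorder="<"):
    """Return a numpy-style type string such as '<i4' for an Arrow format string.

    byteorder must be given separately ('<' or '>') because the Arrow
    format string does not encode it.
    """
    kind, size = ARROW_PRIMITIVES[fmt]
    # Single-byte types have no byte order; numpy writes that as '|'.
    prefix = "|" if size == 1 else byteorder
    return f"{prefix}{kind}{size}"


print(arrow_to_numpy_typestr("i"))        # int32, little-endian
print(arrow_to_numpy_typestr("g", ">"))   # float64, big-endian
print(arrow_to_numpy_typestr("C"))        # uint8, no byte order
```

The one-character Arrow codes are compact, but the extra `byteorder` argument shows what would have to be stored elsewhere in the metadata if they were adopted.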

alimanfoo avatar Apr 28 '20 07:04 alimanfoo

cc: @DennisHeimbigner

joshmoore avatar Apr 28 '20 09:04 joshmoore

I'm wondering if we should consider following any of the approach described there

I don't see any mention of endianness in the Arrow document, though it seems like we have both in the current spec. Do you know how common it is to have different endianness in the same Zarr?

Carreau avatar May 05 '20 18:05 Carreau

Endianness is important given data may be produced on one system and consumed on another. I don't know of any cases where endianness differs between different arrays in the same hierarchy; that would probably be rare. But it would be nice to include this information within the array metadata.
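
As a small illustration of why the byte order needs to be recorded, the sketch below uses only Python's standard `struct` module to write a 32-bit integer in one byte order and read it back in the other. It is not taken from the Zarr or Arrow specs.

```python
import struct

value = 1

# The same 32-bit integer serialized in both byte orders.
little = struct.pack("<i", value)
big = struct.pack(">i", value)

# A reader that assumes the wrong byte order gets a different number.
misread = struct.unpack(">i", little)[0]
# A reader that knows the byte order recovers the original value.
correct = struct.unpack("<i", little)[0]

print(little.hex(), big.hex())  # 01000000 00000001
print(misread, correct)         # 16777216 1
```

Without a byte-order field in the array metadata, a consumer has no way to tell which of the two interpretations is intended.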

alimanfoo avatar May 06 '20 12:05 alimanfoo

As you know, both HDF5 and netcdf support setting endianness on a per-array basis. Practically speaking, I have never seen a netcdf file in which the endianness differed across arrays. Since I tend to support less complexity, I would think that specifying an endianness for the whole file only would be the way to go.

DennisHeimbigner avatar May 06 '20 15:05 DennisHeimbigner

WRT format strings: the binary and large binary distinction seems odd to me. I assume it is trying to provide information about how to store the binary string. I would have said that the distinction is arbitrary and should be left to the implementation to decide. I also note that fixed-width binary is presumably the same as the HDF5 opaque type. We have found that this type has very little use as such, and users tend to use arrays of uint8 for this.

DennisHeimbigner avatar May 06 '20 15:05 DennisHeimbigner

WRT the time types, that has been an issue with netcdf for a while. Currently, times are stored as integers or strings, and an attribute is used to specify their semantics.
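
A minimal sketch of that pattern, assuming a `units` attribute of the form "<step> since <YYYY-MM-DD>" (the exact attribute name and format are assumptions here, as is the helper name `decode_times`), could look like this:

```python
from datetime import datetime, timedelta, timezone


def decode_times(values, units):
    """Turn integer offsets into datetimes using a units string.

    Assumes units looks like 'seconds since 1970-01-01', where the step
    is a keyword accepted by timedelta (days, hours, minutes, seconds).
    """
    step, _, epoch = units.partition(" since ")
    base = datetime.strptime(epoch, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return [base + timedelta(**{step: v}) for v in values]


# The stored array holds plain integers; the attribute gives them meaning.
stored = [0, 86400]
for t in decode_times(stored, "seconds since 1970-01-01"):
    print(t.isoformat())
```

The array itself needs only an integer data type; everything that makes the values times lives in the attribute.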

DennisHeimbigner avatar May 06 '20 15:05 DennisHeimbigner

Just as a note: this was also discussed in issue #131, and the data type names of the v3 spec were updated in PR #155 to be uint32, float64, etc. Date- and time-related data types are not part of the v3 core at the moment but are added as an extension.

jstriebel avatar Nov 16 '22 17:11 jstriebel

Also related to issue ( https://github.com/zarr-developers/numcodecs/issues/227 )

jakirkham avatar Nov 18 '22 09:11 jakirkham