v2: clarify that unicode uses utf-32 encoding

Open constantinpape opened this issue 4 years ago • 8 comments

The data type encoding section does not specify the unicode encoding. It appears to use UTF-32 (inherited from numpy unicode datatypes). Unfortunately, this does not seem to be clearly documented in the numpy dtype description either, but inspection of the serialized data shows that it's UTF-32 encoded:

import numpy
print(numpy.dtype("U1").itemsize)  # prints 4

(and I have also validated by decoding zarr unicode chunks manually).

For supporting zarr unicode data in other languages this information is important, so it should be stated more explicitly in the spec.
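As a rough illustration of what decoding such data by hand could look like, here is a minimal sketch. It works on the raw bytes of a numpy array rather than on an actual zarr chunk, so it assumes the chunk has already been decompressed and that the dtype is little-endian (`<U3`). The helper name `decode_fixed_utf32` is made up for this example and is not part of numpy or zarr.

```python
import numpy as np

# Fixed-length unicode array: "<U3" = little-endian, 3 code points per item.
arr = np.array(["abc", "é", "日本"], dtype="<U3")

# Raw buffer as numpy lays it out: 4 bytes per code point, NUL-padded.
raw = arr.tobytes()


def decode_fixed_utf32(buf, n_chars, byteorder="<"):
    """Hypothetical helper: split a buffer into fixed-size items and decode each as UTF-32."""
    codec = "utf-32-le" if byteorder == "<" else "utf-32-be"
    item_size = 4 * n_chars  # each code point takes exactly 4 bytes
    items = []
    for start in range(0, len(buf), item_size):
        text = buf[start:start + item_size].decode(codec)
        # Shorter strings are padded with NUL code points; drop the padding.
        items.append(text.rstrip("\x00"))
    return items


decoded = decode_fixed_utf32(raw, n_chars=3)
print(decoded)
```

An implementation in another language would need the same two pieces of information: the byte order from the type string and the fact that every code point occupies four bytes.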

constantinpape avatar Oct 01 '21 09:10 constantinpape

Ran into this today... glad to find this issue!

manzt avatar Jan 18 '22 00:01 manzt

Is this a spec issue or a python impl issue? I would assume the spec specifies utf-8.

DennisHeimbigner avatar Feb 10 '22 20:02 DennisHeimbigner

@DennisHeimbigner the spec only states the following:

Simple data types are encoded within the array metadata as a string, following the [NumPy array protocol type string (typestr) format](https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface). The format consists of 3 parts:
.... 
"U": unicode (fixed-length sequence of Py_UNICODE)
...

and numpy uses UTF-32 encoding, see the code snippet above. By transitivity the spec currently uses utf-32 (but quite implicitly).
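To make the implicit rule concrete, the following sketch derives the byte layout from a `U` type string. It assumes the type string carries an explicit `<` or `>` byte order and that every code point is stored as 4 bytes, as observed with numpy. The function name `unicode_typestr_layout` is hypothetical and only serves this illustration.

```python
import re

import numpy as np


def unicode_typestr_layout(typestr):
    """Hypothetical helper: map a "U" typestr to (codec, n_chars, item size in bytes)."""
    match = re.fullmatch(r"([<>])U(\d+)", typestr)
    if match is None:
        raise ValueError(f"not a unicode typestr with explicit byte order: {typestr!r}")
    byteorder, n_chars = match.group(1), int(match.group(2))
    codec = "utf-32-le" if byteorder == "<" else "utf-32-be"
    # Assumption taken from numpy's behaviour: 4 bytes per code point.
    return codec, n_chars, 4 * n_chars


layout = unicode_typestr_layout("<U10")
print(layout)

# Cross-check the computed item size against numpy itself.
assert layout[2] == np.dtype("<U10").itemsize
```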

constantinpape avatar Feb 10 '22 20:02 constantinpape

I see. Should this be changed to be explicitly utf-8? Or perhaps, to include non-utf8 encodings, should a string be defined as a sequence of 8-bit bytes?

DennisHeimbigner avatar Feb 10 '22 20:02 DennisHeimbigner

> I see. Should this be changed to be explicitly utf-8? Or perhaps, to include non-utf8 encodings, should a string be defined as a sequence of 8-bit bytes?

I don't think it can be changed to utf-8 in zarr spec v2; to be compatible with numpy it has to be utf-32 (and this is explicitly used as reference in the zarr-v2 spec). So for v2 I would just explicitly state that it's utf-32.

I am not up to date on the current plans for v3, but it might be a good idea to decouple the data type encoding from numpy there, and to change to UTF-8 by default and/or allow the encoding to be specified.

constantinpape avatar Feb 10 '22 20:02 constantinpape

I wonder how many non-python zarr implementations adhere to using utf-32?

DennisHeimbigner avatar Feb 10 '22 20:02 DennisHeimbigner

Not sure how many other implementations even support it? A fixed-length sequence of utf-32-encoded code points seems unlikely to be particularly useful as a data type.

jbms avatar Feb 22 '22 07:02 jbms

https://github.com/zarr-developers/zarr-specs/pull/135/files#diff-6b08b9e843756eb493a5d6ad9817cb5aea38e09d80d1b84ddac2c5f3e37a246dR69 is the likely location for addressing this in v3. On the v2 front, @constantinpape, I assume our best next step is to get a simple test in zarr_implementations? If basically all implementations vary, it will be tricky to specify anything in the v2 spec.

joshmoore avatar Mar 07 '22 22:03 joshmoore