
Supporting UTF-8 data type

Open jakirkham opened this issue 4 years ago • 12 comments

In today's discussion the need for UTF-8 came up. Thought we already had an issue for this, but am not finding it.

Would be useful to have UTF-8 support in the spec or as a high priority extension. Raising here to start the discussion about how we want to approach this.

cc @joshmoore @alimanfoo @shoyer @Carreau

jakirkham avatar Jul 01 '20 20:07 jakirkham

As I noted in the call, I think how HDF5 supports strings (including UTF-8) is pretty sane:

  • String data types always have an explicit encoding, which can be either ascii or utf8.
  • String data types are either fixed width (which refers to the number of bytes used in the encoded representation, not the number of unicode characters like in Python) or variable width

I'm not sure there's a real use for ascii these days (given that it's a strict subset of utf8), but there are certainly use cases for both fixed width and variable width utf8 strings.
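For reference, h5py exposes both flavors through its string_dtype helper (a quick sketch; the file and dataset names here are just illustrative):

import h5py
import numpy as np

# Variable-width UTF-8: length=None (the default) means variable length
vlen_utf8 = h5py.string_dtype(encoding="utf-8")
# Fixed-width UTF-8: length counts encoded bytes, not unicode characters
fixed_utf8 = h5py.string_dtype(encoding="utf-8", length=8)

with h5py.File("strings_demo.h5", "w") as f:
    f.create_dataset("variable", data=["Hi", "Hey"], dtype=vlen_utf8)
    f.create_dataset("fixed", data=np.array([b"Hi", b"Hey"]), dtype=fixed_utf8)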

shoyer avatar Jul 01 '20 22:07 shoyer

I think having a utf8 string type is very important for v3.

I would also be a strong proponent of a variable length utf-8, as most text data is variable length.

I am concerned by the current spec's use of fixed length utf-32, since it's an uncommon encoding with little support beyond numpy.

My ideal scenario would be to have the string extension spec essentially use Arrow's string type encoding specification, e.g. a string is a variable length list of bytes (docs on layout). This means the chunk would include multiple buffers, including an offset buffer and a data buffer. Arrow also includes information about validity for null values – which is nice but I'm not sure is necessary.
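As a rough sketch of what that layout looks like in plain NumPy (function names here are purely illustrative, not a spec proposal):

import numpy as np

def encode_strings(strings):
    # Arrow-style layout: one contiguous UTF-8 data buffer plus int32 offsets
    lengths = [len(s.encode("utf-8")) for s in strings]
    offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int32)
    data = np.frombuffer(b"".join(s.encode("utf-8") for s in strings), dtype=np.uint8)
    return offsets, data

def decode_strings(offsets, data):
    # Item i lives at data[offsets[i]:offsets[i + 1]]
    return [bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
            for i in range(len(offsets) - 1)]

offsets, data = encode_strings(["Hi", "Hey", "héllo"])
assert decode_strings(offsets, data) == ["Hi", "Hey", "héllo"]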

For expediency, it could make sense to include fixed length utf8 strings as an extension in zarr v3. I'm not sure I would update the AnnData formats to zarr v3 until variable length strings existed, since I'd rather not go back to the issues we had with fixed length strings. E.g. I would really like to kerchunk together arrays of labels, and labels vary widely in size.


@DennisHeimbigner, we briefly talked about this at the end of the last zarr call, though I hadn't had a chance to read the spec yet. You had mentioned varlength was proposed, but was that in an issue/PR?

ivirshup avatar Dec 06 '22 19:12 ivirshup

I agree --- I would also like to see variable length byte sequence and variable length Unicode code point sequence as data types.

I believe the existing fixed length string type extensions are definitely not intended to be part of the core spec. They were added to document the existing zarr v2 behavior, and haven't been reviewed very closely. While they don't seem terribly useful, I also don't think they are unreasonable to have as optional extensions.

jbms avatar Dec 06 '22 20:12 jbms

> I agree --- I would also like to see variable length byte sequence and variable length Unicode code point sequence as data types.

A point that is a little confusing to me right now is the distinction between "core", "extension", and "extension but on zarr-specs.readthedocs.io". Which of these were you thinking of for these types?

> I also don't think they are unreasonable to have as optional extensions.

I agree these aren't unreasonable by themselves. I think it might be bad if utf-32 were the only unicode representation for v3 on zarr-specs.

ivirshup avatar Dec 06 '22 20:12 ivirshup

I think we still have to sort out exactly how extensions and other additions of features in later spec versions will be specified in the metadata.

But I certainly agree that the utf-32 encoding is not very useful.

jbms avatar Dec 06 '22 22:12 jbms

I'd like to add my vote for adding support for variable-length strings in v3. We need this for supporting Zarr v3 in sgkit's VCF Zarr support (see https://github.com/sgkit-dev/bio2zarr/issues/254).

The way we are using it currently in v2 is the way that's recommended in the Zarr Tutorial:

>>> import numcodecs
>>> import zarr.v2 as zarr
>>> z = zarr.array(["Hi", "Hey"], dtype=object, object_codec=numcodecs.VLenUTF8())
>>> z
<zarr.v2.core.Array (2,) object>
>>> z[:]
array(['Hi', 'Hey'], dtype=object)

Perhaps Zarr v3 should take advantage of the new NumPy UTF-8 variable-width string dtype for this?
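For reference, that dtype looks like this (a quick sketch, assuming NumPy >= 2.0):

import numpy as np

# NEP 55 variable-width string dtype (NumPy >= 2.0)
arr = np.array(["Hi", "Hey", "héllo"], dtype=np.dtypes.StringDType())
print(arr.dtype)   # StringDType()
print(arr[2])      # héllo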

tomwhite avatar Jul 09 '24 11:07 tomwhite

I'm not too familiar with numpy string arrays but my impression is that an array of a variable-length type cannot use a contiguous memory buffer for the in-memory representation. As zarr-python v3 internal APIs are very much centered around contiguous memory buffers, this might be a challenge!

@normanrz do you have any insight into how variable length types would fit into the current chunk processing framework in zarr python v3?

d-v-b avatar Jul 09 '24 12:07 d-v-b

I think adding variable-length strings to zarr-python would take some work but is not impossible. The numpy-backed buffers are still quite flexible. We use them for handling the object dtype in v2 arrays as well. Other buffers might need more work.

normanrz avatar Jul 09 '24 12:07 normanrz

> Perhaps Zarr v3 should take advantage of the new NumPy UTF-8 variable-width string dtype for this?

I don't think this is much help for Zarr, because "string data are stored outside the array buffer" (see https://numpy.org/neps/nep-0055-string_dtype.html#serialization), i.e. the array just stores pointers to the actual string data.

A much better reference point would be Arrow string encoding, or more generally, Arrow variable sized binary layout. Variable-length types require at least two buffers: one to store the actual data and one to store offsets into the data where the items begin.
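To make that concrete, pyarrow exposes those buffers directly (a small sketch, assuming pyarrow is available):

import pyarrow as pa

arr = pa.array(["Hi", "Hey", "héllo"], type=pa.string())
validity, offsets, data = arr.buffers()  # validity may be None when there are no nulls
print(offsets.to_pybytes())  # four little-endian int32 offsets: 0, 2, 5, 11
print(data.to_pybytes())     # contiguous UTF-8 bytes for all three strings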

We already support all of this in Zarr V2 via numcodecs vlen codecs! https://numcodecs.readthedocs.io/en/stable/vlen.html

Shouldn't it be straightforward to adapt this approach to V3? The key will be to not rely on anything Python-specific (e.g. Python objects). Arrow points the way here.
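For example, the existing codec already round-trips UTF-8 data to and from plain bytes (a quick sketch):

import numcodecs
import numpy as np

codec = numcodecs.VLenUTF8()
buf = codec.encode(np.array(["Hi", "Hey", "héllo"], dtype=object))  # raw bytes: item count, then length-prefixed UTF-8 data
out = codec.decode(buf)                                             # back to an object array of str
print(out)  # ['Hi' 'Hey' 'héllo']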

rabernat avatar Jul 09 '24 13:07 rabernat

Just adding my +1 to @tomwhite's comment above. Strings are crucial for supporting genetic variation data, which there is an awful lot of, and Zarr would be amazing for. See our preprint for background and details.

jeromekelleher avatar Jul 09 '24 14:07 jeromekelleher

I think this issue needs a champion who wants to write a ZEP.

normanrz avatar Jul 10 '24 13:07 normanrz

Over at https://github.com/zarr-developers/zarr-python/pull/2031 I have a proof-of-concept that we can very easily support UTF-8 and variable length strings by leveraging Arrow encoding of string arrays. Would love some feedback on whether that approach seems promising.

rabernat avatar Jul 12 '24 22:07 rabernat