numcodecs.js

VLenUtf8 Support

Open ilan-gold opened this issue 3 years ago • 7 comments

VlenUtf8 is a common codec for string arrays, and porting it should be relatively straightforward: https://github.com/zarr-developers/numcodecs/blob/2c1aff98e965c3c4747d9881d8b8d4aad91adb3a/numcodecs/vlen.pyx#L48-L178

I'm working on doing this for Vitessce, so if you're interested let me know!

ilan-gold avatar May 24 '21 17:05 ilan-gold

Hmmm, it seems that this is not a codec but a "filter." Does this belong in zarr.js then?

ilan-gold avatar May 24 '21 18:05 ilan-gold

Seems to work well: https://github.com/vitessce/vitessce/pull/948/files

ilan-gold avatar May 25 '21 13:05 ilan-gold

Can contribute if you're interested but not sure how you want to set up filters here/zarr.js

ilan-gold avatar May 25 '21 13:05 ilan-gold

I think it makes sense to add filters to numcodecs.js (that's where they live for zarr-python, and they implement the codecs interface). However, zarr.js currently doesn't support using filters. That alone should be straightforward to add (essentially decode a chunk and then run the decoded chunk through a filter codec); the real issue here is the "dtype" itself.

Zarr.js only supports (numeric) dtypes that have an analogous TypedArray. There are no variable-sized TypedArrays in JavaScript, so the decoded data would need to live in a plain JavaScript Array. Zarr.js relies on TypedArray APIs in both RawArray and NestedArray, so it would be tricky to add a dtype that isn't currently supported.
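
For illustration, the "decode a chunk, then run it through a filter codec" step could be sketched as below. The `Codec` interface, `encodeChunk` and `decodeChunk` are hypothetical names, not the zarr.js or numcodecs.js API. The sketch assumes filters are applied in order on encode and undone in reverse order on decode, which is how zarr-python behaves.

```typescript
// Hypothetical codec shape for illustration only.
interface Codec<Encoded, Decoded> {
  encode(data: Decoded): Encoded;
  decode(data: Encoded): Decoded;
}

// Loosely typed on purpose: a filter such as vlen-utf8 turns bytes into a
// plain Array, so the intermediate values are not always Uint8Array.
type AnyCodec = Codec<any, any>;

// Encode: run the filters in order, then compress.
function encodeChunk(
  data: unknown,
  compressor: AnyCodec | null,
  filters: AnyCodec[],
): Uint8Array {
  let out: unknown = data;
  for (const filter of filters) {
    out = filter.encode(out);
  }
  return (compressor ? compressor.encode(out) : out) as Uint8Array;
}

// Decode: undo the compressor first, then undo the filters in reverse order.
function decodeChunk(
  bytes: Uint8Array,
  compressor: AnyCodec | null,
  filters: AnyCodec[],
): unknown {
  let data: unknown = compressor ? compressor.decode(bytes) : bytes;
  for (const filter of [...filters].reverse()) {
    data = filter.decode(data);
  }
  return data;
}
```

The return type of `decodeChunk` is `unknown` because of the dtype problem described above: once a filter produces strings, the result no longer fits a TypedArray.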

manzt avatar May 26 '21 18:05 manzt

@manzt I'm not as familiar with this, so I defer to you here. Would it make sense to create a new typed array like StringArray? Or some sort of catch-all for unrecognized types? This is definitely out of my wheelhouse, so if you want to come up with a roadmap here, I can help fill in with PRs etc.

ilan-gold avatar May 26 '21 18:05 ilan-gold

Thank you for opening this issue. I'm looking into implementing this support for zarr.js. I had a call on this topic with the developer @gzuidhof.

I looked at some details on how this is done in the Python package numcodecs with vlenutf8 support and noticed the following things:

  • There are multiple ways to store strings in numpy arrays: bytes/ASCII strings use dtype 'S', unicode strings use dtype 'U', and objects use dtype 'O'.
  • In the vlenutf8 implementation, even fixed-length strings (e.g. dtype '<U3') are converted to objects. The same is true for lists of strings, which are also converted to numpy arrays of the object type.
  • Arrays in the numcodecs Python implementation are flattened to 1 dimension with reshape(-1). So even if you pass multidimensional arrays, the result will be 1-dimensional.
  • The resulting 1d object array does implement a buffer interface (similar to ArrayBuffer), but this interface is not used.
  • Instead, the encoding uses a header (documented as the "parquet" approach) that records the number of elements. Each element is stored as its number of bytes, followed by the UTF-8 encoded bytes.
  • In Python zarr, the codecs for variable-length types are implemented as filters.
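
As a rough illustration of the layout described in the list above (an element count in the header, then each element as a byte length followed by its UTF-8 bytes), a minimal TypeScript sketch might look like this. The function names are hypothetical, and the 4-byte little-endian width used for the header and for each length prefix is an assumption based on the Python implementation linked in this issue.

```typescript
// Assumption: the header and each length prefix are 4-byte little-endian
// unsigned integers, as in the Python implementation.
const HEADER_BYTES = 4;
const LENGTH_BYTES = 4;

function encodeVLenUtf8(items: string[]): Uint8Array {
  const encoder = new TextEncoder();
  const encoded = items.map((s) => encoder.encode(s));
  const total = encoded.reduce(
    (n, bytes) => n + LENGTH_BYTES + bytes.length,
    HEADER_BYTES,
  );
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  // Header: number of elements.
  view.setUint32(0, items.length, true);
  let pos = HEADER_BYTES;
  for (const bytes of encoded) {
    // Each element: byte length, then the UTF-8 bytes.
    view.setUint32(pos, bytes.length, true);
    pos += LENGTH_BYTES;
    out.set(bytes, pos);
    pos += bytes.length;
  }
  return out;
}

function decodeVLenUtf8(buf: Uint8Array): string[] {
  const decoder = new TextDecoder();
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const count = view.getUint32(0, true);
  const items: string[] = new Array(count);
  let pos = HEADER_BYTES;
  for (let i = 0; i < count; i++) {
    const len = view.getUint32(pos, true);
    pos += LENGTH_BYTES;
    items[i] = decoder.decode(buf.subarray(pos, pos + len));
    pos += len;
  }
  return items;
}
```

Note that the decoder returns a flat `string[]`, mirroring the 1-dimensional result of the Python codec; restoring the chunk shape would be left to the caller.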

Other things to note are:

  • Python object arrays also implement the buffer protocol, similar to the JavaScript ArrayBuffer. JavaScript only implements ArrayBuffer for numeric and byte-like types, which makes the interface (bytes in, bytes out) generic.
  • Numcodecs in Python also supports other buffer-to-object codecs (JSON, msgpack and pickle). In zarr-python these codecs are implemented as “object_codec”. The msgpack and JSON codecs might also be needed at some point. These would require an Array to be returned rather than a string array. Based on the above, I would suggest adding “Array” as an input type for the codecs. For vlenutf8 this would be implemented as an array of strings.

Assuming the implementation should follow the Python numcodecs implementation, I would suggest the following roadmap:

  • [ ] Extend the encoding output and decoding input for this codec, based on Vitessce’s implementation. I tested this separately and it works well.
  • [ ] Add “Array” as an input and output type to the decoder/encoder interface.
  • [ ] Implement filters support in zarr.js

We're glad to contribute to any of the subtasks.

SiggyF avatar May 10 '22 12:05 SiggyF

Following up on this: is this something the developers are still interested in? I might contribute to this one.

h-mayorquin avatar Apr 24 '24 16:04 h-mayorquin