numcodecs can VLenArray support 2D arrays

Right now I think the VLenArray only supports 1-D arrays. What would it take to extend support to 2-D arrays too? I have a list of 2D numpy arrays that I'd like to save as a zarr file using something like

zarr.array( 'foo', foo, dtype='array:f8')

where foo is my list of 2D numpy arrays (they are all NxD where D is fixed across the list but N is variable), but I currently get an error message that ends with

  File "numcodecs/vlen.pyx", line 382, in numcodecs.vlen.VLenArray.encode
ValueError: only 1-dimensional arrays are supported

If this sounds like a bad idea I can think of workarounds where I flatten the arrays ahead of time, keep track of that D number, and reshape them on reading, but I thought I'd ask! Thanks!!

Sep 06 '19 04:09 sofroniewn

Hi @sofroniewn, I haven't thought about this deeply but I imagine the codec could be modified to be aware of the expected number of dimensions in each array, and then to encode the length of all dimensions. Currently the encode method encodes the data by interleaving the array lengths and the array buffers into a single contiguous buffer. So for 2D arrays you would have to store 2 ints, then data, then 2 ints, then data, etc.
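To make the interleaving concrete, here is a minimal pure-Python sketch of a length-prefixed layout for 1-D arrays. It is an illustration only, not the Cython code in numcodecs/vlen.pyx: the function names `encode_1d` and `decode_1d`, the 4-byte item-count header, and the little-endian 4-byte byte-lengths are all assumptions made for the example.

```python
import struct

import numpy as np


def encode_1d(arrays, dtype="f8"):
    """Interleave a 4-byte length with each 1-D array's raw bytes."""
    # Assumed header for this sketch: the number of items in the chunk.
    parts = [struct.pack("<I", len(arrays))]
    for a in arrays:
        a = np.ascontiguousarray(a, dtype=dtype)
        if a.ndim != 1:
            raise ValueError("only 1-dimensional arrays are supported")
        parts.append(struct.pack("<I", a.nbytes))  # length of the data in bytes
        parts.append(a.tobytes())                  # the data itself
    return b"".join(parts)


def decode_1d(buf, dtype="f8"):
    """Walk the buffer: read a length, then that many bytes of data."""
    itemsize = np.dtype(dtype).itemsize
    (n_items,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    out = []
    for _ in range(n_items):
        (nbytes,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        out.append(np.frombuffer(buf, dtype=dtype,
                                 count=nbytes // itemsize, offset=offset))
        offset += nbytes
    return out
```

With this layout a single length per item is enough to recover a 1-D array, which is why a 2-D array would need one stored length per dimension.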

That would be relatively straightforward if all the arrays are 2D. It might get a bit messier if you had a mix of arrays with different numbers of dimensions.

You may find it easier to flatten the arrays and keep track of array shapes separately and reshape on reading :-)
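A rough sketch of that workaround, assuming every array is N x D with the same D. The helper names `flatten_ragged` and `restore_ragged` are made up for this example and are not part of zarr or numcodecs.

```python
import numpy as np


def flatten_ragged(arrays, dtype="f8"):
    """Flatten N x D arrays to 1-D and return them with the shared D."""
    widths = {a.shape[1] for a in arrays}
    if len(widths) != 1:
        raise ValueError("all arrays must share the same second dimension")
    (d,) = widths
    # An object array of 1-D arrays is the shape of input that a
    # variable-length 1-D codec can handle.
    flat = np.empty(len(arrays), dtype=object)
    for i, a in enumerate(arrays):
        flat[i] = np.ascontiguousarray(a, dtype=dtype).ravel()
    return flat, d


def restore_ragged(flat, d):
    """Undo flatten_ragged: reshape each 1-D array back to N x D."""
    return [a.reshape(-1, d) for a in flat]
```

The value of D has to be kept somewhere alongside the data, for example in the array's attributes, so that the reader can reshape without guessing.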

Sep 06 '19 07:09 alimanfoo

Btw don't mean this to sound discouraging, happy to help further if you think it's worth exploring changes to the codec.

Sep 06 '19 07:09 alimanfoo

@alimanfoo thanks for thinking about this. If you could do it in the codec with guaranteed 2D arrays then I'd strongly prefer that compared to having to do the flattening and reshaping on my end. It will make my APIs much simpler and more consistent - sometimes I just have normal arrays, sometimes I have these ragged arrays, and the code will look much more similar if the codec can handle it.

One proposal might be to go all the way to the fully general case and support a mix of arrays with different numbers of dimensions too, where we interleave everything into a single contiguous buffer in which the first number is the number of dimensions, say D, the next D numbers are the sizes of each of the D dimensions, and then comes the flattened array.

The change to the existing codec for 1-D arrays would be that an additional 1 would appear at the beginning of every block. For my all-2-D arrays there would be an additional 2 at the beginning of every block, but this scheme would support the fully general case of mixing.

What are your thoughts? I'm new to the concepts in numcodecs so there might be things I'm not considering with this scheme.

Sep 06 '19 14:09 sofroniewn

This approach sounds fine to me, we'd only need a single byte to store the number of dimensions, then 4 bytes for each dimension to store the lengths.
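As a sketch of what one block could look like under that scheme (1 byte for the number of dimensions, 4 bytes per dimension, then the flattened data), the following packs and unpacks a single array. The names `pack_record` and `unpack_record`, the little-endian byte order, and the fixed "f8" dtype are assumptions for illustration.

```python
import math
import struct

import numpy as np


def pack_record(a, dtype="f8"):
    """Pack one array as: ndim (1 byte), shape (4 bytes each), raw data."""
    a = np.ascontiguousarray(a, dtype=dtype)
    header = struct.pack("<B", a.ndim) + struct.pack(f"<{a.ndim}I", *a.shape)
    return header + a.tobytes()


def unpack_record(buf, offset=0, dtype="f8"):
    """Read one record at `offset`; return the array and the next offset."""
    (ndim,) = struct.unpack_from("<B", buf, offset)
    offset += 1
    shape = struct.unpack_from(f"<{ndim}I", buf, offset)
    offset += 4 * ndim
    count = math.prod(shape)
    data = np.frombuffer(buf, dtype=dtype, count=count, offset=offset)
    offset += count * np.dtype(dtype).itemsize
    return data.reshape(shape), offset
```

Under these assumptions the per-array overhead is 5 bytes for a 1-D array and 9 bytes for a 2-D array, compared with the data itself.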

I might be inclined to code this up as a separate codec, rather than adapt the existing VLenArray codec, just because we would not have to worry about any data migration issues.

E.g., create a new codec class called something like VLenNDArray, with codec ID "vlen-ndarray".

Within Zarr we could then add a convenience to use dtype="ndarray:T", which is a shorthand for dtype=object, object_codec=numcodecs.VLenNDArray(T).

Sep 06 '19 15:09 alimanfoo

A new codec class VLenNDArray with convenience shorthands makes sense. I'm happy to give this a try myself - though as I said I'm new to the codebase, so any additional tips before I get started would be great if that's ok with you.

Sep 06 '19 15:09 sofroniewn

Cool, thank you. I'd start by copy-pasting the VLenArray codec class.

During encoding, the codec makes two passes over the input, one to collect the lengths, the second to write the data. So the first pass would need to be modified to also collect the number of dimensions for each array, and to store lengths for multiple dimensions. The second pass would then need to be modified to include writing out the number of dimensions and lengths of all dimensions.

During decoding, there are then some modifications to read back the number of dimensions and dimension lengths.
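The two passes and the matching decode loop might look roughly like this in plain Python. This is a sketch under assumptions, not the eventual codec: `encode_nd` and `decode_nd` are hypothetical names, the 4-byte item-count header is assumed, and a real implementation would presumably live in Cython next to VLenArray.

```python
import math
import struct

import numpy as np


def encode_nd(arrays, dtype="f8"):
    dt = np.dtype(dtype)
    # Pass 1: normalise the inputs and work out the total buffer size
    # from each array's number of dimensions, shape and data length.
    normed = [np.ascontiguousarray(a, dtype=dt) for a in arrays]
    total = 4  # assumed header: number of items
    for a in normed:
        total += 1 + 4 * a.ndim + a.nbytes
    # Pass 2: write ndim, the dimension lengths and the data for each
    # array into one preallocated buffer.
    out = bytearray(total)
    struct.pack_into("<I", out, 0, len(normed))
    offset = 4
    for a in normed:
        struct.pack_into(f"<B{a.ndim}I", out, offset, a.ndim, *a.shape)
        offset += 1 + 4 * a.ndim
        out[offset:offset + a.nbytes] = a.tobytes()
        offset += a.nbytes
    return bytes(out)


def decode_nd(buf, dtype="f8"):
    dt = np.dtype(dtype)
    (n_items,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    out = np.empty(n_items, dtype=object)
    for i in range(n_items):
        # Read back the number of dimensions, then that many lengths.
        (ndim,) = struct.unpack_from("<B", buf, offset)
        shape = struct.unpack_from(f"<{ndim}I", buf, offset + 1)
        offset += 1 + 4 * ndim
        count = math.prod(shape)
        flat = np.frombuffer(buf, dtype=dt, count=count, offset=offset)
        out[i] = flat.reshape(shape)
        offset += count * dt.itemsize
    return out
```

Because each block carries its own number of dimensions, the same loop handles a chunk that mixes 1-D, 2-D and higher-dimensional arrays.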

HTH, give me a shout if any questions about this or how to set up a dev environment to test locally.

Sep 06 '19 15:09 alimanfoo

@alimanfoo Could you take a look at @sofroniewn's work (https://github.com/zarr-developers/numcodecs/pull/200)? He has been pinging people, but there seems to be no response from the Zarr developers, which would be a shame given his hard work.

I'm interested in this functionality to store 2-channel audio recordings of varying length per recording.

Mar 23 '20 00:03 NumesSanguis

Hi @NumesSanguis, sorry for radio silence on this one, I've taken a look at the PR and seems good, added a few small comments.

Mar 25 '20 12:03 alimanfoo