v3 array creation: codecs

Open d-v-b opened this issue 8 months ago • 0 comments

One thing about zarr v2 -> v3 that might surprise users is the change from the v2 compressor metadata (a single thing) + filters (an ordered collection) to the v3 codecs metadata (an ordered collection with a special required element).

I suspect most users coming from v2 won't use array-array or bytes-bytes codecs. These users will think in terms of a single compressor for their data, if they worry about the compressor at all. For such users, the codecs keyword argument in v3 array creation will be confusing, because a) it's not called "compressor", and b) it's an iterable. Users who do use filters will wonder where the filters keyword argument went; they will have to discover that their filters are now called "codecs", and that these codecs must be placed before the thing that used to be called the compressor.
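To make the difference in shape concrete, here is a rough illustration of the two metadata layouts as plain Python dicts. The key names follow my reading of the v2 and v3 specs; the specific codecs and configuration values are only illustrative assumptions, not taken from any real array.

```python
# Fragment of v2 array metadata: one compressor slot plus an ordered filter list.
v2_metadata = {
    "compressor": {"id": "zlib", "level": 1},
    "filters": [{"id": "delta", "dtype": "<i4", "astype": "<i4"}],
}

# Fragment of v3 array metadata: a single ordered list of codecs.
# In the v3 core spec, "transpose" is array-array, "bytes" is array-bytes,
# and "gzip" is bytes-bytes.
v3_metadata = {
    "codecs": [
        {"name": "transpose", "configuration": {"order": [1, 0]}},
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "gzip", "configuration": {"level": 1}},
    ],
}

# v2 spreads the pipeline over two keys; v3 uses one.
v2_keys = sorted(v2_metadata)
v3_keys = sorted(v3_metadata)
```

Nothing in the v3 fragment is named "compressor" or "filters", which is the naming gap described above.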

I wonder if we could smooth out some of this confusion by adding an abstraction on top of the v3 codecs metadata in our array creation routines, and returning to v2 terminology. Specifically, we could use the keyword "filters" to denote array-array codecs, "compressor" to denote the required array-bytes compressor, and introduce a new, v3-array-only keyword "post_compressor" to denote any bytes-bytes codecs. I'm not wedded to this name, feel free to suggest something better.

Requesting a v2 array with a post_compressor would be an error; otherwise, the exact same keywords would work for both v2 and v3 array creation routines. Ergonomically this feels like an improvement, and it would simplify today's chimeric AsyncArray.create function, which is burdened with supporting the mutually exclusive codecs and compressor / filters keyword arguments.

e.g.

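def create(
    shape,
    dtype,
    filters: Iterable[ArrayArrayCodec],          # array-array codecs, applied first
    compressor: ArrayBytesCodec,                 # the single required array-bytes codec
    post_compressor: Iterable[BytesBytesCodec],  # bytes-bytes codecs, v3 only
    zarr_format,
    ...  # remaining parameters elided
) -> AsyncArray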
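
The sketch below shows one way the three keywords could be folded into a single ordered codec sequence, including the v2 error case described above. It is only a sketch under stated assumptions: the three dataclasses and `build_codec_pipeline` are hypothetical stand-ins invented for illustration, not zarr-python classes or functions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


# Hypothetical stand-ins for the three codec categories (NOT the zarr-python classes).
@dataclass(frozen=True)
class ArrayArrayCodec:
    name: str


@dataclass(frozen=True)
class ArrayBytesCodec:
    name: str


@dataclass(frozen=True)
class BytesBytesCodec:
    name: str


Codec = Union[ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec]


def build_codec_pipeline(
    compressor: ArrayBytesCodec,
    filters: Iterable[ArrayArrayCodec] = (),
    post_compressor: Iterable[BytesBytesCodec] = (),
    zarr_format: int = 3,
) -> Tuple[Codec, ...]:
    """Hypothetical helper: order the keywords as filters, compressor, post_compressor."""
    filters = tuple(filters)
    post_compressor = tuple(post_compressor)
    if zarr_format not in (2, 3):
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")
    # The proposal treats post_compressor as a v3-only keyword.
    if zarr_format == 2 and post_compressor:
        raise ValueError("post_compressor is only valid for zarr_format=3")
    return (*filters, compressor, *post_compressor)


pipeline = build_codec_pipeline(
    compressor=ArrayBytesCodec("bytes"),
    filters=[ArrayArrayCodec("transpose")],
    post_compressor=[BytesBytesCodec("gzip")],
)
pipeline_names = [codec.name for codec in pipeline]
```

With a helper along these lines, the user-facing keywords stay the same for both formats, and the only format-specific branch is the check that rejects `post_compressor` for v2.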

Thoughts? Especially from people kicking the tires on the v3 array API (@rabernat).

d-v-b, Jun 02 '24 12:06