zarr-python
v3 array creation: codecs
One thing about zarr v2 -> v3 that might surprise users is the change from the v2 compressor metadata (a single thing) + filters (an ordered collection) to the v3 codecs metadata (an ordered collection with a special required element).
I suspect most users coming from v2 won't use array-array or bytes-bytes codecs. These users will think in terms of a single compressor for their data, if they think about the compressor at all. For such users, the codecs keyword argument in v3 array creation will be confusing, because a) it's not called "compressor", and b) it's an iterable. Users who do use filters will wonder where the filters keyword argument went. They will have to discover that their filters are now called "codecs", and that these codecs should be placed before the thing that used to be called the compressor.
I wonder if we could smooth out some of this confusion by adding an abstraction on top of the v3 codecs metadata in our array creation routines, and returning to v2 terminology. Specifically, we could use the keyword "filters" to denote array-array codecs, "compressor" to denote the required array-bytes compressor, and introduce a new, v3-array-only keyword "post_compressor" to denote any bytes-bytes codecs. I'm not wedded to this name, feel free to suggest something better.
It would be an error to request a v2 array with a post-compressor; otherwise, the exact same keywords would work for both v2 and v3 array creation routines. Ergonomically this feels like an improvement, and it would simplify today's chimeric AsyncArray.create function, which is burdened with supporting the mutually exclusive codecs and compressor / filters keyword arguments.
e.g.

    def create(
        shape,
        dtype,
        filters: Iterable[ArrayArrayCodec],
        compressor: ArrayBytesCodec,
        post_compressor: Iterable[BytesBytesCodec],
        zarr_format, ...) -> AsyncArray
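
To make the proposed mapping concrete, here is a rough sketch of how the three keywords could be folded into a single ordered v3 codecs sequence, and how the v2 check might look. The codec classes below are hypothetical stand-ins defined only for this example, and build_v3_codecs and check_keywords are made-up helper names, not existing zarr-python functions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


# Hypothetical stand-ins for the three v3 codec kinds (not zarr-python classes).
@dataclass(frozen=True)
class ArrayArrayCodec:
    name: str


@dataclass(frozen=True)
class ArrayBytesCodec:
    name: str


@dataclass(frozen=True)
class BytesBytesCodec:
    name: str


Codec = Union[ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec]


def check_keywords(
    zarr_format: int,
    post_compressor: Iterable[BytesBytesCodec] = (),
) -> None:
    """Reject keyword combinations that the requested format cannot express."""
    if zarr_format not in (2, 3):
        raise ValueError(f"unsupported zarr_format: {zarr_format}")
    if zarr_format == 2 and tuple(post_compressor):
        raise ValueError("post_compressor is only valid for zarr_format=3")


def build_v3_codecs(
    compressor: ArrayBytesCodec,
    filters: Iterable[ArrayArrayCodec] = (),
    post_compressor: Iterable[BytesBytesCodec] = (),
) -> Tuple[Codec, ...]:
    """Order the codecs as filters, then compressor, then post-compressors."""
    return (*filters, compressor, *post_compressor)
```

With this, a call such as build_v3_codecs(ArrayBytesCodec("bytes"), filters=[ArrayArrayCodec("transpose")]) yields the filters followed by the required array-bytes codec, so users never have to assemble the ordered codecs sequence by hand.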
Thoughts? Especially from people kicking the tires on the v3 array API (@rabernat).