
Represent the top-level `chunk_grid` as an "array -> files" codec

Open jbms opened this issue 6 months ago • 1 comments

Codecs in zarr v3 can do a lot of things, but they are still limited by the top-level chunk_grid. The entire codecs sequence operates on only a single chunk within the array, and ultimately must convert just that single chunk to a single byte sequence.

One specific limitation of this design is that array -> array codecs cannot be applied to the entire array (e.g. to transpose the entire array), even though that would be well defined and would not impose additional implementation difficulty beyond what is already required to support array -> array codecs that occur before sharding_indexed.
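
To make the distinction concrete, here is a minimal pure-Python sketch (not taken from any zarr implementation) comparing a transpose applied inside each chunk with a transpose applied to the whole array before chunking. The helper names `transpose` and `chunk`, and the 6x6 array with a (2, 3) chunk shape, are assumptions chosen only for illustration.

```python
def transpose(a):
    """Transpose a 2-D list of lists."""
    return [list(row) for row in zip(*a)]


def chunk(a, r0, c0, shape):
    """Extract the sub-block of `a` starting at (r0, c0) with the given shape."""
    return [row[c0:c0 + shape[1]] for row in a[r0:r0 + shape[0]]]


# Logical 6x6 array where the value at (i, j) is 10*i + j,
# so each value reveals its own logical coordinates.
logical = [[10 * i + j for j in range(6)] for i in range(6)]
chunk_shape = (2, 3)

# Current design: the grid is applied first, then the codec transposes
# the contents of a single chunk. The chunk still covers logical [0:2, 0:3].
per_chunk = transpose(chunk(logical, 0, 0, chunk_shape))

# Proposed design: the whole array is transposed first, then the grid is
# applied to the result. The first chunk now covers logical [0:3, 0:2].
whole_array = chunk(transpose(logical), 0, 0, chunk_shape)

print(per_chunk)
print(whole_array)
```

The two results cover different regions of the logical array, which is why a whole-array transpose cannot be emulated by a per-chunk codec alone.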

Instead, we could make the top-level grid structure a codec also:

{
  "shape": [100, 100],
  "data_type": "int32",
  "codecs": [
    {"name": "transpose", "configuration": {"order": [1, 0]}},
    {"name": "regular_grid",
     "configuration": {
       "chunk_shape": [10, 10],
       "chunk_key_encoding": "default",
       "chunk_codecs": [
         {"name": "bytes", "configuration": {"endian": "little"}},
         {"name": "gzip", "configuration": {"level": 5}}
       ]
     }
    }
  ],
  ...
}

Under this new design, the top-level chunk_grid from our current zarr v3 is instead handled by a new type of codec, an "array -> files" codec. The new top-level codecs property would contain a list of zero or more "array -> array" codecs followed by exactly one "array -> files" codec. The chunk_codecs property listed above would then contain a list of zero or more array -> array codecs, followed by exactly one array -> bytes codec, followed by zero or more bytes -> bytes codecs.
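
The ordering rules above could be checked with something like the following sketch. The `CODEC_KINDS` table and the function names are hypothetical (the proposal does not define a codec registry), and `regular_grid` is just the illustrative codec name from the example metadata.

```python
# Hypothetical mapping from codec name to codec kind; not part of any spec.
CODEC_KINDS = {
    "transpose": "array->array",
    "regular_grid": "array->files",
    "bytes": "array->bytes",
    "sharding_indexed": "array->bytes",
    "gzip": "bytes->bytes",
}


def kinds(codecs):
    """Look up the kind of each codec; unknown names raise KeyError."""
    return [CODEC_KINDS[c["name"]] for c in codecs]


def check_chunk_codecs(codecs):
    """Zero or more array->array, exactly one array->bytes, then bytes->bytes."""
    ks = kinds(codecs)
    i = 0
    while i < len(ks) and ks[i] == "array->array":
        i += 1
    if i == len(ks) or ks[i] != "array->bytes":
        raise ValueError("chunk_codecs needs exactly one array->bytes codec")
    if any(k != "bytes->bytes" for k in ks[i + 1:]):
        raise ValueError("only bytes->bytes codecs may follow array->bytes")


def check_top_level(codecs):
    """Zero or more array->array codecs, then exactly one array->files codec."""
    ks = kinds(codecs)
    if not ks or ks[-1] != "array->files":
        raise ValueError("codecs must end with one array->files codec")
    if any(k != "array->array" for k in ks[:-1]):
        raise ValueError("only array->array codecs may precede array->files")
    check_chunk_codecs(codecs[-1]["configuration"]["chunk_codecs"])


example_codecs = [
    {"name": "transpose", "configuration": {"order": [1, 0]}},
    {"name": "regular_grid", "configuration": {
        "chunk_shape": [10, 10],
        "chunk_key_encoding": "default",
        "chunk_codecs": [
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "gzip", "configuration": {"level": 5}},
        ],
    }},
]
check_top_level(example_codecs)  # raises ValueError if the ordering is invalid
```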

In addition to allowing "array -> array" codecs to be applied to the array as a whole (which, granted, could also easily be supported by adding a new top-level property like top_level_array_codecs), this would also allow individual chunks to be encoded as multiple files, without requiring a complicated interaction between a codec and a storage transformer. For example, you could write a sharding_indexed chunk as a separate data and index file, or a vlen chunk as a separate index and data file, or use a sparse array encoding where the coordinate information is stored in a separate file from the value information.
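
As a rough illustration of the multi-file idea, an "array -> files" codec could return a mapping from store keys to byte strings instead of a single byte sequence. The sketch below encodes a variable-length chunk as a separate index file and data file. The function names, the `/index` and `/data` key suffixes, and the little-endian uint64 offset layout are all assumptions made for this example, not part of the proposal.

```python
import struct


def encode_vlen_chunk(chunk_key, items):
    """Encode a list of byte strings as two files: offsets and concatenated data."""
    offsets = [0]
    for item in items:
        offsets.append(offsets[-1] + len(item))
    # Index file: len(items) + 1 little-endian uint64 offsets into the data file.
    index = struct.pack(f"<{len(offsets)}Q", *offsets)
    data = b"".join(items)
    return {f"{chunk_key}/index": index, f"{chunk_key}/data": data}


def decode_vlen_chunk(chunk_key, files):
    """Inverse of encode_vlen_chunk: slice the data file using the index file."""
    index = files[f"{chunk_key}/index"]
    data = files[f"{chunk_key}/data"]
    offsets = struct.unpack(f"<{len(index) // 8}Q", index)
    return [data[start:end] for start, end in zip(offsets, offsets[1:])]


files = encode_vlen_chunk("c/0/0", [b"a", b"bcd", b""])
print(sorted(files))
print(decode_vlen_chunk("c/0/0", files))
```

Because the index lives in its own file here, a reader could fetch it without touching the data file, which is the kind of layout that would otherwise need coordination between a codec and a storage transformer.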

This is still a pretty speculative proposal at this point but I'd welcome feedback.

jbms, May 16 '25 20:05

In general this would make zarr into much more of a universal array representation --- for example you could have a tiledb-like format where files correspond to write operations (which may cover an arbitrary set of chunks) rather than individual chunks.

jbms, May 16 '25 20:05