Kerchunk and Zarr V3

Open jhamman opened this issue 3 years ago • 7 comments

The Zarr V3 spec is now undergoing public review and testing. This issue raises the question of how Kerchunk should integrate with the new spec.

Key changes in the V3 spec that are particularly relevant to Kerchunk (https://github.com/zarr-developers/zarr-specs/pull/149):

  1. change to chunk and metadata key names
  2. introduction of storage transformer extensions
  3. likely introduction of a Sharding Storage Transformer extension: https://github.com/zarr-developers/zarr-specs/pull/152

Questions:

  1. Has any work been done to produce kerchunk references that align with the v3 storage key conventions?
  2. Could Kerchunk be thought of as a storage transformer in v3?
  3. The Sharding proposal (linked above) includes some references to putting shards in hdf5 files. Could Kerchunk extend that spec?

jhamman, Oct 18 '22 15:10

Some of this I'll have to think about, but some things I can answer immediately.

  • Kerchunk could produce v3 reference sets right now, and indeed convert v2<->v3 no problem, since it's only a rearrangement of paths. I don't think this would come with any benefit, though. No work has been done.
  • I am not sure kerchunk can be a storage transformer rather than a storage provider. If yes, I don't see why it would be beneficial in itself. There would need to be more done in that transformer to be worth it.
  • Yes, there is a thought to providing shards via kerchunk (https://github.com/fsspec/kerchunk/issues/134 and preffs) in a manner similar to but independent of the sharding spec.
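
To make the "rearrangement of paths" point concrete, here is a minimal sketch of rekeying the chunk entries of a reference set. It is an illustration under stated assumptions, not kerchunk API: the function names are hypothetical, v2 keys are assumed to use the default `.` separator, and the target is the `c/`-prefixed, `/`-separated chunk key layout of the finalized v3 spec, which may differ from the draft that was under review in this thread. Metadata entries (`.zarray`, `.zattrs`) are passed through unchanged, and zero-dimensional arrays are not handled.

```python
import re

# A v2 chunk key's last path segment looks like "0", "0.1", "3.0.12", ...
_V2_CHUNK_NAME = re.compile(r"^\d+(\.\d+)*$")


def v2_chunk_key_to_v3(key: str) -> str:
    """Map e.g. 'temp/0.1' to 'temp/c/0/1'; leave non-chunk keys alone.

    Assumes '.'-separated v2 chunk keys (hypothetical helper).
    """
    prefix, _, name = key.rpartition("/")
    if not _V2_CHUNK_NAME.match(name):
        return key  # metadata keys are not translated in this sketch
    new_name = "c/" + name.replace(".", "/")
    return f"{prefix}/{new_name}" if prefix else new_name


def rekey_references(refs: dict) -> dict:
    """Return a new mapping with chunk keys renamed; values are untouched."""
    return {v2_chunk_key_to_v3(k): v for k, v in refs.items()}


# Reference values are either inline strings or [url, offset, length].
example_refs = {
    "temp/.zarray": "{}",
    "temp/0.0": ["s3://example-bucket/file.nc", 4096, 1024],
    "temp/0.1": ["s3://example-bucket/file.nc", 5120, 1024],
}
rekeyed = rekey_references(example_refs)
```

Only the keys change; the byte ranges still point at the same original file, which is why this kind of conversion needs no data rewrite.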

I also want to mention that kerchunk should be useful for more than just zarr, so I will tend to favour things being coded in the storage layer rather than zarr-specific extensions. For example, reordering and selecting parquet files without touching the originals is something that kerchunk can do now. If you wanted full tabular Iceberg compatibility using kerchunk/referenceFS, you could implement that now without too much trouble.

Here is the simplest non-zarr idea for CSVs: https://github.com/fsspec/kerchunk/issues/66 (and, more generally, random access of delimited/block compressed data).
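
As a rough picture of what random access to delimited data could look like, the sketch below computes `[url, offset, length]` references for blocks of whole rows in an uncompressed CSV. This is an assumption-laden illustration, not what the linked issue specifies: the function name and the `block.N` key naming are hypothetical, and rows are assumed to contain no embedded newlines inside quoted fields.

```python
def csv_block_references(data: bytes, url: str, rows_per_block: int) -> dict:
    """Return {key: [url, offset, length]} covering blocks of whole rows.

    Hypothetical helper: each reference is a byte range that starts and
    ends on a row boundary, so a block can be fetched and parsed alone.
    """
    refs = {}
    offset = 0        # running byte position in the file
    block_start = 0   # byte position where the current block began
    rows = 0          # rows accumulated in the current block
    block = 0         # index of the current block
    for line in data.splitlines(keepends=True):
        offset += len(line)
        rows += 1
        if rows == rows_per_block:
            refs[f"block.{block}"] = [url, block_start, offset - block_start]
            block_start, rows, block = offset, 0, block + 1
    if offset > block_start:  # trailing partial block
        refs[f"block.{block}"] = [url, block_start, offset - block_start]
    return refs


data = b"a,b\n1,2\n3,4\n5,6\n7,8\n"
refs = csv_block_references(data, "file:///tmp/example.csv", rows_per_block=2)

# Reading one block only needs its byte range, not the whole file.
_, start, length = refs["block.1"]
second_block = data[start:start + length]
```

Block-compressed formats would need offsets of the compressed blocks rather than of rows, which this sketch does not attempt.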

martindurant, Oct 18 '22 15:10

An application for this would be backporting Zarr v3 shards for availability via Zarr v2.

mkitti, Jul 22 '24 18:07

? I thought one of the main reasons for having a V3 at all was so that we could have new things like sharding?

That is presumably why my working variable-chunking implementation for v2 was not given consideration.

martindurant, Jul 22 '24 18:07