default compression for embedded data
Currently, any embedded arrays are written to the output by zarr using default options. This means they will be blosc-compressed, which has downsides:
- the user needs to have blosc available
- for small arrays, the extra header and frame likely means bigger output
- the final output should be compressed anyway (Zstd works well for the large number of strings we usually encounter, or of course whatever would work well in parquet if we went in that direction)
Should we set the compression to None or something else?
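To make the small-array point concrete, here is a minimal sketch of the framing overhead. It uses `zlib` from the standard library purely as a stand-in for blosc (an assumption made so the snippet has no dependencies); the exact numbers differ for blosc, but the effect is the same in kind: a fixed header or frame dominates when the payload is only a few bytes. `framing_overhead` is a hypothetical helper name, not part of zarr or kerchunk.

```python
import zlib


def framing_overhead(raw: bytes) -> int:
    """Return compressed size minus raw size (positive means compression made it bigger)."""
    compressed = zlib.compress(raw)
    # Sanity check: the round trip must be lossless.
    assert zlib.decompress(compressed) == raw
    return len(compressed) - len(raw)


# A tiny "embedded array": 16 distinct uint8 values, nothing repeated to exploit.
small = bytes(range(1, 17))
# A large, highly repetitive buffer, where compression clearly pays off.
large = bytes(100_000)

small_overhead = framing_overhead(small)
large_overhead = framing_overhead(large)
print("small array overhead (bytes):", small_overhead)
print("large array overhead (bytes):", large_overhead)
```

With zarr v2, disabling compression for such arrays should amount to passing `compressor=None` at array creation, though where exactly that would be set in the writer is left open here.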
@martindurant does this eventually open the door to trying uncompressed Parquet files too? Or am I misunderstanding, and it only concerns writing out a Kerchunk index, read via Xarray, to a new Zarr?
PS: an older, possibly relevant comment: https://github.com/fsspec/kerchunk/issues/345#issuecomment-1806004597
Is it too early to talk about https://github.com/Blosc/python-blosc2? There does not seem to be a single reference to Blosc2 in the Kerchunk repository at the moment.
The available compressions are entirely up to upstream zarr - the IO layer doesn't do any compression by itself.
> does this eventually open up the door to try uncompressed Parquet files too
I should have answered before, sorry. This is totally possible right now. Parquet is complex though:
- pyarrow will not give you the metadata you need to find data buffers; you would need the flexibility of fastparquet, and to scan the column data for the constituent pages
- there are various encoding options, and the idea is probably only applicable to PLAIN (although we could in theory have various codecs, perhaps built off fastparquet code). Dict encoding is also very common, where the dictionary page and the values pages are encoded separately.
- in v1 parquet pages, the nulls and repetition information are compressed along with the values, so indeed only uncompressed pages would be easily parsable (although the extra information would normally be a small part of the payload). Plus, what do you return in the case of nulls/repetitions?
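As a rough illustration of why uncompressed PLAIN pages are the easy case: PLAIN-encoded fixed-width values are just little-endian bytes laid end to end, so a kerchunk-style `[url, offset, length]` reference can be decoded directly. The sketch below builds a fake buffer rather than a real parquet file; the header bytes, the URL and the `read_plain_int32` helper are all hypothetical, and finding the real page offsets is exactly the hard part described above.

```python
import struct


def read_plain_int32(buf: bytes, offset: int, length: int) -> list[int]:
    """Decode `length` bytes starting at `offset` as little-endian INT32 values."""
    if length % 4:
        raise ValueError("length must be a multiple of 4 for INT32")
    chunk = buf[offset:offset + length]
    return list(struct.unpack(f"<{length // 4}i", chunk))


# Fake "file": a placeholder standing in for a page header (not real Thrift),
# followed by the values exactly as PLAIN encoding would lay them out.
header = b"\x00" * 10
values = [3, 1, 4, 1, 5]
file_bytes = header + struct.pack(f"<{len(values)}i", *values)

# Kerchunk-style reference: [url, offset, length]. The URL is illustrative only.
ref = ["memory://example.parquet", len(header), 4 * len(values)]
decoded = read_plain_int32(file_bytes, ref[1], ref[2])
print(decoded)
```

This only covers required (non-null), non-repeated columns; definition and repetition levels, dictionary pages and compressed pages would each need their own handling.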
Perhaps feather2 would be simpler to work with, although the flatbuffer parsing libraries are much weaker in Python. Many of the same ideas apply.