Extending ChunkManifest to contain 32 KB of per-chunk initialization data for decompression algorithms
"A new sub-chunking strategy for fast netCDF-4 access in local, remote and cloud infrastructures" was a super interesting poster at EGU; it creates sub-chunks for existing NetCDF files. The approach seems similar to indexed_gzip, which stores a 32 KB seed for the decompression algorithm for each chunk.
@CedricPenard and Flavien Gouillon are interested in integrating this functionality into VirtualiZarr. To do this, we would need an additional array in the chunk manifest containing the initialization data for the decompression algorithm. Could ChunkManifest support that, and similarly, could Icechunk support serializing that additional chunk-level metadata?
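To make the idea concrete, here is a minimal sketch of what a per-chunk manifest entry carrying decompressor initialization data might look like. This is purely illustrative: the `SubChunkEntry` class and the `init_data` field are hypothetical and are not part of the current VirtualiZarr ChunkManifest API.

```python
from dataclasses import dataclass


@dataclass
class SubChunkEntry:
    """Hypothetical manifest entry for one sub-chunk of a compressed chunk.

    `path`, `offset`, and `length` mirror the byte-range information a chunk
    manifest already stores; `init_data` is the proposed addition: the ~32 KB
    of decompressor state needed to start decompressing at `offset` without
    reading the chunk from its beginning.
    """
    path: str         # URL or path of the original file
    offset: int       # byte offset of the sub-chunk's compressed data
    length: int       # number of compressed bytes to read
    init_data: bytes  # proposed extra field: decompressor seed (<= 32 KB)


# Illustrative only -- a manifest would hold one entry per sub-chunk key.
manifest = {
    "0.0": SubChunkEntry("s3://bucket/file.nc", offset=4096, length=65536,
                         init_data=b"\x00" * 32768),
}
```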
I really think this would be a total game changer. Of course, 32 KB per chunk is a significant cost, but many folks don't want to change their data production workflows, and this would unlock virtualizing, at a minimum:
- Most obviously, NetCDFs compressed with GZIP, Deflate, etc. whose chunks are too big for optimal Zarr access
- ZIP files compressed with GZIP, Deflate, etc. (e.g., Sentinel SAFE archives such as OLCI)
- Zarr V2 data as sharded V3 data
- NIfTI datasets - the motivation for indexed_gzip
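For background on what the 32 KB seed actually is: deflate back-references reach at most 32 KB into previously decompressed output, so storing that window alongside a byte offset is (roughly) enough to resume decompression mid-stream; this is the zran technique that indexed_gzip builds on. The snippet below is a minimal sketch of the mechanism using Python's zlib preset-dictionary support; it fakes the mid-stream scenario by compressing the second sub-chunk against the seed, whereas real gzip streams additionally require bit-level offset tracking, which tools like indexed_gzip handle.

```python
import zlib

WINDOW = 32 * 1024  # deflate's maximum back-reference distance

# Pretend these are two consecutive sub-chunks of one uncompressed chunk.
part_a = b"temperature,pressure,salinity\n" * 2000
part_b = b"salinity,pressure,temperature\n" * 2000

# The state needed to continue decompressing mid-stream is (essentially)
# the previous 32 KB of uncompressed output -- the "seed".
seed = part_a[-WINDOW:]

# Compress part_b as raw deflate with the seed as a preset dictionary,
# mimicking how it would be encoded inside one continuous stream.
comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS, zdict=seed)
compressed_b = comp.compress(part_b) + comp.flush()

# A reader holding only `compressed_b` plus the 32 KB seed can recover
# part_b without touching any earlier compressed bytes.
decomp = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=seed)
assert decomp.decompress(compressed_b) + decomp.flush() == part_b
```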
Thanks for raising this @maxrjones. I think I met @CedricPenard and Flavien at AGU in December 👋
> an additional array in the chunk manifest
This could potentially be done. It's arguably an example of https://github.com/zarr-developers/VirtualiZarr/issues/246.
However, I need a bit more context to properly understand what is being proposed here:
- My recollection was that the purpose of sub-chunking was to allow indexing inside chunks, useful for extracting data in an access pattern orthogonal to the original chunking (e.g. timeseries vs spatial pancakes). But your description makes it sound like it allows us to virtualize file formats with compression that we couldn't otherwise support. Which of those is true?
- I didn't realise supporting this had any connection to any other file format. Can you explain what the common thread is?
- Does the netCDF sub-chunking work on any netCDF file, or only on specially-created netCDF files? (The link https://github.com/CNES/netCDFchunkindex does not have a useful readme to help me understand this.)
- What exactly would Zarr / Icechunk have to do differently at read-time to support this?
Hi, the documentation at https://github.com/CNES/netCDFchunkindex is not ready yet; I will complete it in the coming weeks. Sub-chunking works with any netCDF file. The goal is to keep the original file structure and format untouched, so it only creates a new index file (itself in netCDF format). The purpose is to improve data extraction for access patterns that are not optimal for the original chunking.