kerchunk
Compression and filters as properties of chunk instead of variable
Different netCDF/HDF5 files can store the same variable with different configurations. For example, in https://s3.amazonaws.com/era5-pds/2021/05/data/air_pressure_at_mean_sea_level.nc the variable time0 is contiguous, whereas in https://s3.amazonaws.com/era5-pds/2021/01/data/air_pressure_at_mean_sea_level.nc it is chunked.
Although it is easy to simulate chunks from a contiguous variable, the filters and compression set in time0/.zarray apply to the whole variable, so combining files whose configurations differ produces an error.
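To make the conflict concrete, here is a minimal sketch that compares the array-level metadata of the same variable in two reference sets and reports the keys that disagree. The helper name `zarray_conflicts` and the metadata values are made up for illustration; they were not read from the ERA5 files.

```python
import json


def zarray_conflicts(meta_a: str, meta_b: str) -> dict:
    """Return the storage-related .zarray keys whose values differ."""
    a, b = json.loads(meta_a), json.loads(meta_b)
    keys = ("chunks", "compressor", "filters")
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Illustrative metadata only (assumed values, not taken from the real files).
contiguous = json.dumps(
    {"shape": [744], "chunks": [744], "dtype": "<i4",
     "compressor": None, "filters": None}
)
chunked = json.dumps(
    {"shape": [744], "chunks": [186], "dtype": "<i4",
     "compressor": {"id": "zlib", "level": 4},
     "filters": [{"id": "shuffle", "elementsize": 4}]}
)

conflicts = zarray_conflicts(contiguous, chunked)
for key, (left, right) in conflicts.items():
    print(f"{key}: {left!r} != {right!r}")
```

Differing chunk shapes can be reconciled by treating the contiguous variable as a set of simulated chunks, but a single .zarray can only declare one compressor and one filter list, which is where the combination breaks down.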
Would it be possible to implement compression and filters as properties of chunks instead of variables? It would be something like:
{
  ".zgroup": "{\n \"zarr_format\": 2\n}",
  ".zattrs": "{\n \"Conventions\": \"UGRID-0.9.0\"\n}",
  "x/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n \"node\"\n ...",
  "x/.zarray": "{\n \"chunks\": [\n 9228245\n ],\n \"dtype\": \"<f8\",\n ...",
  "x/0": {
    "store": "s3://bucket/path/file.nc",
    "location": 294094376,
    "size": 73825960,
    "compressor": {...},
    "filters": [...]
  }
}
Example files:
https://s3.amazonaws.com/era5-pds/2021/06/data/air_pressure_at_mean_sea_level.nc
https://s3.amazonaws.com/era5-pds/2021/07/data/air_pressure_at_mean_sea_level.nc
The zarr API does not currently allow compression/codecs to vary across an array. It is possible that this could be implemented (it is mentioned in the long discussion in https://github.com/zarr-developers/zarr-python/pull/1131 ), in which case the parameters of the codec chain would need to be stored in the
Another possibility, which would be quicker to achieve, is to do per-chunk decoding in the storage layer, so that to zarr the whole array appears "uncompressed", and to store the per-chunk codec specs in the references file. Your snippet hints at this, but we would probably find quickly that the JSON representation gets too bulky and that we need parquet (in which the loading of specific columns can be skipped, along with other advantages). This is very tractable, but it means moving some code and logic from zarr into fsspec, and potentially making referenceFS more zarr-specific, which is a dubious prospect.
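A rough sketch of what per-chunk decoding in the storage layer could look like, using only the standard library. Everything here is hypothetical: the class `PerChunkDecodingStore`, the `DECODERS` registry, and the `fetch` callable are stand-ins and not part of fsspec, zarr, or kerchunk. The reference layout follows the snippet proposed earlier in this thread. Filters are left out for brevity; on read they would be undone after the compressor, in reverse order.

```python
import zlib
from collections.abc import Mapping

# Hypothetical registry mapping a codec spec's "id" to a decode function.
DECODERS = {"zlib": lambda buf, spec: zlib.decompress(buf)}


class PerChunkDecodingStore(Mapping):
    """Read-only mapping that decodes each chunk with its own codec spec."""

    def __init__(self, references, fetch):
        self.references = references
        self.fetch = fetch  # assumed signature: fetch(store, location, size) -> bytes

    def __getitem__(self, key):
        ref = self.references[key]
        if isinstance(ref, str):  # inline metadata such as .zarray
            return ref.encode()
        raw = self.fetch(ref["store"], ref["location"], ref["size"])
        spec = ref.get("compressor")
        if spec is not None:  # chunk-level spec; None means stored raw
            raw = DECODERS[spec["id"]](raw, spec)
        return raw

    def __iter__(self):
        return iter(self.references)

    def __len__(self):
        return len(self.references)


# Demo: one fake "file" holding a compressed chunk and an uncompressed one.
payload = bytes(range(8)) * 4
packed = zlib.compress(payload)
blob = b"\x00" * 16 + packed + payload

references = {
    "x/.zarray": '{"compressor": null, "filters": null}',  # array looks uncompressed
    "x/0": {"store": "mem://file.nc", "location": 16, "size": len(packed),
            "compressor": {"id": "zlib"}},
    "x/1": {"store": "mem://file.nc", "location": 16 + len(packed),
            "size": len(payload), "compressor": None},
}
store = PerChunkDecodingStore(references, lambda s, loc, size: blob[loc:loc + size])
```

Because the array-level metadata declares no compressor, a reader on top of such a store would receive already-decoded bytes for both chunks, regardless of how each one was written in the source file.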