
Compression and filters as properties of chunk instead of variable

zequihg50 opened this issue 2 years ago · 1 comment

It is possible that different netCDF/HDF5 files have different variable configurations. For example, in https://s3.amazonaws.com/era5-pds/2021/05/data/air_pressure_at_mean_sea_level.nc, time0 is contiguous, but in https://s3.amazonaws.com/era5-pds/2021/01/data/air_pressure_at_mean_sea_level.nc it is chunked.

Although it is easy to simulate chunks for a contiguous variable, compression and filters are set once in the time0/.zarray metadata and so apply to every chunk; combining files whose variables are stored with different configurations therefore produces an error.
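For concreteness, here is a minimal sketch (assuming h5py and fsspec with HTTP support are installed) that checks the layout of time0 in the two files above; h5py reports chunks as None for contiguously stored datasets:

import fsspec
import h5py

urls = [
    "https://s3.amazonaws.com/era5-pds/2021/05/data/air_pressure_at_mean_sea_level.nc",
    "https://s3.amazonaws.com/era5-pds/2021/01/data/air_pressure_at_mean_sea_level.nc",
]
for url in urls:
    with fsspec.open(url, "rb") as f, h5py.File(f, "r") as h:
        dset = h["time0"]
        # .chunks is None for contiguous storage, a tuple of chunk sizes otherwise
        print(url, "chunks:", dset.chunks, "compression:", dset.compression)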

Would it be possible to implement compression and filters as properties of chunks instead of variables? It would be something like:

{
  ".zgroup": "{\n    \"zarr_format\": 2\n}",
  ".zattrs": "{\n    \"Conventions\": \"UGRID-0.9.0\n\"}",
  "x/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"node\"\n ...",
  "x/.zarray": "{\n    \"chunks\": [\n        9228245\n    ],\n   \"dtype\": \"<f8\",\n  ...",
  "x/0": {
    "store": "s3://bucket/path/file.nc",
    "location": 294094376,
    "size": 73825960,
    "compressor": {...},
    "filters": [...]
  }
}
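To illustrate how a reader could consume such an entry, here is a hypothetical sketch using numcodecs (the codec registry zarr itself uses). The compressor and filter configs below are invented placeholders, since the entry above elides them, and the store URL is the placeholder from the example:

import fsspec
import numcodecs
import numpy as np

ref = {
    "store": "s3://bucket/path/file.nc",
    "location": 294094376,
    "size": 73825960,
    "compressor": {"id": "zlib", "level": 4},          # invented placeholder
    "filters": [{"id": "shuffle", "elementsize": 8}],  # invented placeholder
}

# read the raw chunk bytes from the byte range given in the reference
with fsspec.open(ref["store"], "rb") as f:
    f.seek(ref["location"])
    raw = f.read(ref["size"])

# decompress, then undo the filters in reverse of their encode order
buf = numcodecs.get_codec(ref["compressor"]).decode(raw)
for spec in reversed(ref["filters"]):
    buf = numcodecs.get_codec(spec).decode(buf)

chunk = np.frombuffer(buf, dtype="<f8")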

Example files:

https://s3.amazonaws.com/era5-pds/2021/06/data/air_pressure_at_mean_sea_level.nc
https://s3.amazonaws.com/era5-pds/2021/07/data/air_pressure_at_mean_sea_level.nc

zequihg50 · Nov 23 '22 18:11

The zarr API does not currently allow for compression/codecs to vary across an array. It is possible that it could be implemented (mentioned in the long discussion in https://github.com/zarr-developers/zarr-python/pull/1131 ), in which case the parameters of the codec chain would need to be stored per chunk in the array metadata.

Another possibility, which would be quicker to achieve, would be to do per-chunk decoding in the storage layer, so that to zarr the whole array appears "uncompressed", and we store the per-chunk codec specs in the references file. Your snippet hints at this, but we would probably quickly find that the JSON representation gets too bulky and we need parquet (in which the loading of specific columns can be skipped, along with other advantages). This is very tractable, but it amounts to moving some code and logic from zarr into fsspec, as well as potentially making referenceFS more zarr-specific, which is a dubious prospect.
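As a rough sketch of that idea (not kerchunk's or fsspec's actual API; PerChunkDecodingStore and its arguments are invented for illustration), a read-only store wrapper could decode each chunk with its own codec chain, so that the .zarray can declare "compressor": null:

from collections.abc import MutableMapping
import numcodecs

class PerChunkDecodingStore(MutableMapping):
    """Wraps a key -> raw-bytes mapping; decodes each chunk with its own codecs."""

    def __init__(self, base_store, chunk_codecs):
        self.base = base_store      # e.g. an fsspec mapper over the references
        self.codecs = chunk_codecs  # key -> list of numcodecs config dicts

    def __getitem__(self, key):
        data = self.base[key]
        # apply this chunk's own codec chain, if any, before zarr sees the bytes
        for spec in self.codecs.get(key, []):
            data = numcodecs.get_codec(spec).decode(data)
        return bytes(data)

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only view")

    def __delitem__(self, key):
        raise NotImplementedError("read-only view")

    def __iter__(self):
        return iter(self.base)

    def __len__(self):
        return len(self.base)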

martindurant · Nov 23 '22 18:11