
Schema inconsistency across backends - some use `refs` as the top-level key, some don't

Open TomNicholas opened this issue 7 months ago • 2 comments

(Raising something we found a while ago here, original issue is https://github.com/zarr-developers/VirtualiZarr/issues/160#issuecomment-2189907601)

tl;dr: There's a schema inconsistency across kerchunk backends - some use refs as the top-level key, some don't.

The output of kerchunk.tiff.tiff_to_zarr(url) looks like

{
  '.zgroup': '{\n "zarr_format": 2\n}',
  '.zattrs': '{"multiscales":[{"datasets":[{"path":"0"},{"path":"1"},{"path":"2"}],"metadata":{},"name":"","version":"0.1"}],"OVR_RESAMPLING_ALG":"NEAREST","LAYOUT":"IFDS_BEFORE_DATA","BLOCK_ORDER":"ROW_MAJOR","BLOCK_LEADER":"SIZE_AS_UINT4","BLOCK_TRAILER":"LAST_4_BYTES_REPEATED","KNOWN_INCOMPATIBLE_EDITION":"NO","KeyDirectoryVersion":1,"KeyRevision":1,"KeyRevisionMinor":0,"GTModelTypeGeoKey":1,"GTRasterTypeGeoKey":1,"GTCitationGeoKey":"Albers","GeographicTypeGeoKey":4326,"GeogCitationGeoKey":"WGS 84","GeogAngularUnitsGeoKey":9102,"GeogSemiMajorAxisGeoKey":6378140.0,"GeogInvFlatteningGeoKey":298.256999999996,"ProjectedCSTypeGeoKey":32767,"ProjectionGeoKey":32767,"ProjCoordTransGeoKey":11,"ProjLinearUnitsGeoKey":9001,"ProjStdParallel1GeoKey":29.5,"ProjStdParallel2GeoKey":45.5,"ProjNatOriginLongGeoKey":-96.0,"ProjNatOriginLatGeoKey":23.0,"ProjFalseEastingGeoKey":0.0,"ProjFalseNorthingGeoKey":0.0,"ModelPixelScale":[30.0,30.0,0.0],"ModelTiepoint":[0.0,0.0,0.0,-1801185.0,2700405.0,0.0]}',
  '0/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n  "Y",\n  "X"\n ]\n}',
  '0/.zarray': '{\n "chunks": [\n  512,\n  512\n ],\n "compressor": {\n  "id": "zlib"\n },\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  2048,\n  2048\n ],\n "zarr_format": 2\n}',
  ...,
}

It looks like this is not the same structure that e.g. `kerchunk.hdf.SingleHdf5ToZarr` returns.

What virtualizarr expects (and what the kerchunk docs promise...) is that the keys of the outermost dictionary are 'refs' and 'version'. This kerchunk.tiff.tiff_to_zarr(url) function seems to have jumped straight to giving us the contents that would normally be underneath the 'refs' key.

This is an inconsistency in the schema, and an example of kerchunk not obeying its own specification. It also seems to provide no benefit as far as I can tell.

In VirtualiZarr we simply worked around it by special-casing tiffs to add that top-level {'refs': ...} ourselves (so this is not at all urgent for us, I'm just raising this for completeness), but in theory it should really be fixed here. It would be a breaking change for kerchunk though.
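For illustration, the workaround amounts to roughly the sketch below. This is not the actual VirtualiZarr code: the helper name `ensure_refs_wrapper` is hypothetical, and the default `version=1` is an assumption based on the version-1 reference layout described in the kerchunk docs. It only uses plain dicts, so it does not depend on any particular kerchunk backend.

```python
from collections.abc import Mapping
from typing import Any


def ensure_refs_wrapper(refs: Mapping[str, Any], version: int = 1) -> dict[str, Any]:
    """Return references in the {'version': ..., 'refs': {...}} layout.

    Hypothetical helper: accepts either the wrapped layout or the flat
    layout (zarr keys at the top level) and always returns the wrapped one.
    """
    # Already wrapped: the top level has a 'refs' key holding a mapping.
    if isinstance(refs.get("refs"), Mapping):
        wrapped = dict(refs)
        # Assumption: fill in a version only if the backend omitted it.
        wrapped.setdefault("version", version)
        return wrapped

    # Flat layout: keys such as '.zgroup' or '0/.zarray' sit at the top level,
    # so nest the whole mapping under 'refs'.
    return {"version": version, "refs": dict(refs)}


# Flat output, shaped like what tiff_to_zarr returns in the example above.
flat = {
    ".zgroup": '{\n "zarr_format": 2\n}',
    "0/.zattrs": '{\n "_ARRAY_DIMENSIONS": [\n  "Y",\n  "X"\n ]\n}',
}
wrapped = ensure_refs_wrapper(flat)

# Passing already-wrapped references through leaves them unchanged.
assert ensure_refs_wrapper(wrapped) == wrapped
```

A store containing a group literally named `refs` would show up as keys like `refs/.zgroup` rather than a bare `refs` key, so the check above should not misfire on it, but that is worth verifying against real output.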

TomNicholas • Jun 17 '25 10:06