Custom codec information is silently dropped when writing kerchunk references
I encountered this bug when working on a custom GRIB parser in https://github.com/MeteoSwiss/icon-ch-vzarr.
When trying to serialize virtual datasets as kerchunk references, the custom codec information is silently dropped. The array metadata looks like this (note the codecs entry):
ArrayV3Metadata(shape=(1, 1147980),
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(1, 1147980)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
fill_value=np.float64(0.0),
codecs=(EccodesCodec(),),
attributes={'_earthkit': {'b64message': 'R1JJQv//AAIAAAAAAAAAqgAAABUBANcA/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0AwAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAACIEAAAAAAAAAgCXAAAAAAAAAABnAAAAAAL///////8AAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=',
'bitsPerValue': 16},
'long_name': '2m Temperature',
'standard_name': 'air_temperature',
'units': 'K'},
dimension_names=('valid_time', 'values'),
zarr_format=3,
node_type='array',
storage_transformers=())
However, the codec information (filters and compressor) does not make it into the references; both fields are null:
{"version":1,"refs":{".zgroup":"{\"zarr_format\":2}",".zattrs":"{}","T_2M\/0.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010120",3424926372,2296130],"T_2M\/1.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010121",3427222332,2296130],"T_2M\/2.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010122",3424926372,2296130],"T_2M\/3.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010123",3424926372,2296130],"T_2M\/.zarray":"{\"shape\":[4,1147980],\"chunks\":[1,1147980],\"dtype\":\"<f8\",\"fill_value\":0.0,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","T_2M\/.zattrs":"{\"standard_name\":\"air_temperature\",\"long_name\":\"2m Temperature\",\"units\":\"K\",\"_earthkit\":{\"bitsPerValue\":16,\"b64message\":\"R1JJQv\/\/AAIAAAAAAAAAqgAAABUBANcA\/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0AwAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAACIEAAAAAAAAAgCXAAAAAAAAAABnAAAAAAL\/\/\/\/\/\/\/8AAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=\"},\"_ARRAY_DIMENSIONS\":[\"valid_time\",\"values\"]}","CLCL\/0.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010120",3721779870,2296130],"CLCL\/1.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010121",3724075830,2296130],"CLCL\/2.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010122",3721779870,2296130],"CLCL\/3.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010123",3721779870,2296130],"CLCL\/.zarray":"{\"shape\":[4,1147980],\"chunks\":[1,1147980],\"dtype\":\"<f8\",\"fill_value\":0.0,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","CLCL\/.zattrs":"{\"standard_name\":\"unknown\",\"long_name\":\"Cloud Cover (800 hPa - 
Soil)\",\"units\":\"%\",\"_earthkit\":{\"bitsPerValue\":16,\"b64message\":\"R1JJQv\/\/AAIAAAAAAAAAqgAAABUBANcA\/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0BAAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAACIEAAAAAAYWAgCXAAAAAAAAAABkAAABOIABAAAAAAAAAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=\"},\"_ARRAY_DIMENSIONS\":[\"valid_time\",\"values\"]}","TOT_PREC\/0.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010120",3792671464,194],"TOT_PREC\/1.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010121",3794978942,194],"TOT_PREC\/2.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010122",3792683114,194],"TOT_PREC\/3.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010123",3792680428,194],"TOT_PREC\/.zarray":"{\"shape\":[4,1147980],\"chunks\":[1,1147980],\"dtype\":\"<f8\",\"fill_value\":0.0,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","TOT_PREC\/.zattrs":"{\"standard_name\":\"unknown\",\"long_name\":\"Total Precipitation (Accumulation)\",\"units\":\"kg m-2\",\"_earthkit\":{\"bitsPerValue\":0,\"b64message\":\"R1JJQv\/\/AAIAAAAAAAAAwgAAABUBANcA\/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0BQAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAADoEAAAACAE0AgCXAAAAAAAAAAABAAAAAAD\/\/\/\/\/\/\/8H6QEBFAAAAQAAAAABAgAAAAAA\/wAAAAAAAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=\"},\"_ARRAY_DIMENSIONS\":[\"valid_time\",\"values\"]}","valid_time\/0":"base64:AAAAAAAAAAABAAAAAAAAAAIAAAAAAAAAAwAAAAAAAAA=","valid_time\/.zarray":"{\"shape\":[4],\"chunks\":[4],\"dtype\":\"<i8\",\"fill_value\":null,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","valid_time\/.zattrs":"{\"units\":\"hours since 2025-01-01 20:00:00\",\"calendar\":\"proleptic_gregorian\",\"_ARRAY_DIMENSIONS\":[\"valid_time\"]}"}}
The issue seems to be here: https://github.com/zarr-developers/VirtualiZarr/blob/f3149d6464fa2e88c01c71d03170993f93bb3e8c/virtualizarr/utils.py#L138
If my custom codec is a subclass of ArrayBytesCodec, it is excluded. There is also a TODO left there that may refer to this.
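The suspected behaviour can be modelled without zarr installed. The sketch below uses plain Python stand-in classes (they are not the real zarr or VirtualiZarr classes, and v2_codec_names is a hypothetical helper) to show what happens when a v3-to-v2 conversion filters out everything that is an array-to-bytes codec: a custom decoder of that kind vanishes and nothing is left to write as filters or compressor.

```python
class ArrayBytesCodec:
    """Stand-in for zarr's array-to-bytes codec base class."""


class BytesBytesCodec:
    """Stand-in for a bytes-to-bytes (compression) codec base class."""


class EccodesCodec(ArrayBytesCodec):
    """Stand-in for the custom GRIB decoder from this issue."""


class Zstd(BytesBytesCodec):
    """Stand-in for an ordinary compressor."""


def v2_codec_names(codecs):
    """Mimic a conversion that skips every array-to-bytes codec."""
    return [
        type(codec).__name__
        for codec in codecs
        if not isinstance(codec, ArrayBytesCodec)
    ]


# Only the custom codec in the pipeline: nothing survives the filter.
print(v2_codec_names([EccodesCodec()]))
# With an extra compressor: the compressor survives, the custom codec does not.
print(v2_codec_names([EccodesCodec(), Zstd()]))
```

If this model matches what the linked code does, then excluding the default bytes serializer is harmless, but excluding a codec that actually decodes the payload loses information the reader needs.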
I added a reproducible example here (run it with uv run virtualize_kenda.py): https://gist.github.com/frazane/d26fd8925aea11cadf5bb012d81c5c2e
Thanks for raising this @frazane !
We clearly need to pass this information in.
Just to clarify - it looks like the part of the VirtualiZarr code you're referring to is for non-virtual variables (i.e. not ManifestArrays). Is that what you meant?
The next step here would be to create a minimum reproducible example, ideally by just artificially creating some data with a single compressor or filter.
@TomNicholas
Just to clarify - it looks like the part of the VirtualiZarr code you're referring to is for non-virtual variables (i.e. not ManifestArrays). Is that what you meant?
Oops, no, that was a mistake. The issue is elsewhere, likely here: https://github.com/zarr-developers/VirtualiZarr/blob/f3149d6464fa2e88c01c71d03170993f93bb3e8c/virtualizarr/utils.py#L138
I added a "minimal" reproducible example here: https://gist.github.com/frazane/d26fd8925aea11cadf5bb012d81c5c2e. You can run it directly with uv run virtualize_kenda.py.
When I try to run that example I get this error:
File "/Users/tom/.cache/uv/environments-v2/virtualize-kenda-kerchunk-fffb73731867bdee/lib/python3.12/site-packages/zarr/codecs/_v2.py", line 51, in _decode_single
chunk = chunk.view(chunk_spec.dtype.to_native_dtype())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
Is that what you expect me to see?
FYI @frazane I would like to fix this but I'm going to release VZ now without this fix just because there are multiple other important fixes to get out there.
Is that what you expect me to see?
Yes, that is the same exception I get.
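One plausible reading of this exception is that, with no codec recorded, the referenced bytes are reinterpreted directly as float64. The arithmetic below uses only numbers that appear in the references above (the 2296130-byte length of T_2M/0.0 and the (1, 1147980) chunk shape); the variable names are illustrative.

```python
import struct

stored_nbytes = 2296130            # length of the "T_2M/0.0" reference
itemsize = struct.calcsize("<d")   # 8 bytes per value for the "<f8" dtype
chunk_elements = 1 * 1147980       # chunk shape (1, 1147980) from .zarray

# Bytes a fully decoded float64 chunk would occupy.
expected_nbytes = chunk_elements * itemsize
# A raw reinterpretation as float64 needs the length to be a multiple of 8.
remainder = stored_nbytes % itemsize

print(expected_nbytes)  # 9183840
print(remainder)        # 2
```

The stored length is neither the decoded size nor a multiple of 8, which matches the complaint in the traceback about the dtype size not dividing the total size in bytes.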
Hi @TomNicholas, is there any news regarding this issue? (I don't need an urgent fix; I just happened to remember it.)