Custom codec information is silently dropped when writing kerchunk references

Open frazane opened this issue 5 months ago • 6 comments

I encountered this bug when working on a custom GRIB parser in https://github.com/MeteoSwiss/icon-ch-vzarr.

When trying to serialize virtual datasets as kerchunk references, the custom codec information is silently dropped. The array metadata looks like this (note the codecs entry):

ArrayV3Metadata(shape=(1, 1147980),
                data_type=Float64(endianness='little'),
                chunk_grid=RegularChunkGrid(chunk_shape=(1, 1147980)),
                chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
                                                           separator='/'),
                fill_value=np.float64(0.0),
                codecs=(EccodesCodec(),),
                attributes={'_earthkit': {'b64message': 'R1JJQv//AAIAAAAAAAAAqgAAABUBANcA/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0AwAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAACIEAAAAAAAAAgCXAAAAAAAAAABnAAAAAAL///////8AAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=',
                                          'bitsPerValue': 16},
                            'long_name': '2m Temperature',
                            'standard_name': 'air_temperature',
                            'units': 'K'},
                dimension_names=('valid_time', 'values'),
                zarr_format=3,
                node_type='array',
                storage_transformers=())

but then the codecs (filter/compressors) information is not found in the references (it's null):

{"version":1,"refs":{".zgroup":"{\"zarr_format\":2}",".zattrs":"{}","T_2M\/0.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010120",3424926372,2296130],"T_2M\/1.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010121",3427222332,2296130],"T_2M\/2.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010122",3424926372,2296130],"T_2M\/3.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010123",3424926372,2296130],"T_2M\/.zarray":"{\"shape\":[4,1147980],\"chunks\":[1,1147980],\"dtype\":\"<f8\",\"fill_value\":0.0,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","T_2M\/.zattrs":"{\"standard_name\":\"air_temperature\",\"long_name\":\"2m Temperature\",\"units\":\"K\",\"_earthkit\":{\"bitsPerValue\":16,\"b64message\":\"R1JJQv\/\/AAIAAAAAAAAAqgAAABUBANcA\/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0AwAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAACIEAAAAAAAAAgCXAAAAAAAAAABnAAAAAAL\/\/\/\/\/\/\/8AAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=\"},\"_ARRAY_DIMENSIONS\":[\"valid_time\",\"values\"]}","CLCL\/0.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010120",3721779870,2296130],"CLCL\/1.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010121",3724075830,2296130],"CLCL\/2.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010122",3721779870,2296130],"CLCL\/3.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010123",3721779870,2296130],"CLCL\/.zarray":"{\"shape\":[4,1147980],\"chunks\":[1,1147980],\"dtype\":\"<f8\",\"fill_value\":0.0,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","CLCL\/.zattrs":"{\"standard_name\":\"unknown\",\"long_name\":\"Cloud Cover (800 hPa - 
Soil)\",\"units\":\"%\",\"_earthkit\":{\"bitsPerValue\":16,\"b64message\":\"R1JJQv\/\/AAIAAAAAAAAAqgAAABUBANcA\/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0BAAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAACIEAAAAAAYWAgCXAAAAAAAAAABkAAABOIABAAAAAAAAAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=\"},\"_ARRAY_DIMENSIONS\":[\"valid_time\",\"values\"]}","TOT_PREC\/0.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010120",3792671464,194],"TOT_PREC\/1.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010121",3794978942,194],"TOT_PREC\/2.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010122",3792683114,194],"TOT_PREC\/3.0":["\/store_new\/mch\/msopr\/osm\/KENDA-CH1\/ANA25\/det\/iaf2025010123",3792680428,194],"TOT_PREC\/.zarray":"{\"shape\":[4,1147980],\"chunks\":[1,1147980],\"dtype\":\"<f8\",\"fill_value\":0.0,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","TOT_PREC\/.zattrs":"{\"standard_name\":\"unknown\",\"long_name\":\"Total Precipitation (Accumulation)\",\"units\":\"kg m-2\",\"_earthkit\":{\"bitsPerValue\":0,\"b64message\":\"R1JJQv\/\/AAIAAAAAAAAAwgAAABUBANcA\/w8BAQfpAQEUAAABAQAAABwCAP4AB+kBARU0BQAAAAAAAAAAAAAAAAEAAAAjAwAAEYRMAAAAZQYAAAEBF2Q9oldJWbZE0lSjzW4rwAAAADoEAAAACAE0AgCXAAAAAAAAAAABAAAAAAD\/\/\/\/\/\/\/8H6QEBFAAAAQAAAAABAgAAAAAA\/wAAAAAAAAAVBQAAAfAAAEOIgACACgAAAAAAAAAGBv8AAAAFBzc3Nzc=\"},\"_ARRAY_DIMENSIONS\":[\"valid_time\",\"values\"]}","valid_time\/0":"base64:AAAAAAAAAAABAAAAAAAAAAIAAAAAAAAAAwAAAAAAAAA=","valid_time\/.zarray":"{\"shape\":[4],\"chunks\":[4],\"dtype\":\"<i8\",\"fill_value\":null,\"order\":\"C\",\"filters\":null,\"dimension_separator\":\".\",\"compressor\":null,\"attributes\":{},\"zarr_format\":2}","valid_time\/.zattrs":"{\"units\":\"hours since 2025-01-01 20:00:00\",\"calendar\":\"proleptic_gregorian\",\"_ARRAY_DIMENSIONS\":[\"valid_time\"]}"}}
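A quick way to spot the problem in a references file is to list every array whose `.zarray` entry records neither filters nor a compressor. The sketch below uses only the standard library; the helper name `arrays_without_codecs` is made up for illustration, and the sample is a trimmed copy of the `T_2M` entries above. Note that uncompressed arrays (such as `valid_time` here) legitimately have both fields set to null, so a hit is only a hint, not proof of a dropped codec.

```python
import json


def arrays_without_codecs(references: dict) -> list[str]:
    """Return the arrays whose .zarray has neither filters nor a compressor."""
    missing = []
    for key, value in references["refs"].items():
        if not key.endswith("/.zarray"):
            continue
        # Kerchunk stores each metadata document as a JSON-encoded string.
        zarray = json.loads(value)
        if zarray.get("filters") is None and zarray.get("compressor") is None:
            missing.append(key.rsplit("/", 1)[0])
    return missing


# Trimmed copy of the references from this issue (one variable, one chunk).
references = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format":2}',
        "T_2M/0.0": [
            "/store_new/mch/msopr/osm/KENDA-CH1/ANA25/det/iaf2025010120",
            3424926372,
            2296130,
        ],
        "T_2M/.zarray": json.dumps(
            {
                "shape": [4, 1147980],
                "chunks": [1, 1147980],
                "dtype": "<f8",
                "fill_value": 0.0,
                "order": "C",
                "filters": None,
                "dimension_separator": ".",
                "compressor": None,
                "zarr_format": 2,
            }
        ),
    },
}

print(arrays_without_codecs(references))
```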

The issue seems to be here: https://github.com/zarr-developers/VirtualiZarr/blob/f3149d6464fa2e88c01c71d03170993f93bb3e8c/virtualizarr/utils.py#L138. If my custom codec is a subclass of ArrayBytesCodec, it will be excluded. There is also a TODO left there which may refer to this.
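To illustrate the mechanism described here, the following sketch uses stand-in classes rather than the real zarr or VirtualiZarr ones. The function `kept_for_v2` is hypothetical and only mimics a filter that skips every `ArrayBytesCodec` instance; the code at the linked line may differ in detail.

```python
class ArrayBytesCodec:
    """Stand-in for the zarr base class of array-to-bytes codecs."""


class EccodesCodec(ArrayBytesCodec):
    """Stand-in for the custom GRIB codec from this issue."""


def kept_for_v2(codecs):
    # Hypothetical filter: anything that is an ArrayBytesCodec is skipped,
    # and isinstance() also matches subclasses such as EccodesCodec.
    return [codec for codec in codecs if not isinstance(codec, ArrayBytesCodec)]


codecs = (EccodesCodec(),)
print(kept_for_v2(codecs))  # the custom codec does not survive the filter
```

With a filter of this shape, an array whose only codec is the custom one ends up with nothing to write into the v2 `filters` or `compressor` fields.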


I added a reproducible example here (run it with uv run virtualize_kenda.py): https://gist.github.com/frazane/d26fd8925aea11cadf5bb012d81c5c2e

frazane, Aug 12 '25 08:08

Thanks for raising this @frazane !

We clearly need to pass this information in.

Just to clarify - it looks like the part of the VirtualiZarr code you're referring to is for non-virtual variables (i.e. not ManifestArrays). Is that what you meant?

The next step here would be to create a minimal reproducible example, ideally by just artificially creating some data with a single compressor or filter.

TomNicholas, Aug 12 '25 20:08

@TomNicholas

> Just to clarify - it looks like the part of the VirtualiZarr code you're referring to is for non-virtual variables (i.e. not ManifestArrays). Is that what you meant?

Oops, no, that was a mistake. The issue is elsewhere, likely here: https://github.com/zarr-developers/VirtualiZarr/blob/f3149d6464fa2e88c01c71d03170993f93bb3e8c/virtualizarr/utils.py#L138

I added a "minimal" reproducible example here: https://gist.github.com/frazane/d26fd8925aea11cadf5bb012d81c5c2e. You can run it directly with uv run virtualize_kenda.py.

frazane, Aug 13 '25 09:08

When I try to run that example I get this error:

  File "/Users/tom/.cache/uv/environments-v2/virtualize-kenda-kerchunk-fffb73731867bdee/lib/python3.12/site-packages/zarr/codecs/_v2.py", line 51, in _decode_single
    chunk = chunk.view(chunk_spec.dtype.to_native_dtype())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.

Is that what you expect me to see?
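One plausible reading of this error, which the thread does not confirm, is that the stored GRIB bytes are being reinterpreted as `<f8` without being decoded first, because no codec was recorded. The arithmetic below uses the numbers from the `T_2M` references above: the stored chunk is far smaller than a decoded chunk, and its length is not a multiple of 8, so it cannot be viewed as float64.

```python
from math import prod

chunk_shape = (1, 1147980)  # "chunks" in T_2M/.zarray
itemsize = 8                # bytes per value for dtype "<f8"
stored_length = 2296130     # length field of the T_2M/0.0 reference

# Size a fully decoded chunk would have in memory.
decoded_length = prod(chunk_shape) * itemsize
print(decoded_length)            # 9183840

# A raw buffer can only be viewed as float64 if its length divides evenly.
print(stored_length % itemsize)  # 2
```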

TomNicholas, Aug 14 '25 14:08

FYI @frazane I would like to fix this but I'm going to release VZ now without this fix just because there are multiple other important fixes to get out there.

TomNicholas, Aug 14 '25 14:08

> Is that what you expect me to see?

Yes, that is the same exception I get.

frazane, Aug 15 '25 07:08

Hi @TomNicholas, is there any news regarding this issue? (I don't need an urgent fix, I just happened to remember it.)

frazane, Nov 29 '25 13:11