
Process/Decode Chunk Issue

Open dwest77a opened this issue 1 year ago • 5 comments

I have some NetCDF UKCP data with a variable called "yyyymmdd" that is stored in the Kerchunk file like so:

"yyyymmdd/.zarray": "{\"chunks\":[1,64],\"compressor\":null,\"dtype\":\"|S1\",\"fill_value\":\"IA==\",\"filters\":null,\"order\":\"C\",\"shape\":[3600,64],\"zarr_format\":2}",
"yyyymmdd/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"time\",\"string64\"],\"long_name\":\"yyyymmdd\",\"units\":\"1\"}",
"yyyymmdd/0.0": "19801201",
"yyyymmdd/1.0": "19801202",
"yyyymmdd/2.0": "19801203",
"yyyymmdd/3.0": "19801204",
"yyyymmdd/4.0": "19801205",
"yyyymmdd/5.0": "19801206",
"yyyymmdd/6.0": "19801207",
"yyyymmdd/7.0": "19801208",
"yyyymmdd/8.0": "19801209",
"yyyymmdd/9.0": "19801210",

When decoded I get the error message: cannot reshape array of size 8 into shape (1,64)

I think this is because the part of Zarr that decodes the chunk expects a base64-encoded array rather than an 8-character string. Either that, or the dimension/chunk/shape is being interpreted incorrectly. How should an array like this be interpreted within Zarr and decoded into a 1 by 64 array when each chunk is an 8-character string?

dwest77a, Feb 14 '24 13:02

The dtype says that this is a 1-char per element field, and there are 8 characters in each entry. The chunk shape is 1,64 - so zarr is right to error. Actually, the values look like filenames, no?
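The mismatch described above can be reproduced with NumPy alone. This is a minimal sketch, not the Zarr code path itself: it takes the dtype and chunk shape from the .zarray entry and the bytes from the "yyyymmdd/0.0" entry. The padding step at the end is an assumption made for illustration (it uses a space, which is what the fill value "IA==" decodes to), not a confirmed fix.

```python
import numpy as np

chunk_shape = (1, 64)   # "chunks" from yyyymmdd/.zarray
raw = b"19801201"       # bytes stored inline under yyyymmdd/0.0

# With dtype "|S1" every byte is one element, so there are only 8 elements.
elements = np.frombuffer(raw, dtype="|S1")

try:
    elements.reshape(chunk_shape)   # needs 1 * 64 = 64 elements
    reshape_error = None
except ValueError as exc:
    reshape_error = str(exc)
print(reshape_error)

# Assumption for illustration: pad the bytes with spaces up to 64 bytes,
# after which the reshape to the declared chunk shape succeeds.
padded = raw.ljust(64, b" ")
chunk = np.frombuffer(padded, dtype="|S1").reshape(chunk_shape)
print(chunk.shape)
```

The first print shows the same kind of reshape failure reported in the issue; the second shows that a full 64-byte chunk fits the declared shape.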

base64 would only require more characters for the same output, so I don't think that's it (although the fill value is suggestive).
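A quick check of both points, using only the standard library: base64 makes the 8-byte value longer rather than shorter, and the fill value "IA==" from the .zarray entry decodes to a single space. This is a sketch for illustration only.

```python
import base64

raw = b"19801201"
encoded = base64.b64encode(raw)

# 8 raw bytes become 12 base64 characters, still far short of 64.
print(len(raw), len(encoded))

# The fill value in .zarray is base64 text; decoded, it is one space byte.
fill = base64.b64decode("IA==")
print(fill)
```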

martindurant, Feb 14 '24 14:02

I think this set of NetCDFs just has an extra string field for the date (which is unnecessary but still something that the data provider included). The actual files look like huss_rcp85_land-rcm_uk_12km_01_day_19801201-19901130.nc

Each entry has 8 characters and is treated as its own chunk. Each chunk is decoded in zarr's Array._process_chunk, which may be unnecessary since these chunks are not base64 encoded. Should this step be skipped for this dtype?
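For context, here is a rough sketch of how an inline value in a reference file becomes bytes, as I understand the Kerchunk reference format: a string prefixed with "base64:" is base64-decoded, and any other string is used as plain text. The helper name inline_reference_to_bytes is hypothetical and is not part of kerchunk, fsspec, or zarr.

```python
import base64

PREFIX = "base64:"

def inline_reference_to_bytes(value: str) -> bytes:
    """Hypothetical helper: turn an inline reference string into raw bytes."""
    if value.startswith(PREFIX):
        # Prefixed values carry base64 text after the prefix.
        return base64.b64decode(value[len(PREFIX):])
    # Unprefixed values are taken as plain text.
    return value.encode()

plain = inline_reference_to_bytes("19801201")
prefixed = inline_reference_to_bytes(
    PREFIX + base64.b64encode(b"19801201").decode()
)
print(plain, prefixed, len(plain))
```

Under that assumption, both forms give the same 8 bytes, so the shortfall against the 64-element chunk would be present whether or not the value is base64 encoded.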

dwest77a, Feb 14 '24 14:02

The file should be accessible at https://data.ceda.ac.uk/badc/ukcp18/data/land-rcm/uk/12km/rcp85/01/huss/day/v20190731/huss_rcp85_land-rcm_uk_12km_01_day_19801201-19901130.nc if you want to try converting it.

dwest77a, Feb 14 '24 14:02

I got the file, but I won't have time to look until at least tomorrow.

martindurant, Feb 14 '24 14:02

No problem, this isn't particularly time-sensitive for me at the moment. Thanks for taking the time!

dwest77a, Feb 14 '24 14:02