Broken `fill_value` encoding in `consolidated_metadata` when writing format v2 from zarr-python 3
Zarr version
3.0.7.dev8+g018f61d9
Numcodecs version
0.15.1
Python Version
3.12
Operating System
mac
Installation
from source
Description
I think this is similar to https://github.com/zarr-developers/zarr-python/issues/2741 but for consolidated metadata. Namely that bytes as fill_value for zarr format 2 aren't properly encoded/decoded in consolidated metadata.
Steps to reproduce
import zarr
import json
fill_value = b'A'
group = zarr.group(store='blah.zarr', zarr_format=2, overwrite=True)
array = group.create(name ='x', shape=(3,), dtype='|S4', fill_value=fill_value,)
array[:] = values
zarr.consolidate_metadata("blah.zarr")
# if we load it then the fill value no longer matches
consolidated = zarr.open_group("blah.zarr")
cons_fill = consolidated.metadata.consolidated_metadata.metadata['x'].fill_value
# also directly inspect the non consolidated metadata
array_fill = zarr.open_array("blah.zarr/x").metadata.fill_value
print("Loaded Values")
print(f"array:\t {array_fill}")
print(f"consolidated: \t{cons_fill}")
print('\n')
# directly inspect the actual values on disk
print("On Disk values:")
with open("blah.zarr/x/.zarray") as f:
print(".zarray:", json.load(f)['fill_value'])
with open("blah.zarr/.zmetadata") as f:
print("consolidated/x/.zarray:", json.load(f)['metadata']['x/.zarray']['fill_value'])
outputs:
Loaded Values
array: b'A'
consolidated: [b'65']
On Disk values:
.zarray: QQ==
consolidated/x/.zarray: [65]
Additional output
No response
A datapoint here is that if you base64 decode the on disk value QQ==, you get b'A':
In [1]: import base64
In [2]: base64.b64decode("QQ==")
Out[2]: b'A'
I'm not planing on looking into this further but putting down here the notes from debugging. The incorrect encoding happens in the to_dict function which is defined here: https://github.com/zarr-developers/zarr-python/blob/5f49d24eb7ff1a463d954bdb5b3d14330b86cda5/src/zarr/abc/metadata.py#L18-L37
with the issue being that the bytes were falling into the Sequence branch. A diff like this seems to fix that issue:
diff --git a/src/zarr/abc/metadata.py b/src/zarr/abc/metadata.py
index a56f9866..64ed6f4d 100644
--- a/src/zarr/abc/metadata.py
+++ b/src/zarr/abc/metadata.py
@@ -28,7 +28,7 @@ class Metadata:
value = getattr(self, key)
if isinstance(value, Metadata):
out_dict[field.name] = getattr(self, field.name).to_dict()
- elif isinstance(value, str):
+ elif isinstance(value, str | bytes):
out_dict[key] = value
elif isinstance(value, Sequence):
out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value)
but then there is an error in this block:
https://github.com/zarr-developers/zarr-python/blob/5f49d24eb7ff1a463d954bdb5b3d14330b86cda5/src/zarr/core/group.py#L372-L376
when ther V3JsonEncoder doesn't know how to process bytes. It seems like this ought to be a V2Encoder but there is no such object. A diff like this gets most of the way, but then there are other errors in the test suite that I didn't track down.
../../../miniforge3/lib/python3.12/json/encoder.py:439: in _iterencode
diff --git a/src/zarr/core/group.py b/src/zarr/core/group.py
index da2aa5f7..26a1c0db 100644
--- a/src/zarr/core/group.py
+++ b/src/zarr/core/group.py
@@ -369,6 +369,12 @@ class GroupMetadata(Metadata):
},
}
+ if "fill_value" in v:
+ from zarr.core.metadata.v2 import _serialize_fill_value
+
+ d[f"{k}/{ZARRAY_JSON}"]["fill_value"] = _serialize_fill_value(
+ v["fill_value"], np.dtype(v["dtype"])
+ )
items[ZMETADATA_V2_JSON] = prototype.buffer.from_bytes(
json.dumps(
{"metadata": d, "zarr_consolidated_format": 1},
I think this will be fixed by #2874