zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

Broken `fill_value` encoding in `consolidated_metadata` when writing format v2 from zarr-python 3

Open ianhi opened this issue 9 months ago • 3 comments

Zarr version

3.0.7.dev8+g018f61d9

Numcodecs version

0.15.1

Python Version

3.12

Operating System

mac

Installation

from source

Description

I think this is similar to https://github.com/zarr-developers/zarr-python/issues/2741 but for consolidated metadata. Namely that bytes as fill_value for zarr format 2 aren't properly encoded/decoded in consolidated metadata.

Steps to reproduce

import zarr
import json

fill_value = b'A'
group = zarr.group(store='blah.zarr', zarr_format=2, overwrite=True)
array = group.create(name ='x', shape=(3,), dtype='|S4', fill_value=fill_value,)
array[:] = values
zarr.consolidate_metadata("blah.zarr")

# if we load it then the fill value no longer matches
consolidated = zarr.open_group("blah.zarr")
cons_fill = consolidated.metadata.consolidated_metadata.metadata['x'].fill_value

# also directly inspect the non consolidated metadata
array_fill = zarr.open_array("blah.zarr/x").metadata.fill_value
print("Loaded Values")
print(f"array:\t {array_fill}")
print(f"consolidated: \t{cons_fill}")
print('\n')


# directly inspect the actual values on disk

print("On Disk values:")
with open("blah.zarr/x/.zarray") as f:
    print(".zarray:", json.load(f)['fill_value'])

with open("blah.zarr/.zmetadata") as f:
    print("consolidated/x/.zarray:", json.load(f)['metadata']['x/.zarray']['fill_value'])

outputs:

Loaded Values
array:	 b'A'
consolidated: 	[b'65']


On Disk values:
.zarray: QQ==
consolidated/x/.zarray: [65]

Additional output

No response

ianhi avatar Apr 11 '25 22:04 ianhi

A datapoint here is that if you base64 decode the on disk value QQ==, you get b'A':

In [1]: import base64

In [2]: base64.b64decode("QQ==")
Out[2]: b'A'

jhamman avatar Apr 12 '25 16:04 jhamman

I'm not planing on looking into this further but putting down here the notes from debugging. The incorrect encoding happens in the to_dict function which is defined here: https://github.com/zarr-developers/zarr-python/blob/5f49d24eb7ff1a463d954bdb5b3d14330b86cda5/src/zarr/abc/metadata.py#L18-L37

with the issue being that the bytes were falling into the Sequence branch. A diff like this seems to fix that issue:

diff --git a/src/zarr/abc/metadata.py b/src/zarr/abc/metadata.py
index a56f9866..64ed6f4d 100644
--- a/src/zarr/abc/metadata.py
+++ b/src/zarr/abc/metadata.py
@@ -28,7 +28,7 @@ class Metadata:
             value = getattr(self, key)
             if isinstance(value, Metadata):
                 out_dict[field.name] = getattr(self, field.name).to_dict()
-            elif isinstance(value, str):
+            elif isinstance(value, str | bytes):
                 out_dict[key] = value
             elif isinstance(value, Sequence):
                 out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value)

but then there is an error in this block:

https://github.com/zarr-developers/zarr-python/blob/5f49d24eb7ff1a463d954bdb5b3d14330b86cda5/src/zarr/core/group.py#L372-L376

when ther V3JsonEncoder doesn't know how to process bytes. It seems like this ought to be a V2Encoder but there is no such object. A diff like this gets most of the way, but then there are other errors in the test suite that I didn't track down.

../../../miniforge3/lib/python3.12/json/encoder.py:439: in _iterencode
diff --git a/src/zarr/core/group.py b/src/zarr/core/group.py
index da2aa5f7..26a1c0db 100644
--- a/src/zarr/core/group.py
+++ b/src/zarr/core/group.py
@@ -369,6 +369,12 @@ class GroupMetadata(Metadata):
                             },
                         }

+                    if "fill_value" in v:
+                        from zarr.core.metadata.v2 import _serialize_fill_value
+
+                        d[f"{k}/{ZARRAY_JSON}"]["fill_value"] = _serialize_fill_value(
+                            v["fill_value"], np.dtype(v["dtype"])
+                        )
                 items[ZMETADATA_V2_JSON] = prototype.buffer.from_bytes(
                     json.dumps(
                         {"metadata": d, "zarr_consolidated_format": 1},

ianhi avatar Apr 16 '25 18:04 ianhi

I think this will be fixed by #2874

ianhi avatar May 17 '25 20:05 ianhi