zarr-python
[v3] Hierarchy api
This PR adds a declarative API for defining Zarr arrays and groups independently of storage. Using this API, users and developers can create and manipulate Zarr hierarchies, adding nodes and modifying their attributes, and serialize the hierarchy to storage with a single method call.
Implementation
This PR adds a module called hierarchy.py that contains two classes, ArrayModel and GroupModel, which model Zarr arrays and groups, respectively. "Model" here is an important concept; ArrayModel has all the array metadata attributes like shape and dtype, but ArrayModel has no connection to storage, or chunks, so you can't use ArrayModel to read or write array data. Similarly for GroupModel -- it has all the static attributes of a Zarr group, but no connection to storage, so you cannot access sub-groups or sub-arrays with a GroupModel. (You can, however, access sub-GroupModel and sub-ArrayModel instances, but these are just models.) The classes are pretty simple, so I will just paste the current code here:
class ArrayModel(ArrayV3Metadata):
    """
    A model of a Zarr v3 array.
    """

    @classmethod
    def from_stored(cls: type[Self], node: Array) -> Self:
        """
        Create an array model from a stored array.
        """
        return cls.from_dict(node.metadata.to_dict())

    def to_stored(self, store_path: StorePath, exists_ok: bool = False) -> Array:
        """
        Create a stored version of this array.
        """
        # exists_ok kwarg is unhandled until we wire it up to the
        # array creation routines
        return Array.from_dict(store_path=store_path, data=self.to_dict())

@dataclass(frozen=True)
class GroupModel(GroupMetadata):
    """
    A model of a Zarr v3 group.
    """

    members: dict[str, GroupModel | ArrayModel] | None = field(default_factory=dict)

    @classmethod
    def from_stored(cls: type[Self], node: Group, *, depth: int | None = None) -> Self:
        """
        Create a GroupModel from a Group. This function is recursive. The depth of recursion is
        controlled by the `depth` argument, which is either None (no depth limit) or a finite natural number
        specifying how deep into the hierarchy to parse.
        """
        members: dict[str, GroupModel | ArrayModel] = {}

        if depth is None:
            new_depth = depth
        else:
            new_depth = depth - 1

        if depth == 0:
            return cls(**node.metadata.to_dict(), members=None)
        else:
            for name, member in node.members:
                item_out: ArrayModel | GroupModel
                if isinstance(member, Array):
                    item_out = ArrayModel.from_stored(member)
                else:
                    item_out = GroupModel.from_stored(member, depth=new_depth)
                members[name] = item_out
            return cls(attributes=node.metadata.attributes, members=members)
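
To make the effect of the depth argument concrete, here is a minimal, self-contained sketch that mirrors the recursion in GroupModel.from_stored without depending on zarr-python. FakeArray, FakeGroup, ArraySketch, GroupSketch, and model_from_stored are hypothetical names invented for this illustration; they are not part of this PR. The sketch also assumes that members=None means "members were not parsed", while an empty dict means "parsed, and empty".

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical stand-ins for stored nodes; these are NOT zarr-python classes.
@dataclass
class FakeArray:
    shape: tuple[int, ...]


@dataclass
class FakeGroup:
    attributes: dict
    children: dict[str, FakeGroup | FakeArray] = field(default_factory=dict)


# Simplified models: metadata only, no connection to storage.
@dataclass(frozen=True)
class ArraySketch:
    shape: tuple[int, ...]


@dataclass(frozen=True)
class GroupSketch:
    attributes: dict
    # Assumption: None means "not parsed", {} means "parsed, and empty".
    members: dict[str, GroupSketch | ArraySketch] | None = field(default_factory=dict)


def model_from_stored(node: FakeGroup, depth: int | None = None) -> GroupSketch:
    """Recursively model a group, descending at most `depth` levels."""
    if depth == 0:
        # Depth budget exhausted: keep the group's metadata, skip its members.
        return GroupSketch(attributes=node.attributes, members=None)
    new_depth = None if depth is None else depth - 1
    members: dict[str, GroupSketch | ArraySketch] = {}
    for name, child in node.children.items():
        if isinstance(child, FakeArray):
            members[name] = ArraySketch(shape=child.shape)
        else:
            members[name] = model_from_stored(child, depth=new_depth)
    return GroupSketch(attributes=node.attributes, members=members)


root = FakeGroup(
    attributes={"title": "root"},
    children={
        "a": FakeArray(shape=(10,)),
        "sub": FakeGroup(attributes={}, children={"b": FakeArray(shape=(2, 2))}),
    },
)

full = model_from_stored(root)              # no depth limit: the whole tree is modeled
shallow = model_from_stored(root, depth=1)  # direct members only; "sub" is left unparsed
```

With depth=1, the root's direct members are modeled, but the members of "sub" are left as None rather than being read.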
Goals
- This work is necessary for single-shot hierarchy creation with batched IO. If we can leverage batched IO operations, it should be possible to concurrently write (and read) all the zarr.json metadata documents in a large hierarchy, which should vastly speed up these interactions on high latency storage.
- A flattened, consolidated-metadata-like internal representation for easy hierarchy creation. A Zarr hierarchy can be represented as dict[str_that_obeys_path_semantics, ArrayModel | GroupModel]. This has been useful over in pydantic-zarr for a variety of things, and I think it would be useful here. It could also provide a serialization format for consolidated metadata in zarr v3, which so far has not been defined.
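
The two goals above can be sketched together: flatten a nested model into a path-keyed dict, then write one zarr.json per node concurrently. This is only an illustration under stated assumptions. ArraySketch, GroupSketch, flatten, MemoryStore, and write_hierarchy are hypothetical names, the metadata documents are deliberately simplified rather than complete Zarr v3 documents, and the store is an in-memory stand-in with simulated latency, not a zarr-python store.

```python
from __future__ import annotations

import asyncio
import json
from dataclasses import dataclass, field


# Hypothetical, simplified models; not the classes from this PR.
@dataclass(frozen=True)
class ArraySketch:
    shape: tuple[int, ...]
    dtype: str

    def to_dict(self) -> dict:
        # Simplified document; a real zarr.json for an array has more fields.
        return {"node_type": "array", "shape": list(self.shape), "data_type": self.dtype}


@dataclass(frozen=True)
class GroupSketch:
    attributes: dict = field(default_factory=dict)
    members: dict[str, GroupSketch | ArraySketch] = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {"node_type": "group", "attributes": self.attributes}


def flatten(node: GroupSketch | ArraySketch, path: str = "") -> dict[str, GroupSketch | ArraySketch]:
    """Map each node's path to a model; groups in the result carry no members."""
    if isinstance(node, ArraySketch):
        return {path: node}
    flat: dict[str, GroupSketch | ArraySketch] = {path: GroupSketch(attributes=node.attributes)}
    for name, child in node.members.items():
        flat.update(flatten(child, f"{path}/{name}"))
    return flat


class MemoryStore:
    """Hypothetical async key-value store standing in for high-latency storage."""

    def __init__(self) -> None:
        self.data: dict[str, bytes] = {}

    async def set(self, key: str, value: bytes) -> None:
        await asyncio.sleep(0.01)  # simulated network latency
        self.data[key] = value


async def write_hierarchy(store: MemoryStore, flat: dict[str, GroupSketch | ArraySketch]) -> None:
    """Write one zarr.json per node, issuing all writes concurrently."""
    await asyncio.gather(
        *(
            store.set(f"{path}/zarr.json".lstrip("/"), json.dumps(model.to_dict()).encode())
            for path, model in flat.items()
        )
    )


tree = GroupSketch(
    attributes={"name": "root"},
    members={
        "a": ArraySketch(shape=(10,), dtype="uint8"),
        "sub": GroupSketch(members={"b": ArraySketch(shape=(2, 2), dtype="float32")}),
    },
)

flat = flatten(tree)  # keys: "", "/a", "/sub", "/sub/b"
store = MemoryStore()
asyncio.run(write_hierarchy(store, flat))
```

Because the flattened form has no nesting, every metadata document is independent of the others, which is what lets all the writes be issued in a single asyncio.gather call instead of one round trip per node.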
Process
Unlike a lot of other v3 efforts, this PR adds new functionality that was never in zarr-python before. I'm basing the design here on work I did over in pydantic-zarr, so there's some prior art, but I am happy to explore and experiment as needed. It might take a while before we have an API everyone is happy with.