zarr-python

[v3] Hierarchy api

Open d-v-b opened this issue 8 months ago • 0 comments

This PR adds a declarative API for defining Zarr arrays and groups independently of storage. Using this API, users and developers can create and manipulate Zarr hierarchies, adding nodes and modifying their attributes, and serialize the hierarchy to storage with a single method call.

Implementation

This PR adds a module called hierarchy.py that contains two classes, ArrayModel and GroupModel, which model Zarr arrays and groups, respectively. "Model" here is an important concept: ArrayModel has all the array metadata attributes, like shape and dtype, but it has no connection to storage or chunks, so you can't use ArrayModel to read and write array data. The same goes for GroupModel: it has all the static attributes of a Zarr group, but no connection to storage, so you cannot access sub-groups or sub-arrays with a GroupModel. (You can, however, access sub-GroupModel and sub-ArrayModel instances, but these are just models.) The classes are pretty simple, so I will just paste the current code here:

class ArrayModel(ArrayV3Metadata):
    """
    A model of a Zarr v3 array.
    """

    @classmethod
    def from_stored(cls: type[Self], node: Array) -> Self:
        """
        Create an array model from a stored array.
        """
        return cls.from_dict(node.metadata.to_dict())

    def to_stored(self, store_path: StorePath, exists_ok: bool = False) -> Array:
        """
        Create a stored version of this array.
        """
        # exists_ok kwarg is unhandled until we wire it up to the
        # array creation routines

        return Array.from_dict(store_path=store_path, data=self.to_dict())


@dataclass(frozen=True)
class GroupModel(GroupMetadata):
    """
    A model of a Zarr v3 group.
    """

    members: dict[str, GroupModel | ArrayModel] | None = field(default_factory=dict)

    @classmethod
    def from_stored(cls: type[Self], node: Group, *, depth: int | None = None) -> Self:
        """
        Create a GroupModel from a Group. This function is recursive. The depth of recursion is
        controlled by the `depth` argument, which is either None (no depth limit) or a finite natural number
        specifying how deep into the hierarchy to parse.
        """
        # depth == 0 means "do not descend": return this group without parsing members.
        if depth == 0:
            return cls(**node.metadata.to_dict(), members=None)

        # None means no depth limit; otherwise each level of recursion uses up one unit.
        new_depth = None if depth is None else depth - 1

        members: dict[str, GroupModel | ArrayModel] = {}
        for name, member in node.members:
            item_out: ArrayModel | GroupModel
            if isinstance(member, Array):
                item_out = ArrayModel.from_stored(member)
            else:
                item_out = GroupModel.from_stored(member, depth=new_depth)

            members[name] = item_out

        return cls(attributes=node.metadata.attributes, members=members)
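
To make the `depth` semantics of `GroupModel.from_stored` concrete, here is a minimal, storage-free sketch of the same recursion. `ToyArray`, `ToyGroup` and `model_from_tree` are hypothetical stand-ins invented for illustration (a nested dict plays the role of a stored group); none of them are part of zarr-python.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyArray:
    """Hypothetical stand-in for ArrayModel: metadata only, no chunks."""

    shape: tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class ToyGroup:
    """Hypothetical stand-in for GroupModel."""

    attributes: dict = field(default_factory=dict)
    # None means "members were not parsed"; {} means "no members".
    members: dict[str, ToyGroup | ToyArray] | None = field(default_factory=dict)


def model_from_tree(node: dict, depth: int | None = None) -> ToyGroup:
    """Mirror the recursion in GroupModel.from_stored on a nested dict.

    A "stored" group is modelled as {"attributes": {...}, "members": {...}},
    where each member is either another such dict or a ToyArray.
    """
    if depth == 0:
        # Depth exhausted: keep the group itself but do not look inside it.
        return ToyGroup(attributes=node["attributes"], members=None)

    new_depth = None if depth is None else depth - 1
    members: dict[str, ToyGroup | ToyArray] = {}
    for name, child in node["members"].items():
        if isinstance(child, ToyArray):
            members[name] = child
        else:
            members[name] = model_from_tree(child, depth=new_depth)
    return ToyGroup(attributes=node["attributes"], members=members)


tree = {
    "attributes": {"title": "root"},
    "members": {
        "a": ToyArray(shape=(10,), dtype="uint8"),
        "g": {
            "attributes": {},
            "members": {"b": ToyArray(shape=(2, 2), dtype="float32")},
        },
    },
}

full = model_from_tree(tree)  # no depth limit: every node is parsed
shallow = model_from_tree(tree, depth=1)  # sub-group "g" is kept, its members are not
```

In this sketch `shallow.members["g"].members` is `None`, which distinguishes "not parsed" from an empty group whose `members` is `{}`.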

Goals

  • This work is necessary for single-shot hierarchy creation with batched IO. If we can leverage batched IO operations, it should be possible to concurrently write (and read) all the zarr.json metadata documents in a large hierarchy, which should vastly speed up these interactions on high-latency storage.
  • A flattened, consolidated-metadata-like internal representation for easy hierarchy creation. A Zarr hierarchy can be represented as dict[str_that_obeys_path_semantics, ArrayModel | GroupModel]. This has been useful over in pydantic-zarr for a variety of things, and I think it would be useful here. It could also provide a serialization format for consolidated metadata in Zarr v3, which so far has not been defined.
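
The two goals above can be sketched together. The example below flattens a nested hierarchy into a path-keyed dict and then writes one metadata document per node concurrently with `asyncio.gather`. Everything here is an assumption made for illustration: `ToyArray`, `ToyGroup`, `flatten`, `put` and `write_hierarchy` are hypothetical names, the store is a plain in-memory dict, and the JSON documents are simplified placeholders rather than valid Zarr v3 metadata.

```python
from __future__ import annotations

import asyncio
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyArray:
    """Hypothetical stand-in for ArrayModel."""

    shape: tuple[int, ...]
    dtype: str

    def to_dict(self) -> dict:
        # Simplified placeholder, not a complete Zarr v3 array document.
        return {"node_type": "array", "shape": list(self.shape), "data_type": self.dtype}


@dataclass(frozen=True)
class ToyGroup:
    """Hypothetical stand-in for GroupModel."""

    attributes: dict = field(default_factory=dict)
    members: dict[str, ToyGroup | ToyArray] = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Members are excluded: every node gets its own document.
        return {"node_type": "group", "attributes": self.attributes}


def flatten(node: ToyGroup | ToyArray, path: str = "") -> dict[str, ToyGroup | ToyArray]:
    """Return {path: node} for every node in the hierarchy; the root has path ""."""
    if isinstance(node, ToyArray):
        return {path: node}
    # Groups are stored without members; the paths carry the structure.
    flat: dict[str, ToyGroup | ToyArray] = {path: ToyGroup(attributes=node.attributes)}
    for name, child in node.members.items():
        flat.update(flatten(child, f"{path}/{name}"))
    return flat


async def put(store: dict[str, bytes], key: str, value: bytes) -> None:
    """Hypothetical async store write; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    store[key] = value


async def write_hierarchy(store: dict[str, bytes], root: ToyGroup) -> None:
    """Issue all metadata writes at once instead of one after another."""
    writes = [
        put(store, f"{path}/zarr.json".lstrip("/"), json.dumps(node.to_dict()).encode())
        for path, node in flatten(root).items()
    ]
    await asyncio.gather(*writes)


root = ToyGroup(
    attributes={"title": "root"},
    members={
        "a": ToyArray(shape=(10,), dtype="uint8"),
        "g": ToyGroup(members={"b": ToyArray(shape=(2, 2), dtype="float32")}),
    },
)

store: dict[str, bytes] = {}
asyncio.run(write_hierarchy(store, root))
```

Because the writes overlap, the total time in this toy setup is roughly one simulated round trip rather than one per node, which is the effect the first goal is after on high-latency storage.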

Process

Unlike a lot of other v3 efforts, this PR adds new functionality that was never in zarr-python before. I'm basing the design here on work I did over in pydantic-zarr, so there's some prior art, but I am happy to explore and experiment as needed. It might take a while before we have an API everyone is happy with.

d-v-b, May 26 '24 16:05