zarr-python
zarr validation and consistency checking
i could not find any explicit reference in the documentation to validating a zarr store, hence opening this issue.
we are supporting zarr nested directory stores as a file type for our data archive and are looking to validate and inspect the structure of the input before upload. some questions have come up that i am posting here:
- is it simply sufficient to use the zarr reader to open a zarr store, and if it opens, the zarr store is valid, if not it will raise an exception?
- if someone accidentally puts in extra files in the directory, is there a way for us to detect these files as not relevant to the store?
- does the zarr python library already have a consistency check util? and can it detect which underlying directory elements are different, through some form of a tree hash for example?
I've found the answer to the first question already: an invalid zarr will not always raise an error immediately upon opening. For example, if a chunk file is malformed, this won't be detected until you actually try to use the containing array.
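To illustrate, here is a minimal sketch (assuming the zarr-python v2 API and a local DirectoryStore; the path and shapes are made up) where opening succeeds but reading the corrupted chunk fails:

```python
import zarr

# create a small array in a directory store
store = zarr.DirectoryStore("example.zarr")
z = zarr.open(store, mode="w", shape=(100, 100), chunks=(10, 10), dtype="i4")
z[:] = 42

# corrupt one chunk file on disk
with open("example.zarr/0.0", "wb") as f:
    f.write(b"this is not a valid chunk")

# re-opening succeeds, because only the metadata (.zarray) is read at open time
z2 = zarr.open(store, mode="r")

# the problem only surfaces when the corrupted chunk is actually decoded
try:
    _ = z2[0, 0]
except Exception as exc:
    print("read failed:", exc)
```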
Welcome and thanks for the questions.
For example, if a chunk file is malformed, this won't be detected until you actually try to use the containing array.
This is definitely deliberate behavior. Zarr arrays can be petabytes in size with millions of chunks! Individually checking each chunk on opening would not be the right default behavior. Missing chunks are valid in Zarr as well--they represent missing data.
- is it simply sufficient to use the zarr reader to open a zarr store, and if it opens, the zarr store is valid, if not it will raise an exception?
That depends on what you mean by "valid". If you are asking whether the store can be opened by Zarr, then yes, this is sufficient. If you are asking whether your data have been corrupted, then no. You may consider using array.hexdigest to verify data integrity.
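For example, a sketch of that kind of integrity check (zarr-python v2 Array API assumed; the expected digest is a placeholder you would have recorded at write time):

```python
import zarr

z = zarr.open("example.zarr", mode="r")

# hexdigest hashes the array metadata and every chunk, so it can be expensive for
# large stores, but it gives an end-to-end fingerprint of the array's contents
expected = "..."  # digest recorded when the data were produced
actual = z.hexdigest(hashname="sha1")
print("intact" if actual == expected else "contents have changed")
```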
- if someone accidentally puts in extra files in the directory, is there a way for us to detect these files as not relevant to the store?
Zarr will just ignore those files. I don't think they'll break anything.
- does the zarr python library already have a consistency check util? and can it detect which underlying directory elements are different, through some form of a tree hash for example?
See comments above about hexdigest. Also ongoing discussions in #877.
@rabernat - thank you.
ah, hexdigest would apply as an overall checksum. we can compute it, but it could potentially be a very expensive operation. good to know it exists. we are (at least for our backend on s3) working on a tree-hash scheme to store checksums associated with every file and "directory" in the tree.
if zarr ignores any irrelevant files, we may even consider computing and storing the checksums locally or in some zipped checksum store (to prevent inode explosion), and if this works we may propose a tree hash scheme for diff detection. if you already have any conversations on diff detection, would love to know.
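As a rough illustration of what such a tree-hash scheme could look like (a hypothetical Merkle-style sketch over a local directory tree, not the actual algorithm used for the s3 backend): each file gets a content checksum, each directory's checksum is derived from the names and checksums of its children, and two trees can then be diffed top-down by comparing directory digests and descending only where they differ.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    # checksum of a single file's bytes
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def tree_digest(path: Path) -> str:
    # directories hash the sorted (name, digest) pairs of their children,
    # so any change in a subtree propagates up to the root digest
    if path.is_file():
        return file_digest(path)
    h = hashlib.sha256()
    for child in sorted(path.iterdir()):
        h.update(child.name.encode())
        h.update(tree_digest(child).encode())
    return h.hexdigest()

print(tree_digest(Path("example.zarr")))
```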
sharding support would be fantastic and would really help optimize the nested directory structure to minimize the number of files. i'm hoping this won't break any xarray-type access when it's implemented and will be transparent to any end user. given the datasets we are handling, the current recommended chunk size is 64**3 and that's resulting in about a million files per zarr store.
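For a sense of the arithmetic behind that file count (the volume size below is hypothetical):

```python
# hypothetical volume: 6400 voxels per axis, chunked at 64 voxels per axis
voxels_per_axis = 6400
chunk_edge = 64

chunks_per_axis = voxels_per_axis // chunk_edge   # 100
n_chunk_files = chunks_per_axis ** 3              # 1,000,000 chunk files, plus metadata files
print(n_chunk_files)
```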
we are (at least for our backend on s3) working on a tree-hash scheme to store checksums associated with every file and "directory" in the tree.
In that case you may be interested in the conversation in https://github.com/zarr-developers/zarr-python/issues/392#issuecomment-890018515 and https://github.com/zarr-developers/zarr-specs/issues/82. IPFS solves this problem very elegantly, and a lot of us are interested in plugging Zarr into IPFS.
In that case you may be interested in the conversation in #392 (comment) and zarr-developers/zarr-specs#82. IPFS solves this problem very elegantly, and a lot of us are interested in plugging Zarr into IPFS.
i love ipfs (at least the concept), but the efficiency is not quite there yet for practical use. yes, ipfs would solve several of these things. we have a bottleneck in that ipfs would require a client running in front of it, and since we are using a public dataset program, we have some constraints in terms of how to support it. we are indeed considering ipfs (or its variants) as a part of an institutional infrastructure across universities. i'll check in on those conversations.
Hi @satra,
A few quick answers while we see if anyone else in the community has built anything.
- is it simply sufficient to use the zarr reader to open a zarr store, and if it opens, the zarr store is valid, if not it will raise an exception?
In terms of the metadata, I'd believe so. zarr-python tends to be fairly lenient about the chunks until access (and missing chunks are considered legitimate).
- if someone accidentally puts in extra files in the directory, is there a way for us to detect these files as not relevant to the store?
The files that are relevant to the store are quite limited. If you treat everything but ^\d+$ and ^[.]z.*$ as extraneous, you should be alright.
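A sketch of that heuristic for a nested directory store on a local filesystem (flat stores name chunk files like 0.0.0, so the pattern would need adjusting there):

```python
import os
import re

# files that belong to the store: digit-named chunk leaves, or .zarray/.zgroup/.zattrs/.zmetadata
STORE_FILE = re.compile(r"^(\d+|[.]z.*)$")

def extraneous_files(root):
    """Yield paths under a nested directory store that don't look like zarr content."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not STORE_FILE.match(name):
                yield os.path.join(dirpath, name)

for path in extraneous_files("example.zarr"):
    print("not part of the store:", path)
```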
- does the zarr python library already have a consistency check util? and can it detect which underlying directory elements are different, through some form of a tree hash for example?
Not that I know of. See also https://github.com/zarr-developers/zarr-python/issues/392
Edit: interesting! I didn't see any of the previous responses when I was responding...
It looks like this is now being addressed by zarr_checksum. Is that right @satra?
@jakirkham - indeed, that's a tree hash algo we implemented for our needs, and we are using that digest for files in dandi. it's a pure object-based hash with no semantics. we may in the future also want to consider an isomorphic hash, where the bits can change but the content is the same (e.g. moving from uint8 to uint16).
also given the sizes of file trees, we may want to consider ways to optimize both hash check and diff detection.
i'll close this for now. i had completely forgotten about this issue, so thank you @jakirkham
@satra you might be interested in pydantic-zarr. It's designed to normatively represent zarr hierarchies. I think some of the things you are looking for could be built with this library, and it's very small (right now), so you could just implement the same functionality in your own tooling very easily without adding it as a dependency.
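A rough sketch of how that could look (the import path and method names here follow my reading of the pydantic-zarr README and may differ between versions, so treat them as assumptions): extract a declarative, JSON-serializable spec from an existing hierarchy, which can then be stored, diffed, or validated against.

```python
import zarr
from pydantic_zarr import GroupSpec  # import path assumed; check the pydantic-zarr docs

group = zarr.open_group("some_hierarchy.zarr", mode="r")

# build a declarative model of the hierarchy (attributes, shapes, dtypes, ...)
spec = GroupSpec.from_zarr(group)

# serialize it; .json() vs .model_dump_json() depends on the pydantic version in use
print(spec.json())
```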
thanks @d-v-b, looks nice and would be easy to incorporate since we already have a pydantic-based setup for our schema.
a possibility that we are experimenting with in a few projects is to use linkml, which abstracts the metadata model into a yaml definition and then uses generators to create various toolkits (amongst them pydantic). there are many little issues at this point, but they have effectively collapsed a lot of the patterns we use across projects into a single markup language + generators.
is there anything specific you'd need from zarr-python to make this easier? something on my wishlist is a specification for a JSON-serializable representation of a zarr hierarchy, which would make pydantic-zarr merely one implementation of that spec.
@d-v-b - sorry for the very late response. indeed linkml's data model would allow that and i know some of the linkml folks are in conversation with the NWB folks regarding array data type in linkml as well. here is an intro talk covering basics of linkml: https://zenodo.org/record/7778641
i think it would be a good opportunity to turn the zarr spec into a data model that may fit in with many different worlds of use cases.
This is indeed actively being worked on within the LinkML team at the moment. Just tagging @rly who is currently involved in this.