zarr-specs

consolidated metadata storage for Zarr v3

Open grlee77 opened this issue 3 years ago • 12 comments

In the current draft of the v3 spec in zarr-developers/zarr-python#898, consolidated metadata has been implemented similarly to v2. Currently this metadata gets stored in the group meta/root/consolidated, but this is not part of the spec. Likely we need to specify an extension for how consolidated metadata should be handled.

grlee77 avatar Mar 11 '22 02:03 grlee77

Is V3 considering metadata compression? I'm working with a dataset where the consolidated metadata (v2) is 1.2GB in size, and opening it from S3 takes a very long time. After gzipping, the metadata shrinks to 20MB, so it would be nice to have compressed metadata.
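
A minimal sketch of the compression idea, using only the Python standard library. The document below is a made-up stand-in for a v2 consolidated metadata file with many small arrays (the array names, shapes and count are assumptions, not the dataset described above); it only illustrates why highly repetitive JSON metadata compresses so well.

```python
import gzip
import json

# Hypothetical stand-in for a v2 consolidated metadata document describing
# many small arrays; the entries are nearly identical, so they compress well.
metadata = {
    f"array_{i}/.zarray": {
        "zarr_format": 2,
        "shape": [100],
        "chunks": [100],
        "dtype": "<f8",
        "compressor": None,
        "fill_value": 0,
        "filters": None,
        "order": "C",
    }
    for i in range(1000)
}
document = {"zarr_consolidated_format": 1, "metadata": metadata}

raw = json.dumps(document).encode("utf-8")
compressed = gzip.compress(raw)

# Decompress and parse again to confirm the round trip is lossless.
restored = json.loads(gzip.decompress(compressed))

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```

The actual ratio depends on how repetitive the real metadata is, so the numbers printed here should not be read as a prediction for any particular dataset.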

ianliu avatar Sep 06 '22 12:09 ianliu

Hi @ianliu, thanks, this is very useful to know. Adding a capability for compression into a v3 extension for consolidated metadata sounds like a good idea to me.

alimanfoo avatar Sep 07 '22 08:09 alimanfoo

@ianliu I'm curious what makes the metadata so big in your case? Do you have lots of arrays? Or very large custom attrs?

V3 allows defining a custom metadata encoding, so a compressed format could be used.

Also, a major question for V3 is whether user metadata is a separate document or not (see https://github.com/zarr-developers/zarr-specs/pull/149#discussion_r942225058). If you have 1GB of custom attrs in a single array, that would be a strong motivation not to put user metadata together with the core array metadata.

rabernat avatar Sep 07 '22 11:09 rabernat

@rabernat in my particular case, I have many small arrays. It could be argued that there is a better data representation that would consolidate multiple arrays into a single array, thus reducing metadata, but the way our program is set up makes this a little bit trickier (mainly because our program produces results in parallel).

ianliu avatar Sep 07 '22 13:09 ianliu

:100: for potential consolidated metadata features but

> but this is not part of the spec. Likely we need to specify an extension for how consolidated metadata should be handled.

would this suggest stripping the current consolidated metadata support in V3 until there's an extension?

joshmoore avatar Sep 12 '22 14:09 joshmoore

I've drafted a version of this at https://github.com/TomAugspurger/zeps/blob/consolidated-metadata/draft/ZEP0006.md. A few questions before turning that into a PR:

Is this appropriate for a ZEP? Or would it go in the zarr-specs repo?


One major design decision is whether this consolidated metadata should be embedded in the zarr.json, or whether it should go in a file next to the root zarr.json (e.g. consolidated_metadata.json). Doing it in the zarr.json would make collecting all the metadata possible in a single HTTP request. With separate zarr.json and consolidated_metadata.json you would need two. The downside of putting it in zarr.json is potentially making any read of that file slow, even if you don't need the consolidated metadata.

I'm also a bit curious about how much async zarr-python reduces the pain of hierarchies with a large number of nodes. With dozens of HTTP requests in flight at the same time, maybe it's not so bad? If https://github.com/zarr-developers/zarr-specs/issues/284 (listing the child nodes in the root zarr.json) were implemented, then you'd be able to make all but the first HTTP request concurrently.
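
A rough sketch of that request pattern with `asyncio`, assuming the root document lists its children. The `children` key, the node names and the in-memory `FAKE_STORE` are all hypothetical; `asyncio.sleep` stands in for network latency.

```python
import asyncio

# Hypothetical in-memory stand-in for a remote store. The "children" key is an
# assumption modelled on the child-listing idea, not part of the v3 spec.
FAKE_STORE = {
    "zarr.json": {"node_type": "group", "children": ["air", "lat", "lon"]},
    "air/zarr.json": {"node_type": "array", "shape": [2920, 25, 53]},
    "lat/zarr.json": {"node_type": "array", "shape": [25]},
    "lon/zarr.json": {"node_type": "array", "shape": [53]},
}


async def fetch_json(key, latency=0.01):
    """Pretend to issue one HTTP GET; the sleep simulates network latency."""
    await asyncio.sleep(latency)
    return FAKE_STORE[key]


async def load_hierarchy():
    # The first request must finish before we know which children exist.
    root = await fetch_json("zarr.json")
    names = root["children"]
    # All remaining requests are issued concurrently.
    docs = await asyncio.gather(*(fetch_json(f"{n}/zarr.json") for n in names))
    return root, dict(zip(names, docs))


root, children = asyncio.run(load_hierarchy())
print(sorted(children))
```

With this shape the wall-clock cost is roughly two round trips regardless of the number of children, as long as the store tolerates that many requests in flight.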

TomAugspurger avatar Aug 21 '24 18:08 TomAugspurger

I like the draft ZEP @TomAugspurger. Thanks for putting it together.

As far as I understand, the current spec does not provide a way to add arbitrary extensions to a metadata document (this could be changed though). Provided we can address this challenge, we likely will want to support both an inline and external consolidated metadata object.

// inline
"consolidated_metadata": {
    "kind": "inline",
    "metadata": {
        "air": {
            "shape": [2920, 25, 53],
            "fill_value": 0,
            ...
            "node_type": "array"
        },
        "lat": {
            "shape": [25],
            "fill_value": "NaN",
            ...
            "node_type": "array"
        }
    }
}
// external
"consolidated_metadata": {
    "kind": "external",
    "metadata": "consolidated_metadata.json"
}
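
A sketch of how a reader might dispatch on `kind`, assuming the two layouts above and a store that behaves like a mapping from keys to JSON bytes. The function name `load_consolidated` and the read counter are hypothetical; the counter is only there to make the one-request versus two-request difference visible.

```python
import json


def load_consolidated(store, root_key="zarr.json"):
    """Return (metadata_by_path, number_of_reads) for a mapping of key -> bytes."""
    reads = 1
    root = json.loads(store[root_key])
    consolidated = root.get("consolidated_metadata")
    if consolidated is None:
        # No consolidated metadata: the caller falls back to per-node reads.
        return None, reads
    kind = consolidated["kind"]
    if kind == "inline":
        return consolidated["metadata"], reads
    if kind == "external":
        # One extra read to fetch the side-car document.
        reads += 1
        return json.loads(store[consolidated["metadata"]]), reads
    raise ValueError(f"unknown consolidated_metadata kind: {kind!r}")


nodes = {"lat": {"node_type": "array", "shape": [25]}}

inline_store = {
    "zarr.json": json.dumps({
        "zarr_format": 3,
        "node_type": "group",
        "consolidated_metadata": {"kind": "inline", "metadata": nodes},
    }).encode("utf-8"),
}

external_store = {
    "zarr.json": json.dumps({
        "zarr_format": 3,
        "node_type": "group",
        "consolidated_metadata": {
            "kind": "external",
            "metadata": "consolidated_metadata.json",
        },
    }).encode("utf-8"),
    "consolidated_metadata.json": json.dumps(nodes).encode("utf-8"),
}

print(load_consolidated(inline_store)[1], load_consolidated(external_store)[1])
```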

An alternative direction we could take is to define a Consolidated Metadata Store which could enable this functionality without needing to touch the root metadata document.

jhamman avatar Aug 21 '24 22:08 jhamman

> the current spec does not provide a way to add arbitrary extensions to a metadata document

In the sense that that's just not spelled out anywhere as a thing that can be done? Or that adding additional keys might break things?

> we likely will want to support both an inline and external consolidated metadata object

I like the idea of future-proofing things by including a "kind" enum to indicate where to find the consolidated metadata, but maybe we pick only one for the spec? This is minor, but I worry about putting too much complexity on the implementors.

TomAugspurger avatar Aug 22 '24 01:08 TomAugspurger

> In the sense that that's just not spelled out anywhere as a thing that can be done? Or that adding additional keys might break things?

I don't think we have defined the expected behavior for what will happen if an implementation encounters an unexpected key. So yes, I think things could break.

edit: there is this section in the spec:

> The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

Now, I don't know how/where you would put "must_understand" in this case. See also https://github.com/zarr-developers/zarr-specs/issues/270
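
One possible reading of the quoted rule, sketched as a validator. Placing `"must_understand": false` inside the `consolidated_metadata` object itself is an assumption made for illustration, since where it belongs is exactly the open question; the set of known group fields is likewise my reading of the v3 core spec rather than an authoritative list.

```python
# Assumed set of fields a v3 group metadata document may carry.
KNOWN_GROUP_FIELDS = {"zarr_format", "node_type", "attributes"}


def check_unknown_fields(metadata, known=KNOWN_GROUP_FIELDS):
    """Return the unknown fields that may be ignored; raise on any other."""
    ignored = []
    for name, value in metadata.items():
        if name in known:
            continue
        # Only objects that explicitly opt out may be skipped.
        if isinstance(value, dict) and value.get("must_understand") is False:
            ignored.append(name)
            continue
        raise ValueError(f"unknown metadata field: {name!r}")
    return ignored


root = {
    "zarr_format": 3,
    "node_type": "group",
    "consolidated_metadata": {"must_understand": False, "kind": "inline"},
}
print(check_unknown_fields(root))
```

Under this reading, an implementation that does not know about consolidated metadata would skip the key and read each node's metadata individually.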

> maybe we pick only one for the spec?

+1, pick one for now. We can always expand this later.

jhamman avatar Aug 22 '24 06:08 jhamman

I'm not sure I see the motivation for disallowing extra keys in the metadata objects. That seems to make it hard to evolve the spec in forwards-compatible ways.

Anyway, maybe in this case it's fine to use because we're making a "future version of the spec"? And I guess we'll set must_understand: false to indicate that older versions shouldn't error, they should just ignore the extra key and fall back to the non-consolidated metadata.

TomAugspurger avatar Aug 22 '24 16:08 TomAugspurger

> Is this appropriate for a ZEP? Or would it go in the zarr-specs repo?

In my mind, this is a great ZEP.

from PR (TODO: is there a better way to indicate that a dataset uses an extension?)

That should definitely be some form of metadata in zarr.json. I'd very much be for using this as a way to kick the tires of the extension process. i.e., if you/we can't do it, then that needs to be fixed at the spec level.

> jhamman commented 2 days ago: As far as I understand, the current spec does not provide a way to add arbitrary extensions to a metadata document (this could be changed though)

I think it's implicitly possible but not (well-)documented. So :+1: for doing what we need to do to make it possible.

> TomAugspurger commented 20 hours ago: An alternative direction we could take is to define a Consolidated Metadata Store which could enable this functionality without needing to touch the root metadata document.

I'd urge caution here. This led to many issues in v2. Minimally it should be a "wrapping store" so that they can be composed, because otherwise they lead to an explosion of classes that are needed (e.g. "NestedRemoteDirectoryStore"). Additionally, I'd suggest the metadata MUST be in the zarr.json regardless of the implementation.
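
A small sketch of what "wrapping" could look like, assuming stores behave like read-only mappings from keys to bytes and that consolidated metadata lives in the root zarr.json under an inline layout. Both class names are hypothetical and this is not zarr-python's store API; the second wrapper exists only to show that wrappers stack.

```python
import json
from collections.abc import Mapping


class ConsolidatedWrapper(Mapping):
    """Read-only wrapper that answers metadata reads from the root zarr.json.

    `inner` can be any mapping of key -> bytes, including another wrapper,
    so wrappers compose instead of multiplying store classes.
    """

    def __init__(self, inner):
        self._inner = inner
        root = json.loads(inner["zarr.json"])
        # Hypothetical layout: node path -> metadata document.
        nodes = root.get("consolidated_metadata", {}).get("metadata", {})
        self._docs = {f"{path}/zarr.json": doc for path, doc in nodes.items()}

    def __getitem__(self, key):
        if key in self._docs:
            return json.dumps(self._docs[key]).encode("utf-8")
        return self._inner[key]

    def __iter__(self):
        yield from self._docs
        yield from (k for k in self._inner if k not in self._docs)

    def __len__(self):
        return len(set(self._docs) | set(self._inner))


class CountingWrapper(Mapping):
    """Second wrapper, used here only to show composition; counts reads."""

    def __init__(self, inner):
        self._inner = inner
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return self._inner[key]

    def __iter__(self):
        return iter(self._inner)

    def __len__(self):
        return len(self._inner)


base = {
    "zarr.json": json.dumps({
        "zarr_format": 3,
        "node_type": "group",
        "consolidated_metadata": {
            "kind": "inline",
            "metadata": {"lat": {"node_type": "array", "shape": [25]}},
        },
    }).encode("utf-8"),
    "lat/c/0": b"\x00" * 8,
}
counting = CountingWrapper(base)
store = ConsolidatedWrapper(counting)
```

Reading `store["lat/zarr.json"]` is answered from the consolidated copy without touching the inner store, while chunk keys such as `"lat/c/0"` pass straight through.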

> Now, I don't know how/where you would put "must_understand" in this case. See also https://github.com/zarr-developers/zarr-specs/issues/270

:+1: for having must_understand set. FWIW, I've also been considering writing a ZEP to say that "all metadata will be in a RO-Crate file" for OME-Zarr. This could go hand in hand. My thinking was there should be a flag in zarr.json to say "all metadata loading should follow this other protocol", i.e. a loader or something that could be pip installed.

joshmoore avatar Aug 23 '24 13:08 joshmoore

Whoops, I just pushed this as a PR at https://github.com/zarr-developers/zarr-specs/pull/309, rather than a ZEP, sorry.

> That should definitely be some form of metadata in zarr.json. I'd very much be for using this as a way to kick the tires of the extension process. i.e., if you/we can't do it, then that needs to be fixed at the spec level.

I'll open a separate issue to discuss this. In my experience, STAC's approach, where each object carries a stac_extensions list naming the extensions it uses, has worked well.
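
For illustration only, a STAC-style declaration could be checked along these lines. The `extensions` key and the example.com identifier are hypothetical; nothing like this exists in the Zarr spec today, and STAC's own field is `stac_extensions`, a list of schema URIs.

```python
# Hypothetical extension identifier, used only for illustration.
CONSOLIDATED = "https://example.com/zarr-extensions/consolidated-metadata/v1"


def unsupported_extensions(node_metadata, supported, key="extensions"):
    """Return the declared extensions that this reader does not support."""
    declared = node_metadata.get(key, [])
    return [uri for uri in declared if uri not in supported]


root = {"zarr_format": 3, "node_type": "group", "extensions": [CONSOLIDATED]}
print(unsupported_extensions(root, supported={CONSOLIDATED}))
```

A reader could then decide per extension whether an unsupported entry is fatal or safe to ignore.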

TomAugspurger avatar Aug 23 '24 16:08 TomAugspurger