ZEP0004 Review - Zarr Conventions

Open sanketverma1704 opened this issue 2 years ago • 8 comments

Hi everyone! 👋🏻

I did some preliminary work for ZEP0004 review, as mentioned here.

@rabernat, please have a look and let us know your thoughts. Thanks!

sanketverma1704, Aug 17 '23 23:08

Thanks so much @MSanKeys963 for getting this started. It's a perfect place to start. Here's what I will try to do over the next few days:

  • [x] Port over some of the text from ZEP4 to the specs website
  • [x] Create a convention template
  • [x] Create at least one full example convention

In the meantime, we can use this PR to continue the discussion started in https://github.com/zarr-developers/zeps/pull/28/files where ZEP4 was first proposed.

@ivirshup I know you have lots of ideas here, and you have been very patient as this ZEP has moved forward very slowly. 🙏 I'd love to hear more about your use cases for conventions in anndata and the other projects you're involved with.

rabernat, Aug 18 '23 00:08

Would it make sense to suggest a zarr convention providing a json-schema in addition to a document?

From what I read in https://github.com/zarr-developers/zeps/pull/28 the goal of ZEP4 is to have a common place to find information about how to store some domain-specific metadata in a standard-ish way, rather than something that should be strictly enforced. So a convention document is more important than a json-schema, and the latter should really be optional.

Having a json-schema may be nice to avoid making mistakes or misinterpretations while implementing a convention in a domain-specific library. Using existing tooling would make the process faster too I guess? I'm not familiar with json-schema, though, so I don't know if it is compatible with the modularity and flexibility of zarr conventions as proposed in ZEP4. Are json schemas easily composable?
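On the composability question, a minimal sketch may help. JSON Schema has a standard `allOf` keyword that requires an instance to satisfy several subschemas at once, and objects accept unlisted properties by default, so independent convention schemas can be combined. The two schemas below and the `conforms` helper are hypothetical: the helper is hand-rolled, supports only the few keywords used here, and is not a substitute for a real validator library.

```python
import json

# Two hypothetical convention schemas, written as plain JSON Schema documents.
UNITS_SCHEMA = {
    "type": "object",
    "properties": {"units": {"type": "string"}},
    "required": ["units"],
}
CRS_SCHEMA = {
    "type": "object",
    "properties": {"crs": {"type": "string"}},
    "required": ["crs"],
}

# `allOf` is the standard composition keyword: every subschema must hold.
COMBINED_SCHEMA = {"allOf": [UNITS_SCHEMA, CRS_SCHEMA]}

# Only the two JSON types needed by this sketch.
_TYPES = {"object": dict, "string": str}


def conforms(instance, schema):
    """Check `instance` against the tiny keyword subset used above."""
    if "allOf" in schema:
        if not all(conforms(instance, sub) for sub in schema["allOf"]):
            return False
    expected = schema.get("type")
    if expected is not None and not isinstance(instance, _TYPES[expected]):
        return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for key, sub in schema.get("properties", {}).items():
            if key in instance and not conforms(instance[key], sub):
                return False
    return True


attrs = json.loads('{"units": "m", "crs": "EPSG:4326", "other": 1}')
print(conforms(attrs, COMBINED_SCHEMA))           # both conventions satisfied
print(conforms({"units": "m"}, COMBINED_SCHEMA))  # "crs" is missing
```

Because neither schema forbids extra keys, the unrelated `"other"` attribute does not interfere, which is the property that makes this kind of modular composition workable.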

benbovy, Sep 21 '23 07:09

JSON schema makes sense to me, and I have implemented some in Pydantic for a different project. However, it gets a bit ugly when you start using hyphens for key names and symbols for namespaces as proposed in ZEP0004.

Stuff like this: most programming languages do not allow hyphens in variable names, so such keys need aliases. Luckily Pydantic supports aliases, but I'm not sure what would happen in other languages. Parsing can be difficult, and the models also get very nested and confusing.

image

CoordinateUnits is a combination of DistanceUnits + more stuff.

image

DistanceUnit is a combination of Metric/Imperial length units etc. Below are the "allowed" imperial length units in v1 as an enum (Unit is a StrEnum with a few convenience methods).

image

Any thoughts? The example above allows JSON specification like this:

{"units-v1": {"distance": "ft"}}
// or
{"units-v1": {"angle": "rad"}}

At the end of the day you end up with a schema like this, which is nice, but the implementation makes me want to barf :)

image
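Since the screenshots are not reproduced here, the aliasing problem can be illustrated with a standard-library sketch. It is not the Pydantic code from the project above: the `ALIASES` table, the `UnitsV1` and `Attributes` classes, and `parse_attributes` are all hypothetical names, and the alias table stands in for what Pydantic field aliases would do.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from JSON keys (which may contain hyphens) to
# valid Python attribute names.
ALIASES = {"units-v1": "units_v1"}


@dataclass
class UnitsV1:
    # One optional field per unit namespace used in the JSON examples above.
    distance: Optional[str] = None
    angle: Optional[str] = None


@dataclass
class Attributes:
    units_v1: Optional[UnitsV1] = None


def parse_attributes(text: str) -> Attributes:
    """Parse attribute JSON, renaming hyphenated keys via ALIASES."""
    raw = json.loads(text)
    # Keys this sketch does not know about are silently dropped.
    renamed = {ALIASES[k]: v for k, v in raw.items() if k in ALIASES}
    units = renamed.get("units_v1")
    return Attributes(units_v1=UnitsV1(**units) if units is not None else None)


attrs = parse_attributes('{"units-v1": {"distance": "ft"}}')
print(attrs.units_v1.distance)
```

Every hyphenated key needs an entry in the alias table, which is the extra bookkeeping the comment above is complaining about.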

tasansal, Nov 07 '23 22:11

Just dropping in having seen the ZEP page https://zarr.dev/zeps/draft/ZEP0004.html - is there any advantage to the flexibility around keeping a convention's configuration inside or not inside its own object within the attributes? I think we could stand to be more opinionated here and require that the config is kept in its own sub-object: this avoids name collisions and keeps everything together. That would also become the obvious place to keep the convention version, rather than having to encode it in the name. It also makes the jsonschema marginally easier, as you only have to describe the convention config object rather than the whole attributes object containing the convention config.

Also this way, the zarr_conventions array could become an object of {"convention_name": {"version": "2", ...}}, so that it only needs to be defined once. This would also allow it to be promoted out of the attributes entirely, although I am going back and forth on that myself as it adds yet another place to look for metadata and doesn't fit with the adjacently-tagged enum convention in the rest of zarr.

Is there a strong argument in favour of allowing free-floating convention configuration exploded through the attributes object?
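To make the suggestion concrete, here is a rough sketch of what reading such a layout could look like. The attribute document, the convention names, and the `convention_config` helper are all made up for illustration; nothing here is taken from the ZEP text.

```python
import json

# Hypothetical attributes: `zarr_conventions` is an object keyed by convention
# name, and each value holds that convention's version plus its configuration.
attributes = json.loads("""
{
  "zarr_conventions": {
    "units": {"version": "2", "distance": "m"},
    "geo": {"version": "1", "crs": "EPSG:4326"}
  },
  "title": "example array"
}
""")


def convention_config(attrs, name):
    """Return (version, config) for one convention, or None if it is absent."""
    block = attrs.get("zarr_conventions", {}).get(name)
    if not isinstance(block, dict) or "version" not in block:
        return None
    # Everything except the version is treated as the convention's config.
    config = {key: value for key, value in block.items() if key != "version"}
    return block["version"], config


print(convention_config(attributes, "units"))  # ('2', {'distance': 'm'})
print(convention_config(attributes, "title"))  # None
```

With this shape a reader only has to look in one place per convention, and user attributes such as `title` cannot collide with convention keys.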

clbarnes, Feb 03 '24 15:02

I think we could stand to be more opinionated here and require that the config is kept in its own sub-object: this avoids name collisions and keeps everything together.

I'm 100% on board with this

Is there a strong argument in favour of allowing free-floating convention configuration exploded through the attributes object?

I'm not aware of any, but I am curious if anyone knows differently.

d-v-b, Feb 04 '24 14:02

Is there a strong argument in favour of allowing free-floating convention configuration exploded through the attributes object?

I think this sounds fine.

I would welcome explicit suggestions on the PR. I know that I have been very slow to move this forward. The space of possibilities feels vast. Specifically, @clbarnes - would you like to turn your suggestions into text on the ZEP? I would gladly incorporate that.

The same thing goes for folks who favor JSON schema. Please suggest language you would like to see in the ZEP.

rabernat, Feb 04 '24 19:02

I have a units convention defined in another open source project. With the current state of things, what's the best way to share this? It has a json schema with namespaces for different unit types:

(edited to be similar to the explicit convention suggestion by @yarikoptic). I really like the JSON schema idea because we can run validation against it. https://mdio-python.readthedocs.io/en/v1/data_models/version_1.html#mdio.schemas.v1.units.MDIOUnitsV1

(Expand units dropdown if it doesn't show up via hyperlink). If you press show json schema it'll show there too. It's all pydantic and pint based.

The way we can currently specify it is like this in the variable attributes.

Within array .zattrs

"units": {"density": "g/cm**3"},

Within group (?) .zattrs

"zarr_conventions": {
 "units": {
   "version": 1, 
   "homepage": "< reorged new link to rtd for convention >",
   "schema-url": "< maybe new repo with metadata conventions in json >"
   }
}

The ZEP is unclear on some aspects. Can we meet sometime to formalize the ZEP, freeze it, and start a concrete implementation? I have many use cases for this :)

Some Qs:

  1. Who maintains the convention? Zarr, individual domain projects, or both? ZEP0004 says it should be hosted on Zarr specs? What if we have a super domain-specific thing the overall Zarr community wouldn't care about?
  2. Where in zarr-specs would the above go? I don't see any current placeholders.
  3. Conventions can be set per array and per group. What is the expected behavior? Which one overrides which? Maybe we should allow conventions only at the root group level, so that they apply to the whole file?
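To illustrate what is at stake in the third question, here is a sketch of one possible rule, in which the declaration nearest to the array wins. This rule is purely an assumption for discussion, not something the ZEP specifies, and `effective_conventions` and the `declarations` mapping are hypothetical.

```python
def effective_conventions(declarations, path):
    """Merge `zarr_conventions` declared along `path`, deepest node winning.

    `declarations` maps a node path ("" is the root group) to the
    `zarr_conventions` object found in that node's attributes.
    """
    merged = {}
    parts = [part for part in path.split("/") if part]
    # Visit "", "a", "a/b", ... so that deeper declarations are applied last.
    for depth in range(len(parts) + 1):
        node = "/".join(parts[:depth])
        merged.update(declarations.get(node, {}))
    return merged


declarations = {
    "": {"units": {"version": 1}},              # declared on the root group
    "survey/depth": {"units": {"version": 2}},  # overridden on one array
}
print(effective_conventions(declarations, "survey/depth"))  # {'units': {'version': 2}}
print(effective_conventions(declarations, "survey/time"))   # {'units': {'version': 1}}
```

Restricting declarations to the root group would remove the need for any merge logic like this, at the cost of not being able to mix convention versions within one hierarchy.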

... and more

tasansal, Apr 15 '24 13:04

Why should conventional zarr hierarchies be responsible for expressing which conventions they adhere to? (This amounts to the question of why nominal, rather than structural, typing is the right solution here).

Also, how can this effort express conventions w.r.t the layout of arrays and groups in a hierarchy?

An alternative strategy is for Zarr hierarchy consumers to define the conventions they support, and to use the structure of Zarr hierarchies as the "signature" of those conventions. In this scenario, we would benefit from a common language for expressing a Zarr convention as a piece of data. Because the layout of a Zarr hierarchy is invariably part of the structure in-scope for a convention, we need a piece of data that can express the structure + attributes of a Zarr hierarchy. This is addressed by the zarr object models ZEP: https://github.com/zarr-developers/zeps/pull/46.

So, tl;dr, I don't see why we need to define a nominal type system for Zarr attributes, when we can do structural typing on the entire hierarchy (or parts of a hierarchy).
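A rough sketch of what structural typing could look like in practice follows. The nested-dict model of a hierarchy, the `SIGNATURE` format, and the `matches` function are all invented for this illustration and are not the format proposed in the object models ZEP.

```python
# A hierarchy modeled as nested dicts: groups carry "members", arrays carry "shape".
hierarchy = {
    "attributes": {},
    "members": {
        "image": {"attributes": {"units": "m"}, "shape": [512, 512]},
        "labels": {"attributes": {}, "shape": [512, 512]},
    },
}

# Hypothetical convention signature: the members that must exist, the attribute
# keys they must carry, and the number of dimensions each array must have.
SIGNATURE = {
    "members": {
        "image": {"attributes": ["units"], "ndim": 2},
        "labels": {"ndim": 2},
    }
}


def matches(node, signature):
    """Return True if `node` has at least the structure `signature` requires."""
    for key in signature.get("attributes", []):
        if key not in node.get("attributes", {}):
            return False
    if "ndim" in signature:
        shape = node.get("shape")
        if shape is None or len(shape) != signature["ndim"]:
            return False
    for name, sub in signature.get("members", {}).items():
        child = node.get("members", {}).get(name)
        if child is None or not matches(child, sub):
            return False
    return True


print(matches(hierarchy, SIGNATURE))
```

Here the hierarchy never declares which convention it follows; a consumer decides whether it conforms by inspecting its shape, which is the distinction between structural and nominal typing drawn above.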

d-v-b, Apr 15 '24 14:04