zarr-specs

Scope?

Open alimanfoo opened this issue 7 years ago • 7 comments

There are at least three types of spec that could live here:

  1. The zarr storage spec. There have been two versions (1 and 2). Content currently lives in the zarr python repo and is published on RTFD within the zarr python docs.

  2. The zarr codec registry. Currently this is undocumented, and is effectively defined as the set of codecs implemented in the zarr-developers/numcodecs repository, which serves as the reference implementation. However, the codec registry could (should?) be documented independently of the numcodecs implementation, and have a community process for registering new codecs.

  3. Community extensions/conventions. E.g., the set of conventions supported by xarray to implement shared dimensions, or the set of conventions that ultimately becomes NCZarr.

Should they all live here, or should some live elsewhere?
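To make (2) concrete, here is a minimal sketch of what "a codec registry" amounts to in code: a mapping from a codec identifier to an implementation, plus a way to rebuild a codec from the JSON-style configuration stored in array metadata. The function names are loosely modeled on numcodecs, but everything below (`_CODEC_REGISTRY`, `register_codec`, `get_codec`, the `Zlib` class) is a standalone illustration using only the standard library, not the numcodecs code itself.

```python
import json
import zlib

# Hypothetical in-memory registry: codec identifier -> codec class.
_CODEC_REGISTRY = {}


def register_codec(cls):
    """Register a codec class under its `codec_id` attribute."""
    _CODEC_REGISTRY[cls.codec_id] = cls
    return cls


def get_codec(config):
    """Build a codec from a config dict such as {"id": "zlib", "level": 1}."""
    params = dict(config)  # copy so the caller's dict is not modified
    codec_id = params.pop("id")
    try:
        cls = _CODEC_REGISTRY[codec_id]
    except KeyError:
        raise ValueError(f"codec not registered: {codec_id!r}") from None
    return cls(**params)


@register_codec
class Zlib:
    """Illustrative codec wrapping the standard library's zlib module."""

    codec_id = "zlib"

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        return zlib.compress(buf, self.level)

    def decode(self, buf):
        return zlib.decompress(buf)

    def get_config(self):
        # The config is what would be written into array metadata.
        return {"id": self.codec_id, "level": self.level}


# Round trip: the config read from metadata is enough to recover the codec.
codec = get_codec(json.loads('{"id": "zlib", "level": 5}'))
data = b"zarr" * 100
assert codec.decode(codec.encode(data)) == data
```

A documented registry would pin down the identifier and the allowed configuration keys for each codec, so that any implementation could play the role that `Zlib` plays here.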

alimanfoo, Jan 15 '19 21:01

My 2c, the storage spec (1) seems like the main focus at the moment, so we could start by migrating that here. We could then consider adding a codec registry (2). Not sure about community extensions/conventions.

alimanfoo, Jan 15 '19 22:01

> Not sure about community extensions/conventions.

For this, I think there are multiple levels as well:

  • extensions: if the goal is to have only the core aspects in the spec, then there will be important ("trusted"?) extensions which likely could and possibly should live here close to the spec.
  • community extensions: but at the same time, it shouldn't be necessary to pass through PR review on this repo to add an extension. Perhaps there could be a list/registry of community extensions that haven't met that hurdle.
  • external extensions (is there a better name?): Then of course, there's everything else which can and should happen as it pleases.

joshmoore, Jan 16 '19 11:01

On the subject of extensions/conventions, there are also domain-specific extensions (e.g., we have our own conventions for how we store genome variation data in zarr) as well as general-purpose extensions (e.g., how to represent coordinates and dimensions).

alimanfoo, Jan 16 '19 11:01

Very good point. From our side, I could see us participating in a domain-neutral "image" spec, either in this repo or as a community spec, and then, if we need to specify further for a particular domain, doing that as a community spec or externally.

joshmoore, Jan 16 '19 12:01

And then there are perhaps also "conventions": https://github.com/zarr-developers/zarr/issues/280#issuecomment-406986121

joshmoore, Jan 16 '19 13:01

One other relevant comment that has come up in discussion: currently the zarr storage spec describes how to store data in any system that can expose a key-value style interface, but does not provide any specification of concrete storage implementations, e.g., on file systems or cloud object storage.

The spec shows in examples at the end how the abstract key/value storage interface could be implemented in a variety of ways, e.g., by treating keys as file system paths and storing data as files on a file system, or by treating keys as paths within a zip file. The zarr python package has a number of other store implementations, e.g., storage in key-value databases (bdb, lmdb) or relational databases (sqlite3), and other packages like gcsfs and s3fs have store implementations for cloud object storage that can be used with zarr.

However, there are no normative specifications of any of these storage implementations. This means for anyone looking to standardise on a concrete file format, or on a concrete layout of data in cloud object storage, there is no spec to refer to.

So perhaps something like the current storage spec should remain as a specification of everything that happens above an abstract key/value API, and then there should be a collection of separate specs which describe concrete storage implementations, i.e., that translate the abstract key/value requests into concrete operations like writing and reading files or whatever.
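As a rough illustration of that split, the sketch below writes the same keys through two stores: a plain `dict` and a hypothetical `DirectoryStore` that treats each key as a relative file path. The class, its methods, and the `write_chunks` helper are assumptions made up for this example, not the zarr python implementation; the point is only that `write_chunks` sits above the key/value API and never touches the file system itself.

```python
import os
import tempfile
from collections.abc import MutableMapping


class DirectoryStore(MutableMapping):
    """Hypothetical concrete store: each key is a relative path under `root`."""

    def __init__(self, root):
        self.root = root

    def _path(self, key):
        # Keys use "/" as separator regardless of the host file system.
        return os.path.join(self.root, *key.split("/"))

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        path = self._path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(value)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        # Walk the directory tree and turn file paths back into keys.
        for dirpath, _, filenames in os.walk(self.root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), self.root)
                yield rel.replace(os.sep, "/")

    def __len__(self):
        return sum(1 for _ in self)


def write_chunks(store, prefix, chunks):
    """Code above the key/value API: it only sets keys to byte values."""
    for index, payload in chunks.items():
        store[f"{prefix}/{index}"] = payload


with tempfile.TemporaryDirectory() as tmp:
    for store in ({}, DirectoryStore(tmp)):
        write_chunks(store, "foo", {"0.0": b"\x00" * 8, "0.1": b"\x01" * 8})
        assert sorted(store) == ["foo/0.0", "foo/0.1"]
        assert store["foo/0.1"] == b"\x01" * 8
```

A normative spec for a concrete store would cover exactly the choices `DirectoryStore` makes silently here, such as how keys map to paths and how a missing key is represented.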

Not sure I've found the right terminology yet to describe this separation of concerns, but hopefully the gist is clear.

alimanfoo, Jan 18 '19 00:01

A couple of thoughts, though these were mentioned at the end of the call as well.

  1. Composability of specs (how do we mix different components)
  2. Absorbing common spec changes upstream (how to make common features generally available)

Thinking about these to make sure that specs represent a small set of relevant "dimensions" that are fairly different from each other.

jakirkham, Jan 19 '19 02:01