
Lessons to learn from STAC's extensibility

Open TomAugspurger opened this issue 1 year ago • 10 comments

As mentioned in https://github.com/zarr-developers/zarr-specs/pull/309, I ran across some challenges with how the Zarr v3 spec does extensions. I think that we might be able to learn some lessons from how STAC handles extensions.


tl;dr: I think Zarr would benefit from a better extension story that removes the need for involvement from anyone other than the extension author and any tooling wishing to use that extension. JSON schema + a zarr_extensions field on Group and Array would get us most of the way there. The current requirements of must_understand: false and name: URL in the extension objects feel like a weaker version of this.


How STAC does extensibility

STAC is a JSON-based format for cataloging geospatial assets. https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#overview lays out how STAC allows itself to be extended, but there are a few key components:

  1. STAC uses jsonschema to define schemas for both the core metadata and extensions.
  2. All STAC objects (Collection, Item, etc.) include a stac_version field.
  3. All STAC objects (Collection, Item) include a stac_extensions array with a list of URLs to JSON Schema definitions that can be used for validation.

Together, these are sufficient to allow extensions to extend basically any part of STAC without any involvement from the core of STAC. Tooling built around STAC coordinates through stac_extensions. For example, a validator can load the JSON schema definitions for the core metadata (using the stac_version field) and all extensions (using the URLs in stac_extensions) and validate a document against those schemas. Libraries wishing to use some feature can check for the presence of a specific stac_extensions URL.

You also get the ability to version things separately: the core metadata can be at 1.0.0 while the proj extension is at 2.0.0, without issue.
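That coordination mechanism can be sketched in a few lines of Python. This is illustrative only, not real STAC tooling: the schema "registry" dict stands in for fetching actual JSON schema documents from the listed URLs, and all the URLs and field names are made up.

```python
# Hypothetical registry mapping schema URLs (and a core-version key) to the
# fields each schema requires. A real validator would fetch and apply full
# JSON schema documents instead.
SCHEMAS = {
    "core/1.0.0": {"required": ["stac_version", "id"]},
    "https://example.com/proj/v2.0.0/schema.json": {"required": ["proj:shape"]},
}

def validate(doc):
    """Validate a document against the core schema plus every extension it lists."""
    errors = []
    # The core schema is selected by the document's own stac_version field.
    schemas = [SCHEMAS["core/" + doc["stac_version"]]]
    # Extension schemas are selected by the URLs the document itself advertises.
    for url in doc.get("stac_extensions", []):
        schemas.append(SCHEMAS[url])
    for schema in schemas:
        for field in schema["required"]:
            if field not in doc:
                errors.append("missing required field: " + field)
    return errors

doc = {
    "stac_version": "1.0.0",
    "id": "my-item",
    "proj:shape": [512, 512],
    "stac_extensions": ["https://example.com/proj/v2.0.0/schema.json"],
}
assert validate(doc) == []
```

The key point is that the document carries everything the validator needs: no central registry has to know the extension exists.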

How that might apply to Zarr

Two immediate reactions to the thought of applying that to Zarr:

  1. Zarr does have JSON documents for describing the metadata of nodes in a Zarr hierarchy. We could pretty easily take the same concepts and apply them more or less directly to the Group and Array definitions (and possibly other fields within; STAC does this as well for, e.g., Assets, which live inside an Item).
  2. STAC is entirely JSON-based, while much of Zarr concerns how binary blobs are stored, transformed, etc. While portions of these extension points might be configured (and validated by JSON schema) in the metadata document, much of it will lie outside.

How does this relate to what zarr has today?

I'm not sure. I was confused about some things reading https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points. The spec seems overly prescriptive about putting keys in the top level of the metadata:

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

STAC / JSON schema takes the opposite approach to their metadata documents. Any extra fields are allowed and ignored by default, but schemas (core or extension) can define required fields.
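To make the contrast concrete, here is a sketch of that permissive-by-default behavior. The `check` function is a deliberately minimal stand-in for a JSON schema validator (it only handles `required` and per-field types), illustrating that unknown fields pass by default while declared constraints are enforced.

```python
def check(doc, schema):
    """Minimal stand-in for a JSON schema validator: enforce 'required' and
    per-field types, and ignore everything else -- the default behavior when
    additionalProperties is not set to false."""
    for field in schema.get("required", []):
        if field not in doc:
            return False
    for field, expected_type in schema.get("properties", {}).items():
        if field in doc and not isinstance(doc[field], expected_type):
            return False
    return True

schema = {"required": ["zarr_format"], "properties": {"zarr_format": int}}

# Extra, unknown fields do not cause a failure...
assert check({"zarr_format": 3, "my_private_field": "ok"}, schema)
# ...but a missing required field does.
assert not check({"my_private_field": "ok"}, schema)
```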

Specifications for new extensions are recommended to be published in the zarr-developers/zarr-specs repository via the ZEP process. If a specification is published decentralized (e.g. for initial experimentation or due to a very specialized scope), it must use a URL in the name key of its metadata, which identifies the publishing organization or individual, and should point to the specification of the extension.

Having a central place to advertise extensions is great. But to me having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a zarr-extensions repository / organization for commonly used extensions / shared maintenance.

TomAugspurger avatar Oct 11 '24 02:10 TomAugspurger

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

Worth noting that the first and third sentences are blatantly contradictory! :upside_down_face:

Having a central place to advertise extensions is great. But to me having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a zarr-extensions repository / organization for commonly used extensions / shared maintenance.

:100: this sounds like a great idea. I think requiring a ZEP for every extension is a headache and the end result will be that nobody does it. I'd be happy adjusting #312 along the lines of a separate zarr-extensions repo if people generally think that's a good idea.

d-v-b avatar Oct 11 '24 10:10 d-v-b

Thanks for sharing this Tom. It has been great to have you spending time on Zarr recently and bringing a fresh perspective to long-standing discussions. FWIW, I'm on record in multiple conversations as citing STAC as a good example for Zarr to emulate.

I do think that Zarr, as an actual file format (as opposed to a catalog format) may need a somewhat more conservative attitude than STAC regarding backwards compatibility, interoperability etc. It must be very clear to data producers, for example, how to create data that will be widely readable for a long period of time without any need to update the metadata.

However, I agree that our current approach to extensions basically doesn't work and is effectively preventing development. It's not even possible for Zarr Python to reach feature parity with Zarr V2 without multiple non-existent extensions (e.g. strings)--let alone innovating in new directions. So I am fully in favor of what is proposed here.

One concept that may be very useful for Zarr is the notion of extension maturity: https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#extension-maturity. This would guide data providers on how "risky" it would be to adopt a specific extension. It could be seen as a more nuanced version of "must understand" true / false.

I think this concept would also make obsolete my stalled proposal for Zarr "conventions": #262.

I'm also strongly in favor of adopting JSON schema for metadata conformance validation.


What do we need to do to move this forward? I suppose we need a ZEP to propose an update to the spec to redefine how extensions work. 😵‍💫 I'd be happy to lead that effort if it would be helpful.

rabernat avatar Oct 11 '24 12:10 rabernat

I suppose we need a ZEP to propose an update to the spec to redefine how extensions work

Yeah, that's the sticking point. We need some way to break the current logjam.

Thinking a bit more, I guess the addition of a zarr_extensions array is only necessary if we also intend to use jsonschema for validating both the core metadata and extensions. I think the main thing to figure out is how the different fields that make up the final object are versioned (and potentially validated against a schema).

Take consolidated metadata as an example: regardless of whether zarr_extensions is used, you'll end up with a similar metadata document for a Group. For example, with zarr_extensions:

{
  "zarr_format": 3,
  // ...
  "consolidated_metadata": {
    "must_understand": false,
    "name": ...,
    ...
  },
  "zarr_extensions": ["https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json"]
}

Or without zarr_extensions, with the version of the consolidated metadata extension inlined:

{
  "zarr_format": 3,
  // ...
  "consolidated_metadata': {
    "must_understand": false,
    "version": "1.0.0",
    ...
  }
}

The advantage of zarr_extensions is a uniform way for tools to validate the contents of both core and extension metadata. Whether it's worth trying to introduce something like that at this stage of Zarr v3, I'm not sure.
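Either shape also gives tooling a cheap presence check. A sketch, mirroring the two hypothetical layouts above (the URL and field names are illustrative, not a real extension):

```python
# Hypothetical schema URL; in the zarr_extensions shape, the version is
# pinned by the URL itself rather than by an inlined "version" field.
CONSOLIDATED_URL = "https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json"

def consolidated_version(meta):
    """Return the consolidated-metadata extension version in use, or None."""
    if CONSOLIDATED_URL in meta.get("zarr_extensions", []):
        return "1.0.0"  # version comes from the advertised schema URL
    inline = meta.get("consolidated_metadata")
    if inline is not None:
        return inline.get("version")  # version is inlined in the object
    return None

assert consolidated_version({"zarr_format": 3}) is None
assert consolidated_version(
    {"zarr_format": 3, "zarr_extensions": [CONSOLIDATED_URL]}
) == "1.0.0"
```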

TomAugspurger avatar Oct 12 '24 20:10 TomAugspurger

Thanks for sharing this, @TomAugspurger. I went through STAC's extension README, and I like how they've decoupled the extensions from the core. The ability to work on extensions without the involvement of the core specification authors or, in our case, the ZSC/ZIC could prove useful.

Going back to conversations I had with @alimanfoo in 2022, I think Alistair envisioned something similar for extensions — the community working on their extensions unrestrictively.

I also like how the STAC extensions webpage neatly lists the extensions. We could work on a similar repository/organisation for authors who would like to host their extensions under zarr-developers while also having the option to host their extensions outside of zarr-developers GitHub.

We worked on the ZEP process when the Zarr community needed a mechanism to solicit feedback and move forward in a structured manner. It worked well and helped us to finalise two proposals (ZEP1 and ZEP2), but if it's proving to be a roadblock for further development, then we should make changes to it.

I'm curious to hear @joshmoore and @jakirkham's thoughts.


My thoughts on moving this forward: I have a PR, https://github.com/zarr-developers/zeps/pull/59, which will revise the existing ZEP process. Among other changes, my PR removes the requirement of a ZEP proposal for extensions. Please check and review. 🙏🏻

I'm also happy to write or collaborate with @rabernat on a ZEP proposal outlining the new process for extensions.

sanketverma1704 avatar Oct 17 '24 11:10 sanketverma1704

regardless of whether zarr_extensions is used, you'll end up with a similar metadata document for a Group.

I'm not particularly (at all) familiar with the design decisions of STAC so a question: what are the trade-offs of having the new JSON object (here: consolidated_metadata) at the top-level and not within the extensions object itself?

Assuming embedding it under something like "extensions" is viable, it occurs to me that we could resurrect that field (which was previously in v3) by making use of must_understand recursively. The field "extensions" would make use of the extension (no quotes) mechanism itself. Further extensions (if that's too confusing, then another name like plugins, etc.) could be embedded in that object. They in turn each have a "must_understand" field, and if ANY of those is true, then the top-level one is true as well.

Tom's example from above might look like this:

{
  "zarr_format": 3,
  "extensions": {
    "must_understand": true,
    "https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json": {
      "must_understand": false,
      "name": "..."
    },
    "https://github.com/zarr-extensions/something-else/schema.json": {
      "must_understand": true
    }
  }
}

(If multiple objects of the same extension are needed, then this could be a list of dicts rather than a dict)

The benefits would be:

  • we introduce a clear place for all extensions
  • we make use of the existing v3 must_understand logic to not break
  • No namespace collisions (e.g., from two extensions which define the same name)

joshmoore avatar Nov 13 '24 18:11 joshmoore

what are the trade-offs of having the new JSON object (here: consolidated_metadata) at the top-level and not within the extensions object itself?

In STAC, stac_extensions is an array (of URLs to jsonschema definitions), not an object.

Where in the document the fields defined by an extension go (top level or under extensions) doesn't matter from the point of view of json schema: you just need to ensure that the definition matches the usage.

Requiring that extensions place their additional fields under extensions only helps with namespace collisions between an extension's field and the core spec (including future versions of the spec). It doesn't help with collisions between extensions, at least not at the json schema level. You could require by convention that all extensions use a namespace, but that's just a convention.

TomAugspurger avatar Nov 13 '24 21:11 TomAugspurger

I agree that a separate extensions object doesn't necessarily help --- I argued against that previously because I don't see a strong benefit in distinguishing between what was in the first version of the core spec and what is added in subsequent versions.

I do think it is valuable to avoid name collisions --- but I think we can accomplish that by using suitable unambiguous names in the top level equally as well as using such names within a nested extensions object.

If the goal is to define and implement extensions without any central review, then to avoid collisions we should use a naming scheme for any top-level metadata fields added by extensions that avoids the possibility of collisions without relying on central review. The simplest solution is to use a domain name / URL prefix under the control of the extension author. For example, you could use:

{
  "zarr_format": 3,
  "https://github.com/TomAugspurger/consolidated-metadata": {
    "must_understand": false,
   ...
  }
}

or

{
  "zarr_format": 3,
  "github.com/TomAugspurger/consolidated-metadata": {
    "must_understand": false,
   ...
  }
}

Using https://github.com/zarr-extensions/... would imply at least the approval of whoever is managing that github organization. Maybe the barrier for that could be extremely low, e.g. first come, first serve. But it is probably simpler to avoid even that level of central review for extensions intended not to be centrally reviewed.
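Under such a scheme, a reader could separate core fields from extension fields purely by the shape of the key. A minimal sketch, assuming (purely for illustration) that any top-level key containing a "/" is an extension field:

```python
def split_fields(meta):
    """Partition top-level metadata keys into core fields and URL-style
    extension fields. The "/" heuristic is illustrative, not part of any spec."""
    core, extensions = {}, {}
    for key, value in meta.items():
        (extensions if "/" in key else core)[key] = value
    return core, extensions

meta = {
    "zarr_format": 3,
    "github.com/TomAugspurger/consolidated-metadata": {"must_understand": False},
}
core, extensions = split_fields(meta)
assert list(core) == ["zarr_format"]
assert list(extensions) == ["github.com/TomAugspurger/consolidated-metadata"]
```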

jbms avatar Nov 14 '24 05:11 jbms

FWIW, name collisions haven't been a problem in STAC. The convention to include a prefix in your newly defined keys (proj:shape, for the shape field defined by the projection extension) is widely followed.

TomAugspurger avatar Nov 14 '24 14:11 TomAugspurger

I have documented problems that I see with the current registered attributes approach and why I think it would be better to adopt STAC's mechanism. I hope the folks at the Zarr Summit will be interested in discussing IRL, but I also welcome comments online.

Based on early experience with the current system (particularly zarr-developers/zarr-extensions#21), I believe it concentrates too much power in the @zarr-developers/steering-council without commensurate benefits.

I've also documented specific technical issues around maturity classification and namespace collisions that will become harder to address as adoption grows. I think pivoting now would save significant pain later.

cc @emmanuelmathot @vincentsarago @jsignell @sharkinsspatial


Claude was used to improve the document

maxrjones avatar Oct 13 '25 22:10 maxrjones

@maxrjones Here are some responses to your comments:

What I originally had in mind with the registered attributes proposal is that at least most registered names would include a clear prefix over which the proposer has reasonable authority (e.g. name of their organization) which would:

(1) make conflicts with existing uses unlikely, and also make it unlikely that someone would unintentionally start using it in the future without knowing about the extension; (2) avoid the need for the steering council to consider whether the name should be reserved for a "better" proposal in the future.

I also imagined that some consistent syntax would be used to allow registered attributes to be easily identified as such. However, we never seem to have decided on such a syntax.

In this case, the steering council just needs to verify that some basic requirements are met, and that the proposer has reasonable authority over the prefix; the actual design of the specification isn't relevant.

For a name without a prefix, or where the proposer does not necessarily have authority over the prefix, it becomes more necessary for the steering council to consider the design itself, especially if the name seems to be a "high value name" that could be useful for something else. In that case there is both a greater burden on the proposer, and there may be limited time available to review such proposals. Therefore it would be best to try to avoid such proposals.

One conflict is that there seems to be a desire to standardize existing unprefixed attributes like _FillValue.

I think one other point of confusion is that someone may have different intents in posting a PR to zarr-extensions:

  • The proposer may already have it implemented and be using it, especially with a clearly non-conflicting prefix, and just want it to get accepted as quickly as possible without any changes to the specification.
  • The proposer may be proposing it specifically to get design feedback.

Regarding your license concern: As far as I understand, the license only applies to the content added to the zarr-extensions repo itself, which may merely be a very brief description and a link to the actual specification hosted elsewhere. I don't think that imposes any significant burden even for attributes intended only for proprietary use.

If the desire is to have a fully decentralized model based on URLs to specify a JSON schema, you could just use the URL itself as the attribute name.

Regarding discoverability:

No URLs in metadata for finding specifications, so a library must first look through the Zarr extensions repository to discover the options

I would expect in most cases an implementation would just support a fixed set of registered attributes. Therefore, discoverability would in most cases not be a concern that implementations directly need to handle in a programmatic way. But in any case every attribute has a corresponding location within the zarr-extensions repo (or rather, will have one once we determine the appropriate naming convention), so you can just check if there is a registered attribute by that name. For implementations that just e.g. validate the JSON schema or otherwise operate in a generic way just based on the JSON schema, they could indeed just programmatically attempt to fetch the JSON schema from the zarr-extensions repo.
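The fixed-set behavior described here, combined with the must_understand escape hatch, can be sketched as follows. The set of known fields is illustrative, not the actual v3 field list:

```python
# Illustrative set of fields this hypothetical implementation understands.
KNOWN_FIELDS = {"zarr_format", "node_type", "attributes"}

def can_open(meta):
    """Fail-safe reading rule: accept known fields, and accept unknown fields
    only when they explicitly opt out via "must_understand": false."""
    for key, value in meta.items():
        if key in KNOWN_FIELDS:
            continue
        if isinstance(value, dict) and value.get("must_understand") is False:
            continue  # unknown, but explicitly safe to ignore
        return False  # unknown field that we are required to understand
    return True

assert can_open({"zarr_format": 3, "ext": {"must_understand": False}})
assert not can_open({"zarr_format": 3, "ext": {"must_understand": True}})
```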

jbms avatar Oct 13 '25 23:10 jbms

Thanks for the feedback @jbms, and apologies for not responding to your comments earlier. Would you be available and willing to chat sometime about these design decisions? I've put Zarr conventions co-working time on the Zarr community calendar for this week, or would be glad to set up a separate time. I know you have a ton of valuable insight, and I think a call may be the quickest way to fully understand your points.

maxrjones avatar Nov 19 '25 01:11 maxrjones