[RFC] Multiple collections in metadata

Open gadomski opened this issue 9 months ago • 14 comments

The spec currently describes how to store a single collection in the metadata: https://github.com/stac-utils/stac-geoparquet/blob/121f6485fd188179f1c05d53ec26c69d2518c683/spec/stac-geoparquet-spec.md?plain=1#L72-L77

As discussed in https://github.com/stac-utils/stac-rs/issues/428, there's at least one other use case for stac-geoparquet: the output of a STAC API search. In that case, you might have multiple collections represented. While they may have widely varying schemas, they might not if the fields extension is used[^1].

Proposal

An optional stac:collections metadata field that is a JSON string describing a ~list of collections, just like returned from the collections endpoint of a STAC API~ mapping of collections, with collection.id as the key (updated per comments below):

{
  "stac:collections": "{\"collection-a\":{\"type\":\"Collection\",\"id\": \"collection-a\", ...},\"collection-b\":{\"type\":\"Collection\",\"id\": \"collection-b\", ...}}"
}

It would be an error (invalid stac-geoparquet) if both stac:collection and stac:collections are set.
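
A minimal sketch of the proposal, assuming the key/value metadata is handled as a plain Python dict of strings. The helper names `build_collections_metadata` and `check_collection_keys` are hypothetical and not part of any existing library:

```python
import json


def build_collections_metadata(collections):
    """Serialize collection dicts into a stac:collections metadata entry.

    The value is a JSON string holding a mapping keyed by collection id.
    """
    mapping = {collection["id"]: collection for collection in collections}
    return {"stac:collections": json.dumps(mapping)}


def check_collection_keys(metadata):
    """Reject metadata that sets both the singular and the plural key."""
    if "stac:collection" in metadata and "stac:collections" in metadata:
        raise ValueError(
            "invalid stac-geoparquet: stac:collection and "
            "stac:collections must not both be set"
        )


collections = [
    {"type": "Collection", "id": "collection-a"},
    {"type": "Collection", "id": "collection-b"},
]
metadata = build_collections_metadata(collections)
check_collection_keys(metadata)  # passes: only the plural key is present

# Readers decode the JSON string back into the id-keyed mapping.
decoded = json.loads(metadata["stac:collections"])
```

The collection dicts here are trimmed to `type` and `id` for brevity; real collections carry many more fields.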

Alternatives and options

  • ~stac:collections could be an object, not a list, with the keys being collection ids. Upside: easier to get the collection you want for a given item. Downside: different than how STAC API does it~ Moved ☝ to the proposal after comments
  • We could deprecate stac:collection and remove it in a stac-geoparquet v2.0

[^1]: The fields extension works nicely with columnar data stores (such as parquet) and so we will be advocating its usage more as we push adoption of stac-geoparquet

gadomski avatar Mar 04 '25 12:03 gadomski

Thanks for the suggestion. Generally, this sounds like a good proposal. I think the main thing to hash out is how to handle the existing stac:collection field.

  1. Keeping both isn't super elegant, but maybe is OK if we give good guidance to clients about how to handle them (though I'm not sure what to recommend, and we open up the potential for conflicting metadata between stac:collection and stac:collections).
  2. What does deprecating something in the spec mean? I guess JSON schema has the concept of a deprecated field (https://json-schema.org/understanding-json-schema/reference/annotations). And we could instruct clients to migrate to the new stac:collections field.
  3. Re: object vs. array for stac:collections, I think using an array and following the STAC API here is for the best.

Overall, I'd suggest introducing stac:collections as a new, optional field in a minor update to the spec. Then we can follow up with a decision on if / how to deprecate stac:collection once we see it in use.

Edit: I missed a maybe important point: that you suggest erroring if both stac:collection and stac:collections are set. I'll have to think that through. It removes any issues with clients having to search both and pick between them, and the potential for metadata conflict between the two. But does it make migration harder? I guess not... if you have a stac-geoparquet dataset with multiple collections, you already aren't able to use stac:collection. So yeah, disallowing both is probably for the best.
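
To illustrate why migration looks cheap, here is a rough sketch of moving a file's metadata from the singular key to the plural one. It assumes both values are JSON strings and that the metadata is available as a plain dict; `migrate_metadata` is a hypothetical helper, not an existing API:

```python
import json


def migrate_metadata(metadata):
    """Return a copy of the metadata that only uses stac:collections."""
    if "stac:collection" not in metadata:
        # Nothing to migrate; hand back an unchanged copy.
        return dict(metadata)
    if "stac:collections" in metadata:
        raise ValueError("stac:collection and stac:collections are both set")

    # Copy everything except the legacy key.
    migrated = {k: v for k, v in metadata.items() if k != "stac:collection"}
    # Wrap the single collection in a one-entry mapping keyed by its id.
    collection = json.loads(metadata["stac:collection"])
    migrated["stac:collections"] = json.dumps({collection["id"]: collection})
    return migrated


legacy = {
    "stac:collection": json.dumps({"type": "Collection", "id": "collection-a"})
}
migrated = migrate_metadata(legacy)
```

A single-collection file becomes a one-entry mapping, so clients would only ever need to read one key.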

TomAugspurger avatar Mar 04 '25 14:03 TomAugspurger

I actually think that returning an object with collection name = key would be far more useful. The entire reason that you want to have the collections data as part of the return is to have easy access to look up metadata from a collection for a given item. I don't see any reason that we need to/should stick to the collections search return format.

bitner avatar Mar 04 '25 14:03 bitner

Does anyone know why the STAC API decided to use a list instead of a dict for collections?

If we go with a dict then we automatically prevent duplicates in stac:collections (which is a good thing IMO). We would need to decide if the keys are meaningful, and if so what they're allowed to be. Probably we just require that the key matches ~collection.name~ collection.id?
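
A small sketch of what that requirement could look like in practice, assuming collections are plain dicts. Both function names are hypothetical:

```python
def collections_to_mapping(collections):
    """Turn a list of collections into an id-keyed dict, rejecting duplicates."""
    mapping = {}
    for collection in collections:
        collection_id = collection["id"]
        if collection_id in mapping:
            raise ValueError(f"duplicate collection id: {collection_id!r}")
        mapping[collection_id] = collection
    return mapping


def validate_mapping(mapping):
    """Check that every key matches the id of the collection it points to."""
    for key, collection in mapping.items():
        if collection.get("id") != key:
            raise ValueError(
                f"key {key!r} does not match collection id {collection.get('id')!r}"
            )


mapping = collections_to_mapping(
    [
        {"type": "Collection", "id": "collection-a"},
        {"type": "Collection", "id": "collection-b"},
    ]
)
validate_mapping(mapping)  # passes: each key equals its collection's id
```

The duplicate check only matters when converting from a list, such as a /collections response; once the data is a dict, uniqueness of keys comes for free.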

TomAugspurger avatar Mar 04 '25 15:03 TomAugspurger

@TomAugspurger I think this is a stac-fastapi (i.e. implementation) specific addition to the response, that's not defined in STAC API afaik.

m-mohr avatar Mar 04 '25 15:03 m-mohr

From https://github.com/radiantearth/stac-api-spec/tree/release/v1.0.0/ogcapi-features#endpoints, the /collections endpoint is described as:

> Object containing an array of Collection objects in the Catalog, and Link relations

gadomski avatar Mar 04 '25 15:03 gadomski

> Probably we just require that the key matches collection.name?

Why not collection.id?

gadomski avatar Mar 04 '25 15:03 gadomski

Because my STAC is getting rusty and I forgot that collection.id is a thing 😆. I'll edit my suggestion to be collection.id since that's obviously better.

TomAugspurger avatar Mar 04 '25 16:03 TomAugspurger

> From https://github.com/radiantearth/stac-api-spec/tree/release/v1.0.0/ogcapi-features#endpoints, the /collections endpoint is described as:

Weren't we talking about the search endpoint here? @gadomski

Why would you need a list of collections in the first place for results of a collection-specific item list?

I mean it is obvious why /collections uses an array instead of an object: An object isn't ordered, so sorting wouldn't work at all.

m-mohr avatar Mar 04 '25 16:03 m-mohr

@m-mohr The use case is the search endpoint, which may return items from any number of collections. But this is the spec for stac-geoparquet, and a file could hold either a collection-specific item list or a multi-collection search result, so the metadata needs to be able to describe either.

And yes, /collections uses an array because of the ordering issue. Within a stac-geoparquet file, ordering isn't important; what matters is being able to access that metadata, which is why I think an object makes much more sense than an array.
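
To make the access pattern concrete, here is a hedged sketch of looking up the collection for an item, assuming the metadata is a plain dict and the item is a dict with the standard STAC `collection` field. `collection_for_item` is a hypothetical helper:

```python
import json


def collection_for_item(item, metadata):
    """Return the collection dict for an item, or None if it isn't listed."""
    collections = json.loads(metadata["stac:collections"])
    # With an id-keyed object this is a direct lookup; with an array the
    # reader would have to scan every entry and compare ids.
    return collections.get(item["collection"])


metadata = {
    "stac:collections": json.dumps(
        {
            "collection-a": {"type": "Collection", "id": "collection-a"},
            "collection-b": {"type": "Collection", "id": "collection-b"},
        }
    )
}
item = {"type": "Feature", "id": "item-1", "collection": "collection-b"}
found = collection_for_item(item, metadata)
```

In a real reader the JSON string would be decoded once and reused, rather than parsed on every lookup as in this sketch.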

bitner avatar Mar 04 '25 16:03 bitner

Okay. I agree with the object proposal.

m-mohr avatar Mar 04 '25 17:03 m-mohr

OK, so object it is.

And do people have a preference for deprecating stac:collection vs. having them both in the spec indefinitely (with the requirement that they aren't both set in the same file)?

TomAugspurger avatar Mar 05 '25 02:03 TomAugspurger

Late to the party, but I have a strong preference for deprecating stac:collection. Allowing stac:collection or stac:collections would imply that the metadata keys depend on the contents of the catalog. So if you did the same request to two different catalogs you could get different metadata keys. That feels icky to me.

jsignell avatar Mar 11 '25 13:03 jsignell

I agree, this spec is still in its relatively early days, so let's make it clean right from the start. We always regretted keeping old stuff from the early days in STAC when we were afraid to break things (looking at you, e.g., eo/raster:bands). It's much more difficult to clean up in a couple of years.

m-mohr avatar Mar 11 '25 15:03 m-mohr

Since we haven't technically "released" anything yet, and I haven't seen too many implementations in the wild, would it make sense to remove stac:collection altogether before tagging a v1.0.0?

gadomski avatar Mar 11 '25 15:03 gadomski