[RFC] Multiple collections in metadata
The spec currently describes how to store a single collection in the metadata: https://github.com/stac-utils/stac-geoparquet/blob/121f6485fd188179f1c05d53ec26c69d2518c683/spec/stac-geoparquet-spec.md?plain=1#L72-L77
As discussed in https://github.com/stac-utils/stac-rs/issues/428, there's at least one other use case for stac-geoparquet: the output of a STAC API search. In that case, you might have multiple collections represented. While their schemas may vary widely, they might not if the fields extension is used[^1].
Proposal
An optional stac:collections metadata field that is a JSON string describing a ~list of collections, just like returned from the collections endpoint of a STAC API~ mapping of collections, with collection.id as the key (updated per comments below):
{
"stac:collections": "{\"collection-a\":{\"type\":\"Collection\",\"id\": \"collection-a\", ...},\"collection-b\":{\"type\":\"Collection\",\"id\": \"collection-b\", ...}}"
}
It would be an error (invalid stac-geoparquet) if both stac:collection and stac:collections are set.
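A minimal sketch of how a reader might apply this rule, assuming the file-level key/value metadata has already been decoded into a plain Python dict of strings. The function name parse_collections is hypothetical, and treating stac:collection as a JSON string is an assumption made for symmetry with the proposed field:

```python
import json


def parse_collections(metadata: dict) -> dict:
    """Return a mapping of collection id -> collection dict.

    `metadata` is assumed to be the file-level key/value metadata,
    already decoded to str keys and str values.
    """
    has_single = "stac:collection" in metadata
    has_multi = "stac:collections" in metadata

    # The proposal treats a file with both keys as invalid.
    if has_single and has_multi:
        raise ValueError(
            "stac:collection and stac:collections must not both be set"
        )

    if has_multi:
        collections = json.loads(metadata["stac:collections"])
        if not isinstance(collections, dict):
            raise ValueError("stac:collections must be a JSON object")
        # Each key is expected to equal the id of the collection it maps to.
        for key, collection in collections.items():
            if collection.get("id") != key:
                raise ValueError(
                    f"key {key!r} does not match collection id "
                    f"{collection.get('id')!r}"
                )
        return collections

    if has_single:
        # Assumption: the single collection is also stored as a JSON string.
        collection = json.loads(metadata["stac:collection"])
        return {collection["id"]: collection}

    # Both fields are optional, so a file may carry neither.
    return {}
```

Normalizing the single-collection case to the same mapping shape means a client can look up an item's collection the same way regardless of which key the file carries.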
Alternatives and options
- ~stac:collections could be an object, not a list, with the keys being collection ids. Upside: easier to get the collection you want for a given item. Downside: different than how STAC API does it~ Moved ☝ to the proposal after comments
- We could deprecate stac:collection and remove it in a stac-geoparquet v2.0
[^1]: The fields extension works nicely with columnar data stores (such as parquet) and so we will be advocating its usage more as we push adoption of stac-geoparquet
Thanks for the suggestion. Generally, this sounds like a good proposal. I think the main thing to hash out is how to handle the existing stac:collection field.
- Keeping both isn't super elegant, but maybe is OK if we give good guidance to clients about how to handle them (though I'm not sure what to recommend. And we open up the potential for conflicting metadata between stac:collection and stac:collections).
- What does deprecating something in the spec mean? I guess JSON schema has the concept of a deprecated field (https://json-schema.org/understanding-json-schema/reference/annotations). And we could instruct clients to migrate to the new stac:collections field.
- Re: object vs. array for stac:collections, I think using an array and following the STAC API here is for the best.
Overall, I'd suggest introducing stac:collections as a new, optional field in a minor update to the spec. Then we can follow up with a decision on if / how to deprecate stac:collection once we see it in use.
Edit: I missed a maybe important point: that you suggest erroring if both stac:collection and stac:collections are set. I'll have to think that through. It removes any issues with clients having to search both and pick between them, and it removes the potential for metadata conflict between the two. But does it make migration harder? I guess not... if you have a stac-geoparquet dataset with multiple collections, you already aren't able to use stac:collection. So yeah, disallowing both is probably for the best.
I actually think that returning an object with collection name = key would be far more useful. The entire reason that you want to have the collections data as part of the return is to have easy access to look up metadata from a collection for a given item. I don't see any reason that we need to/should stick to the collections search return format.
Does anyone know why the STAC API decided to use a list instead of a dict for collections?
If we go with a dict then we automatically prevent duplicates in stac:collections (which is a good thing IMO). We would need to decide if the keys are meaningful, and if so what they're allowed to be. Probably we just require that the key matches ~collection.name~ collection.id?
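As a rough sketch of the writer side, here is one way a list of collections (for example, gathered alongside a search result) could be turned into the keyed form while rejecting duplicate ids. The function name collections_to_metadata is hypothetical and not part of any existing library:

```python
import json


def collections_to_metadata(collections: list) -> dict:
    """Build the proposed stac:collections metadata entry.

    `collections` is a list of collection dicts; the result maps the
    metadata key to a JSON string keyed by collection id.
    """
    by_id = {}
    for collection in collections:
        collection_id = collection["id"]
        # A dict cannot hold two entries for one key, so fail loudly
        # instead of silently overwriting the earlier collection.
        if collection_id in by_id:
            raise ValueError(f"duplicate collection id: {collection_id!r}")
        by_id[collection_id] = collection
    return {"stac:collections": json.dumps(by_id)}
```

Raising on a duplicate id, rather than keeping the last one seen, is a design choice for this sketch; the thread does not say how a writer should handle that case.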
@TomAugspurger I think this is a stac-fastapi (i.e. implementation) specific addition to the response, that's not defined in STAC API afaik.
From https://github.com/radiantearth/stac-api-spec/tree/release/v1.0.0/ogcapi-features#endpoints, the /collections endpoint is described as:
Object containing an array of Collection objects in the Catalog, and Link relations
Probably we just require that the key matches collection.name?
Why not collection.id?
Because my STAC is getting rusty and I forgot that collection.id is a thing 😆. I'll edit my suggestion to be collection.id since that's obviously better.
From https://github.com/radiantearth/stac-api-spec/tree/release/v1.0.0/ogcapi-features#endpoints, the /collections endpoint is described as:
Weren't we talking about the search endpoint here? @gadomski
Why would you need a list of collections in the first place for results of a collection-specific item list?
I mean it is obvious why /collections uses an array instead of an object: An object isn't ordered, so sorting wouldn't work at all.
@m-mohr The use case for this is for the search endpoint that may contain any number of collections for the items it returns, but this is for the spec for stac-geoparquet which could be the result of a collection-specific item list or a possibly multi-collection search result, so it needs to be able to be a container for either.
And yes, /collections uses an array due to the ordering issue, while within a stac-geoparquet, ordering isn't important, it's just being able to access that metadata which is why I think an object makes much more sense than an array.
Okay. I agree with the object proposal.
OK, so object it is.
And do people have a preference for deprecating stac:collection vs. having them both in the spec indefinitely (with the requirement that they aren't both set in the same file)?
Late to the party, but I have a strong preference for deprecating stac:collection. Allowing stac:collection or stac:collections would imply that the metadata keys depend on the contents of the catalog. So if you did the same request to two different catalogs you could get different metadata keys. That feels icky to me.
I agree, this spec is still in its relatively early days, so let's make it clean right from the start. We always regretted keeping old stuff from the early days in STAC when we were afraid to break things (looking e.g. at you eo/raster:bands). It's much more difficult to clean up in a couple of years.
Since we haven't technically "released" anything yet, and I haven't seen too many implementations in the wild, would it make sense to remove stac:collection altogether before tagging a v1.0.0?