stac-api-spec
Collection Search
This is a feature-request / planning issue for adding collection-level search to the STAC API spec, if it's deemed in scope. This would mirror item-level search.
I'll leave the detailed description of what this would look like to people who are more experienced with STAC and OGC API - Features, but I want to share a couple of use cases to help drive the discussion.
- Some datasets might have collection-level assets and no items (e.g. Zarr / NetCDF datasets) and might wish to expose some kind of search over many related datasets. For example, https://esgf-index1.ceda.ac.uk/search/cmip6-ceda/ shows a page where users can search over various properties of a collection of models (Institution ID, experiment ID, resolution, etc.)
- You might have a collection of related products (e.g. MODIS on GEE), and want to find the collection with some specific property (e.g. snow cover).
- In the chat @duckontheweb mentioned cataloging ML training datasets, and the ability to find collections by keyword and spatial extent. For ML training, users likely want the whole training dataset rather than individual items.
Yeah, this is definitely in scope, and has been talked about a number of times. I think the current thought is to try to get the API spec to 1.0.0 as the priority. But if people are able to work on it and there's a solid proposal then we could try to squeeze it in for 1.0.0.
My main thinking on it has always been that 'collection search' is clearly the domain of OGC CSW, and its latest incarnation is OGC API - Records, so I've personally been waiting on that, and trying to nudge it in the right direction.
I think they've been aiming for a full standard with every detail specified. I've been interested in making a stripped-down version that just takes the core fields they've articulated in https://github.com/opengeospatial/ogcapi-records/blob/master/core/standard/clause_7_core.adoc#response-5 (scroll down to table 8), makes the GeoJSON representations clear, and also adds an 'OGC Collection Metadata' mini-spec that just says how those fields go into an OGC Collection. Then the STAC endpoint for collection search would just be an endpoint (either a special Features API collection, collections/records/, or a collection-search/ to mirror item-search) that has all the same params as the other endpoints.
I think I may have some scope to write up the 'simple' OGC content spec as part of my OGC Fellowship, and it'd be great if people started experimenting with it in STAC.
To record a case from the biweekly STAC mtg this morning:
For ML use cases, recording which type of data a collection is, e.g., source, training, testing, production, etc. A UI may then want to select only one of these to display, e.g., the production runs.
This is requested quite frequently and we discuss it often, but no one has had the time yet to spec it out. I'm wondering whether we should spend some PSC money to have someone work on this specifically? cc @cholmes and other PSC members
Related issue in OGC API - Commons: https://github.com/opengeospatial/ogcapi-common/issues/69
I totally agree on the need to have a search functionality for Collections. I think that the use cases described by @TomAugspurger are pretty common.
In particular from my side I would want to be able to search collections by:
- simple text search (e.g. find all collections containing the text '10m resolution' in title, description, keywords or other fields)
- 1 or more keywords
- 1 or more providers (e.g. give me all the collections produced by ESA)
- spatial extent (e.g. give me all the collections concerning a specific zone)
- temporal extent (e.g. give me all the collections covering a specific year)
- assets (e.g. give me all collections having an asset with role 'thumbnail')
- versions (e.g. give me the latest version of a specific collection)
- 1 or more licences
The USGS search page shows, through its GUI, some nice possibilities for searching their catalogue (mainly made of collections). Once the collection APIs are defined, they could be used in the STAC Browser for filtering existing collections. In our specific case, for example, we've got about 400 collections and users have no way to find the ones most relevant to them. They can only browse the structure, looking for the catalogues most likely to contain the collections they want, and hope to find them.
In general, when writing the specification for the collection APIs, I would follow the same approach used for the Item Search APIs, with query parameters that closely resemble queries made through the Elasticsearch APIs (with sortFields, filter, fields).
What is the status of this issue? Are there any extensions being worked on for this? We are planning to implement something of this nature.
No one has had the time yet to really work on this. It would be nice to get a proposal out, so I'd be happy if you could start the process.
I'm not really sure what you need as a proposal. I can write down here an idea of what I would expect from this kind of search. Copying from the Item Search spec, I would expect something like the following:
GET /collections/search
with possible fields for searching:
- bbox: exactly as defined for Item Search
- intersects: exactly as defined for Item Search
- datetime: exactly as defined for Item Search
- limit: exactly as defined for Item Search
- sub-catalogs: array of sub-catalog ids to include in the search for collections. Only collection objects in one of the provided sub-catalogs will be searched.
- collections: array of collection ids to return.
- text: simple text search over the fields title, description, provider and keywords (whether to include other fields in the text search is still to be decided)
- fields: exactly as defined for Item Search
- filter: exactly as defined for Item Search, where one could specify filters on various fields combined through and/or conditions
- sortby: exactly as defined for Item Search
The result would be a list of Collections.
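As a rough illustration of the proposal above, here is a minimal Python sketch that assembles such a GET request URL. The `/collections/search` path and the `text` parameter are taken from the proposal in this thread and are not part of any released spec; the function name and the comma-joined encoding of arrays (borrowed from how Item Search GET requests usually encode `bbox`) are assumptions.

```python
from urllib.parse import urlencode


def build_collection_search_url(base_url, *, bbox=None, datetime=None,
                                limit=None, text=None, collections=None,
                                sortby=None):
    """Build a GET URL for the *proposed* collection search endpoint.

    Hypothetical helper: the path and parameter names follow the proposal
    in this thread, not a published specification.
    """
    params = {}
    if bbox is not None:
        # Assumption: arrays are comma-joined, as in Item Search GET requests.
        params["bbox"] = ",".join(str(v) for v in bbox)
    if datetime is not None:
        params["datetime"] = datetime
    if limit is not None:
        params["limit"] = str(limit)
    if text is not None:
        params["text"] = text
    if collections:
        params["collections"] = ",".join(collections)
    if sortby is not None:
        params["sortby"] = sortby

    path = base_url.rstrip("/") + "/collections/search"
    query = urlencode(params)
    return f"{path}?{query}" if query else path


# Example: collections mentioning "10m resolution" inside a bounding box.
url = build_collection_search_url(
    "https://example.com/stac",  # hypothetical API root
    bbox=[5, 45, 10, 48],
    text="10m resolution",
    limit=10,
)
print(url)
```

The response to such a request would be the list of matching Collections described above.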
Of course, if we have two searches, one for collections and one for items, we will probably have to either differentiate the URLs (/collections/search and /items/search) or keep a single /search method with a mandatory parameter type whose value is "items" or "collections", to indicate which type of object to search.
Let me know if you need something different to start from.
I don't think it needs a separate endpoint such as GET /collections/search. I think GET /collections can simply be extended to support the additional query parameters etc.
I've just created a repo for Collection Search so that we can create and discuss issues there: https://github.com/stac-api-extensions/collection-search
I just thought it should be similar to Items search.
It is not very intuitive to have /search meaning to search in items and /collections with additional query parameters for searching in collections. I think the user would get confused.
I guess that's the legacy we need to live with; a top-level /items might have been better. But I don't see a good reason why we should add a /collections/search. It doesn't resolve the ambiguity of /search and it would clash with the /collections/:id path. We should also ask OGC what they would use...
One downside of only supporting GET /collections with parameters, as the /collections/{c_id}/items endpoint does now, is that POST with a large GeoJSON intersects would not be allowed.
FWIW, the actual path of the endpoint shouldn't matter, since clients should be picking it up from the Landing Page links via a link relation anyway. I'd be in favor of /collection-search with a custom link rel of something like https://api.stacspec.org/v1.0.0-rc.1/extensions/collection-search/rel/search.
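To make the link-relation point concrete, here is a small Python sketch of how a client might discover the endpoint from the Landing Page rather than hard-coding a path. The rel value is the one floated in the comment above (a suggestion, not a registered relation); the landing page content, the href, and the function name are made up for illustration.

```python
# Rel value as suggested in this thread; not a registered link relation.
COLLECTION_SEARCH_REL = (
    "https://api.stacspec.org/v1.0.0-rc.1/extensions/collection-search/rel/search"
)


def find_link_href(landing_page, rel):
    """Return the href of the first link with the given rel, or None."""
    for link in landing_page.get("links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


# Hypothetical landing page document, trimmed to the relevant parts.
landing_page = {
    "id": "example-api",
    "links": [
        {"rel": "self", "href": "https://example.com/stac"},
        {"rel": "data", "href": "https://example.com/stac/collections"},
        {"rel": COLLECTION_SEARCH_REL,
         "href": "https://example.com/stac/collection-search"},
    ],
}

endpoint = find_link_href(landing_page, COLLECTION_SEARCH_REL)
print(endpoint)
```

With this approach the server is free to choose any path, and the client only depends on the rel.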
Hmm, why is it not allowed? It conflicts with Transaction, but on the other hand, you can (in theory) do content negotiation to avoid the conflict.
Why do we require Item Search to be at /search when it is available via links anyway? @philvarner
I pushed up a very lightweight and high-level description of a potential Collection Search README. It was written in about 30 minutes, so feel free to discuss any changes and anything that doesn't make sense. There's likely a lot. PRs welcome. For now, I just used /search/collections as the endpoint, but I also asked @pvretano for his thoughts because I still think GET /collections would naturally be the best choice.
https://github.com/stac-api-extensions/collection-search/blob/main/README.md
Hmm, why is it not allowed? It conflicts with Transaction, but on the other hand, you can (in theory) do content negotiation to avoid the conflicts.
It would conflict with a Collections Transaction extension (though not the Item Transaction extension), which I think we want to do. I think requiring content negotiation makes this too complex.
Why do we require Item Search to be at /search when it is available via links anyway? @philvarner
I brought this up in the past, and I think the resolution was that explicitly defining it to be /search makes the OpenAPI definition feasible. But we could say that that endpoint name is just an example of what could be used.
I think we could support GET /collections and GET & POST /search-collections (as indicated specifically by a link rel).
So OGC API is using GET /collections only, no POST for larger payloads.
So if we want to inherit from them, we need to do GET /collections, and the POST equivalent for searching would be an issue that we need to solve in STAC ourselves, e.g. via content negotiation.
This also means you re-use the "data" relation type and you can use conformance classes to detect whether it supports additional queries etc.
See https://github.com/stac-api-extensions/collection-search/issues/2 for details...
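As a sketch of the conformance-class idea, a client could check the Landing Page's `conformsTo` array before sending extra query parameters to GET /collections. The conformance URI below is a placeholder (none has been defined in this thread), and the function name is hypothetical.

```python
# Placeholder URI: no conformance class has been defined in this thread.
HYPOTHETICAL_CONFORMANCE_URI = "https://example.com/conformance/collection-search"


def supports_conformance_class(landing_page, conformance_uri):
    """Check whether the landing page advertises a given conformance class."""
    return conformance_uri in landing_page.get("conformsTo", [])


# Hypothetical landing page that advertises the placeholder class.
landing_page = {
    "conformsTo": [
        "https://api.stacspec.org/v1.0.0-rc.1/core",
        HYPOTHETICAL_CONFORMANCE_URI,
    ],
    "links": [
        # The existing "data" relation already points at /collections.
        {"rel": "data", "href": "https://example.com/stac/collections"},
    ],
}

if supports_conformance_class(landing_page, HYPOTHETICAL_CONFORMANCE_URI):
    print("GET /collections accepts search parameters")
else:
    print("plain collection listing only")
```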
I'd propose continuing discussions about Collection Search in https://github.com/stac-api-extensions/collection-search/issues to streamline the discussion around more specific issues.