cmr-stac
Searching for Collections that Don't Exist Times Out/Takes a Very Long Time
The Harmonised Landsat and Sentinel-2 collections changed their names, and I didn't realise that.
Code that used to run quickly now took tens of minutes and might not complete at all.
Example code:
from odc.geo import BoundingBox
from pystac_client import Client

catalog = Client.open("https://cmr.earthdata.nasa.gov/cloudstac/LPCLOUD/")

# Bounding box over western Tasmania
bbox = BoundingBox(145.0, -43.0, 146.3, -42.0, crs="EPSG:4326")

# Search a single collection (note the incorrect collection name)
items = catalog.search(
    collections=["HLSS30_v2.0"],
    bbox=list(bbox),
    datetime="2025-04",
).item_collection()

print(f"Found {len(items)} items")
Note that the collection should be HLSS30_2.0.
With the correct name, this query takes about 7 seconds. With the incorrect name, it runs for more than 5 minutes, at which point I kill it.
I think that this API should handle incorrect/non-existent collection names by rapidly returning 0 items.
Just a note that when searching for a collection that doesn't exist, the API returns 5,500 items after 30 minutes, whereas searching with the name of a collection that does exist returns around 30 items in 5 seconds.
Hi @alexgleith, regarding the suggestion of collection validation within the search method:
While validating collections directly in search seems intuitive, initial testing reveals a potential performance impact. Integrating a check like this:
    if collections:
        valid_collections = {col.id for col in self.get_collections()}
        invalid_collections_set = set(collections) - valid_collections
        if invalid_collections_set:
            raise ValueError(f"Invalid collections: {list(invalid_collections_set)}")
increased my test execution time from ~7-8 seconds to ~21 seconds.
Caching the valid collections in the Client class could reduce repeated calls to self.get_collections(), but the initial collection retrieval overhead would remain.
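To illustrate the caching idea, here is a minimal sketch and not a proposed patch to pystac-client. The class name `CollectionIdCache` and its `fetch_ids` callable are hypothetical, introduced only for this example. The first lookup still pays the full retrieval cost; only later lookups are cheap.

```python
from typing import Callable, Iterable, Optional, Set


class CollectionIdCache:
    """Hypothetical helper: fetch collection IDs once, then reuse them."""

    def __init__(self, fetch_ids: Callable[[], Iterable[str]]):
        # fetch_ids is any zero-argument callable that yields collection IDs.
        self._fetch_ids = fetch_ids
        self._ids: Optional[Set[str]] = None

    def ids(self) -> Set[str]:
        # Only the first call triggers the (slow) retrieval.
        if self._ids is None:
            self._ids = set(self._fetch_ids())
        return self._ids

    def missing(self, requested: Iterable[str]) -> Set[str]:
        # IDs that were requested but are not in the cached set.
        return set(requested) - self.ids()
```

With a `catalog` opened as in the original example, this could be wired up as `CollectionIdCache(lambda: (c.id for c in catalog.get_collections()))`. The cache never refreshes, so a long-running process would not notice collections being renamed.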
A more immediate workaround for users needing this validation is to implement it externally using Client.get_collection() or Client.get_collections(), similar to the snippet above.
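As a rough sketch of that external workaround, the check can live in user code and run before `catalog.search()`. The function names `find_unknown_collections` and `require_known_collections` are made up for this example and are not part of pystac-client; the list of known IDs is assumed to come from `Client.get_collections()`.

```python
from typing import Iterable, List


def find_unknown_collections(
    requested: Iterable[str], known_ids: Iterable[str]
) -> List[str]:
    """Return the requested IDs that are not among the known IDs."""
    known = set(known_ids)
    return sorted(c for c in requested if c not in known)


def require_known_collections(
    requested: Iterable[str], known_ids: Iterable[str]
) -> None:
    """Raise ValueError before searching if any requested ID is unknown."""
    unknown = find_unknown_collections(requested, known_ids)
    if unknown:
        raise ValueError(f"Unknown collections: {unknown}")


# Assumed usage with the catalog from the original example:
#   known_ids = [c.id for c in catalog.get_collections()]
#   require_known_collections(["HLSS30_v2.0"], known_ids)
# This raises ValueError if the ID is not in the catalogue,
# so the slow search is never sent.
```

This still pays the cost of listing every collection once. When only a single ID needs checking, `Client.get_collection()` may be cheaper, though how it reports a missing ID is not covered here.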
Curious to hear the maintainers' thoughts on balancing this feature with potential performance considerations.
Maybe there needs to be an index on the collection names...
I also don't know why the CMR STAC API would return items for a collection name that doesn't exist.