cmr-stac

Searching for Collections that Don't Exist Times Out/Takes a Very Long Time

Open alexgleith opened this issue 7 months ago • 3 comments

The Harmonised Landsat and Sentinel-2 collections changed their names, and I didn't realise that.

Code that used to run quickly now took tens of minutes and possibly never completed.

Example code:

from pystac.client import Client
from odc.geo import BoundingBox

catalog = Client.open("https://cmr.earthdata.nasa.gov/cloudstac/LPCLOUD/")

# Bounding box over western Tasmania
bbox = BoundingBox(145.0, -43.0, 146.3, -42.0, crs="EPSG:4326")

# Simple search across a single collection (note the incorrect collection name)
items = catalog.search(
    collections=["HLSS30_v2.0"],
    bbox=list(bbox),
    datetime="2025-04",
).item_collection()

print(f"Found {len(items)} items")

Note that the collection should be HLSS30_2.0.

With the correct name, this query takes about 7 seconds. With the incorrect name, it ran for over 5 minutes before I killed it.

I think that this API should handle incorrect/non-existent collection names by rapidly returning 0 items.

alexgleith, Apr 21 '25 23:04

Just a note that when searching for a collection that doesn't exist, the API returns 5,500 items after 30 minutes, whereas searching with the name of a collection that does exist returns around 30 items in 5 seconds.

Image

alexgleith, Apr 22 '25 00:04

Hi @alexgleith, regarding the suggestion of collection validation within the search method:

While validating collections directly in search seems intuitive, initial testing reveals a potential performance impact. Integrating a check like this:

if collections:
    valid_collections = {col.id for col in self.get_collections()}
    invalid_collections_set = set(collections) - valid_collections
    if invalid_collections_set:
        raise ValueError(f"Invalid collections: {list(invalid_collections_set)}")

increased my test execution time from ~7-8 seconds to ~21 seconds.

Caching the valid collections in the Client class could reduce repeated calls to self.get_collections(), but the initial collection retrieval overhead would remain.
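To make the caching idea concrete, here is a minimal sketch. It assumes only that the client exposes get_collections() yielding objects with an id attribute, as in the snippet above. The CachedCollectionIds class and the stand-in client are hypothetical names for illustration and are not part of pystac-client:

```python
class CachedCollectionIds:
    """Fetch collection IDs once and reuse them for later membership checks.

    Hypothetical helper: `client` only needs a get_collections() method
    that yields objects with an `id` attribute.
    """

    def __init__(self, client):
        self._client = client
        self._ids = None  # filled lazily on first lookup

    def ids(self):
        # The expensive listing call happens at most once per cache instance.
        if self._ids is None:
            self._ids = frozenset(col.id for col in self._client.get_collections())
        return self._ids

    def __contains__(self, collection_id):
        return collection_id in self.ids()


# Stand-in objects so the sketch runs without network access.
class _FakeCollection:
    def __init__(self, id):
        self.id = id


class _CountingClient:
    def __init__(self):
        self.calls = 0

    def get_collections(self):
        self.calls += 1
        return iter([_FakeCollection("HLSS30_2.0")])


client = _CountingClient()
cache = CachedCollectionIds(client)
has_correct = "HLSS30_2.0" in cache
has_incorrect = "HLSS30_v2.0" in cache
print(has_correct, has_incorrect, client.calls)
```

The first lookup still pays the full cost of listing collections; only later lookups are cheap, and a long-lived cache would go stale if the catalog renames collections again.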

A more immediate workaround for users needing this validation is to implement it externally using Client.get_collection() or Client.get_collections(), similar to the snippet above.

Curious to hear the maintainers' thoughts on balancing this feature with potential performance considerations.

FaitAccompli, May 09 '25 15:05

Maybe there needs to be an index on the collection names...

I also don't know why the CMR STAC API would return items for a collection name that doesn't exist, though.

alexgleith, May 09 '25 20:05