pystac-client icon indicating copy to clipboard operation
pystac-client copied to clipboard

Issue with `get_collections`

Open system123 opened this issue 3 years ago • 8 comments

I am seeing a problem where if I call get_collections directly it is very slow and only the first page of the collections are returned. If I reimplement the get_collections function it runs almost instantaneously and returns the entire list of collections.

Here is a minimal example:

This takes almost 30 seconds to run and returns only the first page of collections (10 in total).

from pystac_client import Client

STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
catalog = Client.open(f'{STAC_URL}/OB_DAAC')

for c in catalog.get_collections():
    print(c)

This is the re-implementation which runs as expected:

from pystac_client import Client, CollectionClient 
from pystac_client._utils import Modifiable, call_modifier 

STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
catalog = Client.open(f'{STAC_URL}/OB_DAAC')

def get_collections(catalog):
    url = f"{catalog.get_self_href()}/collections"

    for page in catalog._stac_io.get_pages(url):
        if "collections" not in page:
            continue

        for col in page["collections"]:
            collection = CollectionClient.from_dict(
                            col, root=catalog, modifier=catalog.modifier
                        )
            call_modifier(catalog.modifier, collection)
            yield collection

for col in get_collections(catalog):
    print(col)

This implementation is basically copy pasted from the code, so I am not sure why calling it directly on the class doesn't provide the same performance or output.

system123 avatar Sep 21 '22 14:09 system123

I think that https://cmr.earthdata.nasa.gov/stac/OB_DAAC is missing a /collections in its conformsTo, so pystac-client doesn't know that the /collections endpoint exists.

If "https://api.stacspec.org/v1.0.0-rc.1/collections", were added to that conformsTo, then things should be as fast as expected. @system123 do you know who maintains that endpoint? (or perhaps @gadomski knows?)

TomAugspurger avatar Sep 21 '22 15:09 TomAugspurger

For APIs that don't advertise their collections endpoint, you can use the ignore_conformance parameter (or --ignore-conformance on the command line). This instructs the Client to assume a fully-featured API, even if conformance links are not provided.

I tested this out on the command line and here's the results. First, without --ignore-conformance:

$ time stac-client collections https://cmr.earthdata.nasa.gov/stac/OB_DAAC | jq '. | length'
10
stac-client collections https://cmr.earthdata.nasa.gov/stac/OB_DAAC  0.35s user 0.09s system 0% cpu 59.677 total

Then, with --ignore-conformance:

$ time stac-client collections --ignore-conformance https://cmr.earthdata.nasa.gov/stac/OB_DAAC | jq '. | length'
897
stac-client collections --ignore-conformance   1.33s user 0.13s system 5% cpu 28.014 total

It may be reasonable for pystac-client to (e.g.) throw a warning when iterating collections without the collections endpoint, since this is surprising but not error-worthy behavior (IMO). @philvarner @matthewhanson what do you think?

EDIT: @TomAugspurger we may maintain that endpoint but I don't know, I'm asking around.

gadomski avatar Sep 21 '22 15:09 gadomski

@TomAugspurger that endpoint is maintained by the NASA CMR team.

gadomski avatar Sep 21 '22 15:09 gadomski

Thanks. I opened https://github.com/nasa/cmr-stac/issues/236 to track that. We can leave this open to address whether pystac-client should warn when taking the slow path. I don't have a strong preference, as long as there's a relatively easy way to silence those warnings (like a parameter when creating the catalog).

TomAugspurger avatar Sep 21 '22 15:09 TomAugspurger

I think a warning would be a good idea. Only implementing Core & Item Search happens, so calling collections on such an API could be surprisingly slow for the user if there's a deeply-nested tree of collections.

philvarner avatar Sep 21 '22 16:09 philvarner

@gadomski Thanks, adding ignore_conformance=True solved the issue for me – although I do think having a warning suggesting to use ignore_conformance=True could be useful as I missed this in the documentation and the fact that the collections endpoint does exist the slower performance seemed strange.

@TomAugspurger Thanks for opening an issue on NASA CMR.

system123 avatar Sep 22 '22 08:09 system123

I'll also say that the meaning of ignore_conformance is a bit confusing, since that doesn't point to the behavior that happens. Maybe something more like use_collections_endpoint would be clearer.

Obviously some issues with changing the semantics now, but I think these heuristic behaviors (like /collections vs child links) are problematic because they do things that aren't clear to the user. If we didn't already have the behavior, I'd say that it should use the Features or Collections conformance class by default or if there's a data link rel but those aren't advertised, but require the user to set a flag(s) like use_collections_endpoint=True or browse_for_collections=True otherwise. That way, it's explicit that the client is using a slower behavior.

philvarner avatar Sep 22 '22 13:09 philvarner

Maybe something more like use_collections_endpoint would be clearer.

Agreed. In general, I'd like to see granular configuration (probably in a config-style object) where you can enable/disable(/warn/error) various heuristics, since this type of issue seems to keep popping up (e.g. related to https://github.com/stac-utils/pystac-client/issues/136 you could force the client to use the items endpoint even if its not advertised).

gadomski avatar Sep 22 '22 15:09 gadomski

Ok just to make sure this comes full circle now that #480 is in (and there should probably be more docs on this).

First you try running it naively:

In [1]: from pystac_client import Client
   ...: 
   ...: STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
   ...: catalog = Client.open(f'{STAC_URL}/OB_DAAC')
   ...: 
   ...: for c in catalog.get_collections():
   ...:     print(c)
   ...: 
/home/jsignell/pystac-client/pystac_client/client.py:399: DoesNotConformTo: Server does not conform to COLLECTIONS, FEATURES
  self._warn_about_fallback("COLLECTIONS", "FEATURES")
/home/jsignell/pystac-client/pystac_client/client.py:399: FallbackToPystac: Falling back to pystac. This might be slow.
  self._warn_about_fallback("COLLECTIONS", "FEATURES")
<CollectionClient id=Turbid9.v0>
<CollectionClient id=GreenBay.v0>
<CollectionClient id=Catlin_Arctic_Survey.v0>
... # this goes slowly

Then you notice the warnings about conformance and falling back to pystac so you do add_conforms_to

In [2]: from pystac_client import Client
   ...: 
   ...: STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
   ...: catalog = Client.open(f'{STAC_URL}/OB_DAAC')
   ...: catalog.add_conforms_to("COLLECTIONS")
   ...: 
   ...: for c in catalog.get_collections():
   ...:     print(c)
   ...: 
/home/jsignell/pystac-client/pystac_client/client.py:601: MissingLink: No link with rel='data' could be found on this Client.
  href = self._get_href("data", data_link, "collections")
<CollectionClient id=Turbid9.v0>
<CollectionClient id=GreenBay.v0>
<CollectionClient id=Catlin_Arctic_Survey.v0>
... # this goes fast

jsignell avatar May 04 '23 18:05 jsignell