
Collection search uses paging as opposed to search-after header

doug-newman-nasa opened this issue · 22 comments

The following code uses `page_size` and a page number to iterate through results: https://github.com/nsidc/earthaccess/blob/8fe60974ce0f6e5d6f8fbec679afb96f12f1506f/earthaccess/search.py#L282C13-L282C64. `limit` could be set to a value high enough to cause problems for CMR (deep paging); the Search-After header exists to combat this.

Documentation on search-after: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#search-after

See similar fix to another CMR python library here: https://github.com/nasa/python_cmr/commit/1702100ab3f8dcb4843ab9da184612b372ec1854
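
For context, a minimal sketch (not earthaccess's or python_cmr's actual implementation) of what search-after pagination against the CMR collections endpoint can look like; the endpoint, page size, and function name below are illustrative:

```python
import requests

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"

def iter_collections(params, page_size=200):
    """Yield every matching collection, scrolling with CMR-Search-After
    instead of incrementing a page number."""
    headers = {}
    while True:
        response = requests.get(
            CMR_COLLECTIONS_URL,
            params={**params, "page_size": page_size},
            headers=headers,
        )
        response.raise_for_status()
        items = response.json().get("items", [])
        if not items:
            break
        yield from items
        # CMR hands back a CMR-Search-After header pointing at the next page;
        # echoing it on the following request continues the scroll without
        # ever sending a large page_num/offset.
        search_after = response.headers.get("CMR-Search-After")
        if not search_after:
            break
        headers["CMR-Search-After"] = search_after
```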

doug-newman-nasa avatar Mar 05 '24 03:03 doug-newman-nasa

I plan on fixing this in the same way I did with nasa/python_cmr.

doug-newman-nasa avatar Mar 05 '24 03:03 doug-newman-nasa

Hi Doug! Thanks for reporting this. We do use search-after for granule searches, since those results can number in the thousands or millions.

https://github.com/nsidc/earthaccess/blob/8fe60974ce0f6e5d6f8fbec679afb96f12f1506f/earthaccess/search.py#L645

That being said, it would be good practice to also use it for collections. I think that's the line you're referring to.

betolink avatar Mar 05 '24 04:03 betolink

Great catch! Luis resolved this long ago for granules (#145), and collection search didn't even cross my mind at the time :laughing: It's just not something I normally do, I guess.

mfisher87 avatar Mar 05 '24 12:03 mfisher87

Yeah, I got thrown off because you have two classes in the same file! I would suggest factoring out the granule code so it can also be used by the dataset code, since the two should be the same except for marshaling the results.

doug-newman-nasa avatar Mar 05 '24 21:03 doug-newman-nasa

Looking more closely at the granule `get` method, Search-After is only used if the requested response format is umm_json? https://github.com/nsidc/earthaccess/blob/8fe60974ce0f6e5d6f8fbec679afb96f12f1506f/earthaccess/search.py#L642

doug-newman-nasa avatar Mar 05 '24 23:03 doug-newman-nasa

:eyes: Not sure why that's like that! The whole conditional is a bit confusing to me. I think we badly need a refactor on that piece of core code :grin:

mfisher87 avatar Mar 06 '24 00:03 mfisher87

@doug-newman-nasa would you be interested in attending one of our bi-weekly hackathons?

mfisher87 avatar Mar 06 '24 00:03 mfisher87

Yeah, this is mainly because earthaccess only has parsers for umm_json. Initially I wanted to write parsers for json, iso19115, and echo10, but I've always run into irregularities with the metadata, and umm_json has been the most predictable flavor of the schemas.

betolink avatar Mar 06 '24 00:03 betolink

So, the reason I'm 'looking' here is that whenever I find myself in a repo for a Python wrapper around CMR, I immediately check that it isn't deep-paging. You aren't where it counts (granules), but you are in other places (collections). It also pointed out some redundancy between your two classes. I'd like to tackle some refactoring while addressing the main issue. Discussing that at a bi-weekly hackathon would be extremely useful (at least to me).

doug-newman-nasa avatar Mar 06 '24 00:03 doug-newman-nasa

@betolink Do you think the next step is to remove the temporary (?) code for the other parsers? Or start considering writing parsers for the other available CMR formats?

... it pointed out some redundancy in your two classes. I'd like to tackle some refactoring while addressing the main issue. Discussing that at a bi-weekly hackathon would be extremely useful (at least to me).

:bow: This will be so immensely appreciated! It's hard to get refactoring on core code like that done because we have many high-demand features and bugs splitting our attention! That doesn't make the refactors any less important, though. Looking forward to seeing you in two weeks!

mfisher87 avatar Mar 06 '24 00:03 mfisher87

@mfisher87 I'm leaning towards just supporting umm_json for now, and when the time comes we can map CMR responses to the proper results parsers. This brings me to two related topics:

  • We use umm_json because of its completeness and consistency, but could the same be achieved with GraphQL queries, which are in theory faster?
  • If we refactor the way we handle responses, it would be interesting to follow what pystac-client does: the results are wrapped in a data structure that contains STAC items, and we could do something similar. This could bring UX improvements to the way we present the data.

betolink avatar Mar 06 '24 02:03 betolink

GraphQL queries that in theory are faster?

GraphQL might be faster if you are doing multiple queries. The primary utility of GraphQL is reducing the number of queries made by the client, but there are still potentially multiple queries going on behind GraphQL. If you are getting everything you need from umm_json, then GraphQL isn't going to speed things up; it's still taking your query and querying the CMR API. But it might have utility elsewhere.

the results are wrapped into a data structure that contains stac items

If you want STAC items, why not use the CMR STAC API?

doug-newman-nasa avatar Mar 06 '24 03:03 doug-newman-nasa

We do not necessarily need the STAC items as such, just the way the response is handled, e.g.:

results = earthaccess.search_data(**params)

Right now, results is a list of parsed umm_json items. Each element of this list is an enhanced Python dictionary with some convenient custom methods like data_links(), etc.

I'm thinking of refactoring this so that results is wrapped in a class containing the same umm_json items, but with the handy methods able to operate on the entire list. The semantics are a bit different, but that would save users all those for loops to collect links from the results, filter granules by particular criteria, etc. Maybe this is not as urgent; I just thought it could be interesting to explore! This is the class in pystac-client: https://pystac-client.readthedocs.io/en/latest/api.html#pystac_client.ItemSearch
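
Roughly, the proposed shape might look like this hypothetical wrapper (the class name and the `filter` method are illustrative, not part of earthaccess today):

```python
class SearchResults:
    """Hypothetical wrapper around the parsed umm_json granules so that the
    handy per-granule methods can also operate on the whole result set."""

    def __init__(self, granules):
        self._granules = list(granules)

    def __iter__(self):
        return iter(self._granules)

    def __len__(self):
        return len(self._granules)

    def data_links(self):
        # Flatten the data links from every granule, saving callers the for loop.
        return [link for g in self._granules for link in g.data_links()]

    def filter(self, predicate):
        # Keep only the granules matching some criterion, still wrapped.
        return SearchResults(g for g in self._granules if predicate(g))
```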

betolink avatar Mar 06 '24 03:03 betolink

Just to check that we're on the same page: you're thinking a search would return a Results object containing Granule objects, with some special methods that make operations on the whole list easier? I've been thinking this might be useful as well, e.g. results.get_links()?

mfisher87 avatar Mar 06 '24 13:03 mfisher87

@mfisher87 Correct! We could also implement pagination like the STAC search results do, e.g. results.next_page(), if we don't want to load all the results in one go.

betolink avatar Mar 06 '24 15:03 betolink

I love that. Or we could promote a generator usage pattern? next(results)? Or, to get them all, list(results)!
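
One possible shape for that generator pattern, with a made-up `fetch_page` callable standing in for the actual CMR request (everything here is a sketch, not an API proposal):

```python
class LazySearchResults:
    """Hypothetical lazy results: pages are fetched only as the caller
    iterates, so one page is enough to produce the first item and
    list(results) drains them all."""

    def __init__(self, fetch_page):
        # fetch_page(search_after) -> (items, next_search_after); illustrative.
        self._fetch_page = fetch_page

    def __iter__(self):
        search_after = None
        while True:
            items, search_after = self._fetch_page(search_after)
            if not items:
                return
            yield from items
            if search_after is None:
                return
```

Making the object its own iterator (so plain next(results) works) is a small variation on the same idea.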

mfisher87 avatar Mar 06 '24 16:03 mfisher87

Started work on this. What I want to do is add VCR to your testing. I think I only see one test that actually queries CMR: test_data_links. So there is nothing that really tests collection search URL construction or results parsing at the collection level. Using VCR we can test this and remove the need to hit CMR each time the tests are run. You are also relying on the result contents returned by test_data_links not changing in the future. Thoughts?

doug-newman-nasa avatar Mar 14 '24 21:03 doug-newman-nasa

TIL VCR! I'm having one of those "how have I not heard of this?" moments ;) That's a really cool idea. I'm all for this! Would you be using the betamax implementation?

mfisher87 avatar Mar 15 '24 00:03 mfisher87

cc @danielfromearth this may be helpful for unit tests for #426

mfisher87 avatar Mar 15 '24 00:03 mfisher87

I'm using https://vcrpy.readthedocs.io/en/latest. I should have a PR ready today.
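
For anyone unfamiliar with the pattern, a vcrpy-backed test can look roughly like this (the cassette path and assertions are illustrative, not the contents of the actual PR):

```python
import vcr
import earthaccess

# On the first run vcrpy records the real CMR traffic into the YAML cassette;
# subsequent runs replay it, so the test no longer hits CMR.
@vcr.use_cassette("tests/unit/fixtures/vcr_cassettes/collection_search.yaml")
def test_collection_search_returns_parsed_results():
    results = earthaccess.search_datasets(keyword="sea ice", count=10)
    assert len(results) > 0
```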

doug-newman-nasa avatar Mar 15 '24 11:03 doug-newman-nasa

PR submitted: https://github.com/nsidc/earthaccess/pull/494

doug-newman-nasa avatar Mar 15 '24 14:03 doug-newman-nasa

I can't stress enough how much I feel this has improved our tests and will improve our testing practices going forward. Thank you! :100:

mfisher87 avatar Mar 15 '24 15:03 mfisher87

Fixed by #494

chuckwondo avatar May 09 '24 15:05 chuckwondo