
Add request splitting feature

Open matthewhanson opened this issue 4 years ago • 1 comment

It's useful to be able to split a large request spanning many pages into smaller requests so they can be run in parallel or asynchronously. Datetime is a convenient way to split up requests.

The Cirrus stac-api feeder code is one example of doing this with sat-search.

The ItemSearch class should have one or more functions that return an array of Request objects. The simplest of these could just take a num_requests parameter, take the datetime range, and split it into num_requests requests.
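A rough sketch of what the simplest version could look like; the helper name and splitting logic here are illustrative only, not part of pystac-client:

```python
# Hypothetical sketch (not part of pystac-client): split a datetime range
# into `num_requests` contiguous sub-ranges, one per sub-request.
from datetime import datetime


def split_datetime_range(start: datetime, end: datetime, num_requests: int):
    """Yield (sub_start, sub_end) tuples covering [start, end] in order."""
    step = (end - start) / num_requests
    for i in range(num_requests):
        sub_start = start + i * step
        # The last sub-range ends exactly at `end` to avoid rounding drift.
        sub_end = end if i == num_requests - 1 else start + (i + 1) * step
        yield sub_start, sub_end


# Example: four sub-ranges, each of which could back its own ItemSearch.
for sub_start, sub_end in split_datetime_range(
    datetime(2021, 1, 1), datetime(2021, 12, 31), 4
):
    print(f"{sub_start.isoformat()}/{sub_end.isoformat()}")
```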

A more complicated function would take in the approximate number of Items to return in each request. A series of requests can be made to get the number of hits and divide the work into roughly equal requests (as in the code linked above).
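Building on the helper above, a sketch of the hit-count-driven version might look like the following. It assumes the API reports a total match count (pystac-client exposes this via ItemSearch.matched()); the endpoint and collection are just examples, and it still splits time uniformly rather than by actual hit density:

```python
# Sketch: size the number of sub-requests from the total hit count,
# reusing the split_datetime_range helper from the previous snippet.
from datetime import datetime
from pystac_client import Client

client = Client.open("https://earth-search.aws.element84.com/v0")  # example endpoint
start, end = datetime(2021, 1, 1), datetime(2021, 12, 31)

# One cheap request to learn the total number of matching Items.
total = client.search(
    collections=["sentinel-s2-l2a-cogs"],
    datetime=f"{start.isoformat()}Z/{end.isoformat()}Z",
).matched()

items_per_request = 1000
num_requests = max(1, -(-total // items_per_request))  # ceiling division

# Build one ItemSearch per sub-range; these could then be run in parallel.
searches = [
    client.search(
        collections=["sentinel-s2-l2a-cogs"],
        datetime=f"{s.isoformat()}Z/{e.isoformat()}Z",
    )
    for s, e in split_datetime_range(start, end, num_requests)
]
```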

matthewhanson avatar Jul 30 '21 06:07 matthewhanson

https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb has an example where we want to query many (100,000) points for not too long of a date range. In this case, it's more important / necessary to parallelize by space rather than time.

https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb#Option-2:-Parallelize...-carefully. lays out the approach of using a Hilbert curve to spatially partition the points before (manually) making multiple requests to the item search endpoint. I don't know if that would be appropriate for pystac-client or not.
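For reference, a minimal version of that Hilbert-index idea without the notebook's dask-geopandas dependency might look like this; the grid order, lon/lat scaling, and partition sizing are arbitrary illustrative choices:

```python
# Rough sketch of spatial partitioning: order (lon, lat) points along a
# Hilbert curve so that each batch covers a compact region.
def hilbert_d(order, x, y):
    """Classic Hilbert-curve distance of integer (x, y) on a 2**order grid."""
    d, s = 0, 2 ** (order - 1)
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                        # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_points(points, n_partitions, order=16):
    """Group (lon, lat) points into spatially coherent batches."""
    scale = (2 ** order) - 1
    keyed = sorted(
        points,
        key=lambda p: hilbert_d(
            order,
            int((p[0] + 180) / 360 * scale),   # lon -> grid x
            int((p[1] + 90) / 180 * scale),    # lat -> grid y
        ),
    )
    size = -(-len(keyed) // n_partitions)      # ceiling division
    return [keyed[i:i + size] for i in range(0, len(keyed), size)]
```

Each batch could then be turned into its own search request, e.g. via a bounding box or multipoint intersects geometry.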

It also hints that async requests could help with performance (https://github.com/stac-utils/pystac-client/issues/4). We spend ~90% of our time waiting on I/O, and the rest on parsing JSON, so we're mostly (but not entirely) IO bound / non-blocking.
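As a sketch of that, here is roughly what issuing the split sub-requests concurrently could look like, bypassing pystac-client and posting directly to a STAC /search endpoint with aiohttp; the endpoint, collection, and payloads are placeholders:

```python
# Illustrative only: pystac-client does not do this today; it just shows
# split sub-requests being awaited concurrently, since the work is I/O bound.
import asyncio
import aiohttp

SEARCH_URL = "https://earth-search.aws.element84.com/v0/search"  # example endpoint


async def fetch(session, payload):
    async with session.post(SEARCH_URL, json=payload) as resp:
        resp.raise_for_status()
        return await resp.json()


async def run_searches(payloads):
    async with aiohttp.ClientSession() as session:
        # Kick off all sub-requests at once; most of the time is spent awaiting I/O.
        return await asyncio.gather(*(fetch(session, p) for p in payloads))


payloads = [
    {"collections": ["sentinel-s2-l2a-cogs"], "datetime": dt, "limit": 500}
    for dt in ["2021-01-01/2021-03-31", "2021-04-01/2021-06-30"]
]
pages = asyncio.run(run_searches(payloads))
```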

TomAugspurger avatar Sep 03 '21 19:09 TomAugspurger