Different numbers of STAC items returned between version `0.7.3` and `>=0.7.4`

Open robbibt opened this issue 1 year ago • 1 comments

Hi all, we've recently had users report items missing from the results returned by our Digital Earth Australia STAC API (https://explorer.sandbox.dea.ga.gov.au/stac). The data does exist; however, only a small proportion of matching STAC items are being returned by pystac_client.

Looking into this, the issue appears to occur only on the most recent versions of pystac_client; version 0.7.3 and earlier return all relevant results for a query as expected.

pystac_client version 0.7.3

For example, on pystac_client==0.7.3, the query works perfectly and returns the expected 37 Sentinel-2 scenes:

import pystac_client

client = pystac_client.Client.open("https://explorer.sandbox.dea.ga.gov.au/stac")

collections = ["ga_s2am_ard_3", "ga_s2bm_ard_3"]
query = client.search(
    collections=collections,
    bbox=[146.04, -34.30, 146.05, -34.28],
    datetime="2023-12-01/2024-02-28",
)

len([i.properties["datetime"] for i in query.items()]) 

pystac_client version 0.7.4 and above

However, on pystac_client==0.7.4 and above, only 20 items are returned for exactly the same query.

(A workaround is to provide a high limit manually, e.g. limit=1000; however, this feels unnecessary and is not something our users have had to do in the past.)

In case it's useful, our STAC API implementation is located here: https://github.com/opendatacube/datacube-explorer/blob/develop/cubedash/_stac.py

robbibt, Mar 01 '24 01:03

In v0.7.4 we removed the default limit from pystac-client (https://github.com/stac-utils/pystac-client/pull/584) because it makes more sense to trust the server's default limit. You noticed the change because pagination appears to be broken for your API:

$ cat data.json 
{
    "bbox": [
        146.04,
        -34.3,
        146.05,
        -34.28
    ],
    "datetime": "2023-12-01T00:00:00Z/2024-02-28T23:59:59Z",
    "collections": [
        "ga_s2am_ard_3",
        "ga_s2bm_ard_3"
    ]
}
$ curl -s -X POST https://explorer.sandbox.dea.ga.gov.au/stac/search --json @data.json | jq '.links[0]'
{
  "rel": "next",
  "href": "https://explorer.sandbox.dea.ga.gov.au/stac/search?collections=ga_s2am_ard_3&collections=ga_s2bm_ard_3&bbox=146.04,-34.3,146.05,-34.28&time=2023-12-01T00%3A00%3A00%2B00%3A00%2F2024-02-28T23%3A59%3A59%2B00%3A00&limit=20&_o=20&_full=True"
}
$ curl -s https://explorer.sandbox.dea.ga.gov.au/stac/search\?collections\=ga_s2am_ard_3\&collections\=ga_s2bm_ard_3\&bbox\=146.04,-34.3,146.05,-34.28\&time\=2023-12-01T00%3A00%3A00%2B00%3A00%2F2024-02-28T23%3A59%3A59%2B00%3A00\&limit\=20\&_o\=20\&_full\=True | jq '.features | length'
0

One note: it's surprising to me that the next link would be a GET URL when the original search request came in as a POST.
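
To make the failure mode concrete, here is a minimal sketch of how a paginating client might walk `next` links. This is not pystac-client's actual implementation: `iter_features`, `make_page`, and the two `fetch` stand-ins are hypothetical names, and the pages are fabricated in memory rather than fetched from the DEA API. It only illustrates that once the page behind the `next` link comes back with zero features, iteration stops at whatever the first page held.

```python
from typing import Callable, Iterator, Optional


def iter_features(first_page: dict, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield features page by page, following rel="next" links.

    `fetch` is a hypothetical callable that takes a URL and returns the
    decoded JSON for that page.
    """
    page: Optional[dict] = first_page
    while page is not None:
        features = page.get("features", [])
        if not features:
            # An empty page ends iteration, even if more items match on the server.
            return
        yield from features
        next_href = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
        page = fetch(next_href) if next_href else None


def make_page(start: int, stop: int, next_href: Optional[str] = None) -> dict:
    """Build a fake search response holding items numbered start..stop-1."""
    links = [{"rel": "next", "href": next_href}] if next_href else []
    return {"features": [{"id": f"item-{n}"} for n in range(start, stop)], "links": links}


# First page: 20 items plus a next link (placeholder URL, not the real API).
first_page = make_page(0, 20, "https://example.invalid/search?_o=20")


def broken_fetch(url: str) -> dict:
    # Mimics the behaviour shown above: the next page has no features.
    return make_page(0, 0)


def working_fetch(url: str) -> dict:
    # What a working server would return: the remaining 17 items.
    return make_page(20, 37)


broken_count = len(list(iter_features(first_page, broken_fetch)))
working_count = len(list(iter_features(first_page, working_fetch)))
print(broken_count, working_count)  # prints "20 37"
```

With a client-side default limit larger than the result set, everything fits on the first page and the broken second page is never requested, which would explain why the problem only surfaced once the default limit was removed.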

gadomski, Mar 01 '24 14:03

Closing as not-a-pystac-client-issue.

gadomski, May 09 '24 12:05