Unexpected number of results for large query

Open jbusecke opened this issue 1 year ago • 0 comments

I am exploring the use of esgf-pyclient to get a list of all retracted CMIP6 datasets (for our automated maintenance of the Pangeo CMIP6 cloud data).

I am trying the following:

from pyesgf.search import SearchConnection
conn = SearchConnection(
    'https://esgf-node.llnl.gov/esg-search',
    distrib=True,
)
ctx = conn.new_context(mip_era='CMIP6', retracted=True, replica=False, fields='id', facets=['doi'])
ctx.hit_count

And I get back a hit count of 691984.

But when I try to extract a list of instance_ids:

results = ctx.search(batch_size=10000)
retracted = [ds.dataset_id for ds in results]
len(retracted)

The list only has 240000 elements, far fewer than the reported hit count. That very round number (exactly 24 batches of 10000) makes me think I am hitting some internal limit here.

Or did I miss something in the above code?
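
To narrow this down, a quick sanity check along the following lines could compare the retrieved list against the reported hit count. This is only a sketch: `summarize_retrieval` is a hypothetical helper, not part of esgf-pyclient, and the stand-in IDs merely mimic the length of my result list (in practice I would pass `retracted` and `ctx.hit_count`).

```python
from collections import Counter


def summarize_retrieval(ids, hit_count, batch_size):
    """Compare a list of retrieved IDs against the hit count reported by the server.

    Hypothetical helper for illustration only; not part of esgf-pyclient.
    """
    counts = Counter(ids)
    return {
        "retrieved": len(ids),
        "unique": len(counts),
        # Repeated IDs would suggest that batches overlap.
        "duplicates": len(ids) - len(counts),
        # Results that were reported by hit_count but never returned.
        "missing": hit_count - len(ids),
        "full_batches": len(ids) // batch_size,
        # True if retrieval stopped exactly at the end of a batch.
        "ends_on_batch_boundary": len(ids) % batch_size == 0,
    }


# Stand-in IDs with the same length as my result list (assumption for illustration).
example_ids = [f"dataset-{i}" for i in range(240000)]
summary = summarize_retrieval(example_ids, hit_count=691984, batch_size=10000)
print(summary)
```

If the real list shows no duplicates and ends exactly on a batch boundary, that would be consistent with retrieval stopping after a fixed number of batches rather than with overlapping or dropped results inside a batch.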

Any help on this would be greatly appreciated.

jbusecke • Mar 21 '23 19:03