esgf-pyclient

No consistent result from different queries (CMIP6)

Open susopeiz opened this issue 4 months ago • 1 comment

I'm trying to query all available CMIP6 projections for selected models, scenarios and variables, but I get different results depending on the additional parameters I use in the query. Here is an example for clarity.

I connect to the German data center:

from pyesgf.logon import LogonManager
from pyesgf.search import SearchConnection

hostname = "esgf-data.dkrz.de"

# username and password are my ESGF credentials (not shown here)
lm = LogonManager()
lm.logon(
    hostname=hostname,
    bootstrap=True,
    username=username,
    password=password,
    interactive=False,
)

# distrib=True asks the index node to search across the federation
url = "http://{}/esg-search".format(hostname)
conn = SearchConnection(url, distrib=True)

I query all available projections for the CanESM5 model, scenario ssp245, variable zg500 and member_id r1i1p1f1:

fields = {
    "project": "CMIP6",
    "frequency": "day",
    "variable": "zg500",
    "source_id": "CanESM5",
    "member_id": "r1i1p1f1",
    "experiment_id": "ssp245"
}

ctx = conn.new_context(**fields)
counts = ctx.hit_count
results = ctx.search()
print(f'Number of counts: {counts}')

print(f'\nFiles found:')
for r in results:
    print(r.dataset_id)

In theory I get 0 hits (as reported by ctx.hit_count), but the results show 2 instances for this specific query (one on the Canadian server and one on the American server):

Number of counts: 0

Files found:
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.AERday.zg500.gn.v20190429|crd-esgf-drc.ec.gc.ca
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.AERday.zg500.gn.v20190429|esgf-data1.llnl.gov
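When comparing a hit count with a listing like the one above, it may help to collapse replicas first, since the two lines appear to be the same dataset hosted on two data nodes. The following is only a minimal sketch, assuming the text after `|` in `dataset_id` is the data node; `unique_datasets` is a hypothetical helper and not part of esgf-pyclient.

```python
from collections import defaultdict


def unique_datasets(dataset_ids):
    """Map each dataset (the text before '|') to the sorted data nodes holding it.

    Assumption: everything after '|' in a dataset_id names the data node.
    """
    nodes = defaultdict(set)
    for full_id in dataset_ids:
        dataset, _, node = full_id.partition("|")
        nodes[dataset].add(node)
    return {dataset: sorted(found) for dataset, found in nodes.items()}


# The two dataset_id values printed by the query above.
ids = [
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.AERday.zg500.gn.v20190429|crd-esgf-drc.ec.gc.ca",
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.AERday.zg500.gn.v20190429|esgf-data1.llnl.gov",
]

replicas = unique_datasets(ids)
for dataset, data_nodes in replicas.items():
    print(dataset, data_nodes)
```

With the real search, the list could be built as `[r.dataset_id for r in results]`.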

If I query without restricting to the specific member_id (r1i1p1f1):

fields = {
    "project": "CMIP6",
    "frequency": "day",
    "variable": "zg500",
    "source_id": "CanESM5",
    #"member_id": "r1i1p1f1",
    "experiment_id": "ssp245",
}

ctx = conn.new_context(**fields)
counts = ctx.hit_count
results = ctx.search()
print(f'Number of counts: {counts}')

print(f'\nFiles found:')
for r in results:
    if "r1i1p1f1" in r.dataset_id:
        print(r.dataset_id)

I now get a count of 4 (?), but the same instances as before:

Number of counts: 4

Files found:
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.AERday.zg500.gn.v20190429|crd-esgf-drc.ec.gc.ca
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.AERday.zg500.gn.v20190429|esgf-data1.llnl.gov

However, if I extend the query to include pr in addition to zg500:

fields = {
    "project": "CMIP6",
    "frequency": "day",
    "variable": ["pr", "zg500"],
    "source_id": "CanESM5",
    #"member_id": "r1i1p1f1",
    "experiment_id": "ssp245",
}

ctx = conn.new_context(**fields)
counts = ctx.hit_count
results = ctx.search()
print(f'Number of counts: {counts}')

print(f'\nFiles found:')
for r in results:
    if "r1i1p1f1" in r.dataset_id:
        print(r.dataset_id)

I no longer get any instances for zg500 (I do get some for pr, though):

Number of counts: 54

Files found:
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190429|esgf3.dkrz.de
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190306|esgf.ceda.ac.uk
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190429|esgf.ceda.ac.uk
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190429|esgf.nci.org.au
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190306|esgf.nci.org.au
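The pr listing above mixes two versions (v20190306 and v20190429) across several data nodes, which makes the raw count hard to interpret. A rough sketch for reducing such a listing to the latest version per dataset follows; it assumes the last dot-separated field before `|` is a `vYYYYMMDD` version string, and `latest_versions` is a hypothetical helper, not an esgf-pyclient function.

```python
def latest_versions(dataset_ids):
    """Return {dataset without version: latest version string}.

    Assumption: versions look like 'vYYYYMMDD', so plain string comparison
    orders them chronologically.
    """
    latest = {}
    for full_id in dataset_ids:
        dataset, _, _node = full_id.partition("|")
        stem, _, version = dataset.rpartition(".")
        if stem not in latest or version > latest[stem]:
            latest[stem] = version
    return latest


# The five dataset_id values printed by the query above.
pr_ids = [
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190429|esgf3.dkrz.de",
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190306|esgf.ceda.ac.uk",
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190429|esgf.ceda.ac.uk",
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190429|esgf.nci.org.au",
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190306|esgf.nci.org.au",
]

newest = latest_versions(pr_ids)
for stem, version in newest.items():
    print(stem, version)
```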

Or, for example, if I extend the query to several scenarios:

fields = {
    "project": "CMIP6",
    "frequency": "day",
    "variable": "zg500",
    "source_id": "CanESM5",
    #"member_id": "r1i1p1f1",
    "experiment_id": ["historical","ssp245","ssp585"],
}

ctx = conn.new_context(**fields)
counts = ctx.hit_count
results = ctx.search()
print(f'Number of counts: {counts}')

print(f'\nFiles found:')
for r in results:
    if "r1i1p1f1" in r.dataset_id:
        print(r.dataset_id)

I get no results for the SSPs, only 1 for historical:

Number of counts: 15

Files found:
CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.AERday.zg500.gn.v20190429|crd-esgf-drc.ec.gc.ca

Is this behaviour expected? My analysis requires cross-parameter searches, so that I can identify which simulations are available across a given list of variables and scenarios.
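To make the goal concrete, here is a minimal sketch of the cross-parameter check applied to a pooled list of `dataset_id` strings (for instance, collected from several narrower queries). It assumes the ten dot-separated fields seen in the outputs above (project, activity, institution, source_id, experiment_id, member_id, table, variable, grid, version); the function names are hypothetical and not part of esgf-pyclient.

```python
from collections import defaultdict
from itertools import product

FIELDS = (
    "project", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable", "grid_label", "version",
)


def parse_dataset_id(full_id):
    """Split 'a.b.c...|node' into a dict of fields (layout assumed from the outputs)."""
    dataset, _, node = full_id.partition("|")
    parsed = dict(zip(FIELDS, dataset.split(".")))
    parsed["data_node"] = node
    return parsed


def members_with_full_coverage(dataset_ids, variables, experiments):
    """Return sorted (source_id, member_id) pairs that have every
    requested (experiment, variable) combination."""
    found = defaultdict(set)
    for full_id in dataset_ids:
        f = parse_dataset_id(full_id)
        key = (f["source_id"], f["member_id"])
        found[key].add((f["experiment_id"], f["variable"]))
    required = set(product(experiments, variables))
    return sorted(key for key, have in found.items() if required <= have)


# A few dataset_id values taken from the outputs above.
pooled = [
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.AERday.zg500.gn.v20190429|crd-esgf-drc.ec.gc.ca",
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.pr.gn.v20190429|esgf3.dkrz.de",
    "CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.AERday.zg500.gn.v20190429|crd-esgf-drc.ec.gc.ca",
]

ssp_only = members_with_full_coverage(pooled, ["pr", "zg500"], ["ssp245"])
both = members_with_full_coverage(pooled, ["pr", "zg500"], ["historical", "ssp245"])
print(ssp_only)
print(both)
```

In this small sample the member is complete for ssp245 but not once historical is added, because no historical pr entry is in the pooled list. The check is only as reliable as the listings fed into it, which is why consistent search results matter here.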

susopeiz, Feb 19 '24 09:02