xMIP
WIP: Extracting References from xarray metadata
The goal here is to get the reference info (DOI) for a bunch of data that was used for a publication from the Pangeo-ESGF CMIP6 Zarr Data 2.0.
I remembered, and confirmed online, that the relevant attribute is tracking_id, which usually has values that look like 'hdl:xxx/xxx'. I had trouble finding any info on how to look up the metadata (specifically the DOI or citation info) until I stumbled over this gh issue. I'll see if I can programmatically extract the DOI for each dataset now.
😭 There are datasets without the tracking_id field?!?!
Oh well, that's a problem I cannot solve today, but I now have prototype code:
import requests


def handle_to_url(handle):
    # convert a handle like 'hdl:xxx/xxx' to its handle.net API url
    return "https://hdl.handle.net/api/handles/" + handle.replace("hdl:", "")


def get_json(url):
    r = requests.get(url)
    if r.status_code == 200:
        return r.json()
    else:
        raise ValueError(f"Failed to retrieve data from {url}")


def get_value(json_response, value_type):
    # return the data value of the first entry with type "value_type"
    for value in json_response['values']:
        if value['type'] == value_type:
            return value['data']['value']
    raise ValueError(f"Value of type {value_type} not found in response")


def get_root_handle(json_response):
    # return the data value of the first entry with type "IS_PART_OF" (None if missing)
    for value in json_response['values']:
        if value['type'] == 'IS_PART_OF':
            return value['data']['value']


# now wrap this all in one function
def get_doi_from_tracking_id(tracking_id):
    """
    Get the DOI from a tracking ID (may hold several handles separated by newlines)
    """
    # use the argument here, not the global `ds`
    tracking_ids = tracking_id.split('\n')
    # check that all handles point to the same root handle
    root_handles = [get_value(get_json(handle_to_url(handle)), "IS_PART_OF") for handle in tracking_ids]
    # if not all root_handles are the same throw an error
    if len(set(root_handles)) > 1:
        raise ValueError("Not all handles point to the same root handle")
    else:
        root_handle = root_handles[0]
    # now get to the DOI of the root handle (which is again held in the "IS_PART_OF" value - that's a bit confusing)
    doi = get_value(get_json(handle_to_url(root_handle)), "IS_PART_OF")
    # if doi does not start with doi: then raise an error
    if not doi.startswith("doi:"):
        raise ValueError("Root handle does not point to a DOI")
    return doi
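To check the parsing logic without hitting the handle server, here is a minimal offline sketch. The response dict is hand-written and only contains the keys the prototype actually reads ('values', 'type', 'data', 'value'); the handle strings and the example.org URL are made up for illustration, and the two helpers are copied from the prototype so the snippet runs on its own.

```python
# Hand-written stand-in for a handle server response (NOT real data).
# Only the keys used by the prototype are included; the handles are hypothetical.
fake_response = {
    "values": [
        {"type": "URL", "data": {"value": "https://example.org/some_file.nc"}},
        {"type": "IS_PART_OF", "data": {"value": "hdl:21.T00000/dataset-handle"}},
    ]
}


def handle_to_url(handle):
    # same conversion as in the prototype
    return "https://hdl.handle.net/api/handles/" + handle.replace("hdl:", "")


def get_value(json_response, value_type):
    # same lookup as in the prototype
    for value in json_response["values"]:
        if value["type"] == value_type:
            return value["data"]["value"]
    raise ValueError(f"Value of type {value_type} not found in response")


# one hop up the chain: file handle -> parent handle
parent = get_value(fake_response, "IS_PART_OF")
print(parent)
# the url that would be queried next
print(handle_to_url(parent))
```

The real lookup just repeats this hop twice (file handle to root handle, root handle to DOI), which is what get_doi_from_tracking_id does.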
I was able to run this successfully as follows:
import intake
import xarray as xr
# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json" # Only stores that pass current tests
col = intake.open_esm_datastore(url)
path = col.df['zstore'].tolist()[100]
ds = xr.open_zarr(path, consolidated=True)
get_doi_from_tracking_id(ds.attrs['tracking_id'])
which gives me 'doi:10.22033/ESGF/CMIP6.11762' that points to https://www.wdc-climate.de/ui/cmip6?input=CMIP6.CMIP.NASA-GISS.GISS-E2-1-G-CC.historical
The original path was 'gs://cmip6/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G-CC/historical/r1i1p1f1/Omon/fsitherm/gn/v20190815/' so this seems to indicate that the method works! ❤️
OK, I deployed this successfully across a subset of datasets for a specific publication (gist; the code from above is slightly modified to use Python async libraries).
Several Observations:
- Not all datasets have the 'tracking_id' attribute 🙈. I was able to get it from others of the same simulation, but this seems bad in general.
- The performance is very slow. I did not feel like tuning this much today, but if we decide to make this part of the package, somebody should probably look into this a bit deeper.
- I really want to know where this whole PID/handle server business is documented on the ESGF side (if anyone knows, please post links below).
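On the performance point, one possible direction (a rough sketch, not what the gist does; the gist uses async libraries, this swaps in threads plus a cache) is to run the lookups concurrently and remember parents that were already resolved, since many files share the same root handle. The names resolve_dois and fetch_parent, the worker count, and the fake lookup table are all hypothetical; in practice fetch_parent would wrap the prototype, e.g. `lambda h: get_value(get_json(handle_to_url(h)), "IS_PART_OF")`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def resolve_dois(tracking_ids, fetch_parent, max_workers=8):
    """Map each single-handle tracking id to its DOI (two IS_PART_OF hops).

    fetch_parent(handle) must return the "IS_PART_OF" value for that handle.
    """
    # Cache results so shared root handles are mostly looked up only once.
    # (Concurrent first calls for the same handle may still both hit the server.)
    cached = lru_cache(maxsize=None)(fetch_parent)

    def resolve_one(tracking_id):
        root_handle = cached(tracking_id)  # file handle -> root handle
        return cached(root_handle)         # root handle -> DOI

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        dois = list(pool.map(resolve_one, tracking_ids))
    return dict(zip(tracking_ids, dois))


# Offline stand-in for the handle server; all handles and the DOI are made up.
FAKE_PARENTS = {
    "hdl:21.T00000/file-a": "hdl:21.T00000/dataset-1",
    "hdl:21.T00000/file-b": "hdl:21.T00000/dataset-1",
    "hdl:21.T00000/dataset-1": "doi:10.0000/example",
}


def fake_fetch_parent(handle):
    return FAKE_PARENTS[handle]


result = resolve_dois(["hdl:21.T00000/file-a", "hdl:21.T00000/file-b"], fake_fetch_parent)
print(result)
```

Threads are enough here because the work is network-bound; whether this beats the async version would need measuring.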
Anyways, happy about a successful end to the first day back from my hiatus hehe.