
WIP: Extracting References from xarray metadata

Open jbusecke opened this issue 7 months ago • 3 comments

The goal here is to get the reference info (DOI) for a bunch of datasets from the Pangeo-ESGF CMIP6 Zarr Data 2.0 collection that were used for a publication.

jbusecke avatar Apr 23 '25 17:04 jbusecke

I remembered, and confirmed online, that the relevant attribute is tracking_id, which usually has values that look like 'hdl:xxx/xxx'. I had trouble finding any info on how to look up the metadata (specifically the DOI or citation info) until I stumbled over this gh issue. I'll see if I can programmatically extract the DOI for each dataset now.
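
For reference, the handle server exposes a plain REST API that you can poke at directly. A quick sketch of what comes back (the handle below is a placeholder; real CMIP6 tracking_ids look like 'hdl:21.14100/<uuid>'):

import requests

# placeholder handle - substitute a real tracking_id value
handle = "hdl:21.14100/xxxx-xxxx"
url = "https://hdl.handle.net/api/handles/" + handle.replace("hdl:", "")
response = requests.get(url).json()
# each entry of 'values' has a 'type' (e.g. 'URL' or 'IS_PART_OF')
# and its payload under data -> value
for value in response["values"]:
    print(value["type"], "->", value["data"]["value"])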

jbusecke avatar Apr 23 '25 18:04 jbusecke

😭 There are datasets without the tracking_id field?!?!

Oh well, that's a problem I cannot solve today, but I now have prototype code:

import requests

def handle_to_url(handle):
    # convert handle to url
    return "https://hdl.handle.net/api/handles/"+handle.replace("hdl:","")

def get_json(url):
    r = requests.get(url)
    if r.status_code == 200:
        return r.json()
    else:
        raise ValueError(f"Failed to retrieve data from {url}")

def get_value(json_response, value_type):
    # return the value of the first entry whose type matches value_type
    for value in json_response['values']:
        if value['type'] == value_type:
            return value['data']['value']
    raise ValueError(f"Value of type {value_type} not found in response")

def get_root_handle(json_response):
    # return the root handle, i.e. the value with type "IS_PART_OF"
    return get_value(json_response, 'IS_PART_OF')

# now wrap this all in one function
def get_doi_from_tracking_id(tracking_id):
    """
    Get the DOI for a dataset from its tracking_id attribute
    """
    tracking_ids = tracking_id.split('\n')
    # check that all handles point to the same root handle
    root_handles = [get_root_handle(get_json(handle_to_url(handle))) for handle in tracking_ids]
    # if not all root_handles are the same, throw an error
    if len(set(root_handles)) > 1:
        raise ValueError("Not all handles point to the same root handle")
    root_handle = root_handles[0]
    # now get the DOI of the root handle (which is again held in the "IS_PART_OF" value - that's a bit confusing)
    doi = get_value(get_json(handle_to_url(root_handle)), "IS_PART_OF")
    # if the result does not start with "doi:" raise an error
    if not doi.startswith("doi:"):
        raise ValueError("Root handle does not point to a DOI")
    return doi
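
And since some stores are missing the attribute entirely (see the 😭 above), a thin guard like this (just a sketch) fails with a clearer error:

def get_doi_from_dataset(ds):
    # fail early with a clear message if the attribute is missing
    tracking_id = ds.attrs.get("tracking_id")
    if tracking_id is None:
        raise KeyError("Dataset has no 'tracking_id' attribute")
    return get_doi_from_tracking_id(tracking_id)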

I was able to run this successfully as follows:

import intake
import xarray as xr


url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json" # Only stores that pass current tests
col = intake.open_esm_datastore(url)

path = col.df['zstore'].tolist()[100]

ds = xr.open_zarr(path, consolidated=True)
get_doi_from_tracking_id(ds.attrs['tracking_id'])

which gives me 'doi:10.22033/ESGF/CMIP6.11762', pointing to https://www.wdc-climate.de/ui/cmip6?input=CMIP6.CMIP.NASA-GISS.GISS-E2-1-G-CC.historical

The original path was 'gs://cmip6/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G-CC/historical/r1i1p1f1/Omon/fsitherm/gn/v20190815/', so this seems to indicate that the method works! ❤️
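
Side note: the returned 'doi:' string can be turned into a clickable link via the doi.org resolver:

doi = "doi:10.22033/ESGF/CMIP6.11762"
link = doi.replace("doi:", "https://doi.org/", 1)
# -> 'https://doi.org/10.22033/ESGF/CMIP6.11762'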

jbusecke avatar Apr 23 '25 18:04 jbusecke

Ok, I deployed this successfully across a subset of datasets for a specific publication (gist; the code from above is slightly modified to use Python async libraries).

Several Observations:

  • Not all datasets have the 'tracking_id' attribute 🙈. I was able to get it from other datasets of the same simulation, but this seems bad in general.
  • The performance is very slow. I did not feel like tuning this much today, but if we decide to make this part of the package somebody should probably look into this a bit deeper - an obvious first step is issuing the handle lookups concurrently, as in the sketch after this list.
  • I really want to know where this whole PID/handle server business is documented on the ESGF side (if anyone knows, please post links below).
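
For the record, here is roughly what the concurrent version of the lookup looks like (sketched with aiohttp and the helpers from above; the gist may use slightly different libraries):

import asyncio
import aiohttp

async def get_json_async(session, url):
    # non-blocking equivalent of get_json above
    async with session.get(url) as resp:
        if resp.status != 200:
            raise ValueError(f"Failed to retrieve data from {url}")
        return await resp.json()

async def get_root_handles_async(tracking_ids):
    # resolve all handles of a dataset concurrently instead of one by one
    async with aiohttp.ClientSession() as session:
        responses = await asyncio.gather(
            *[get_json_async(session, handle_to_url(h)) for h in tracking_ids]
        )
    return [get_value(r, "IS_PART_OF") for r in responses]

asyncio.gather issues all handle lookups at once, so the network latency is paid roughly once per dataset instead of once per tracking_id.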

Anyways, happy about a successful end to the first day back from my hiatus hehe.

jbusecke avatar Apr 23 '25 23:04 jbusecke