Issues with pycsw mapping ISO-DIF

Open epifanio opened this issue 4 years ago • 3 comments

Description

Problem: mapping of ISO records to DIF (using GCMD DIF type/subtype vocabulary).

Given an ISO-compliant metadata record, I ran into problems with the mapping to DIF at several levels. Two examples:

  • Data Access
  • Dataset Landing Page

Environment

  • operating system:
    • Linux - Ubuntu Server 20.04
  • Python version:
    • Python 3.8.x
  • pycsw version:
    • Git Master
  • source/distribution
    • [X] git clone
    • [ ] DebianGIS/UbuntuGIS
    • [ ] PyPI
    • [ ] zip/tar.gz
    • [ ] other (please specify):
  • web server
    • [X] Apache/mod_wsgi
    • [ ] CGI
    • [ ] other (please specify):

Steps to Reproduce

Indexing the following ISO Record:

Results in the following DIF profile

The DIF output doesn't match the information available in the original ISO source.

Data Access

  • Problem: mapping of protocols between ISO and DIF (using GCMD DIF type/subtype vocabulary).

Currently the protocol strings are copied unchanged from the ISO record into the DIF Type element.

Current DIF output

<dif:Related_URL>
  <dif:URL_Content_Type>
    <dif:Type>OPENDAP:OPENDAP</dif:Type>
  </dif:URL_Content_Type>
  <dif:URL>opendap url</dif:URL> 
  <dif:Description>None</dif:Description>
</dif:Related_URL>

<dif:Related_URL>
  <dif:URL_Content_Type>
    <dif:Type>download</dif:Type>
  </dif:URL_Content_Type>
  <dif:URL>http download url</dif:URL>
  <dif:Description>None</dif:Description>
</dif:Related_URL>

Expected DIF9.7 output

<Related_URL>
  <URL_Content_Type>
    <Type>GET DATA</Type>
    <Subtype>OPENDAP DATA (DODS)</Subtype>
  </URL_Content_Type>
  <URL>opendapurl</URL>
</Related_URL>

<Related_URL>
  <URL_Content_Type>
    <Type>GET SERVICE</Type>
    <Subtype>GET WEB MAP SERVICE (WMS)</Subtype>
  </URL_Content_Type>
  <URL>wmsurl</URL>
</Related_URL>

<Related_URL>
  <URL_Content_Type>
    <Type>GET DATA</Type>
  </URL_Content_Type>
  <URL>Http download url</URL>
</Related_URL>
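
To make the expected mapping concrete, here is a minimal sketch of a protocol lookup, not the pycsw implementation. The table `PROTOCOL_TO_GCMD` and the function `related_url` are hypothetical names; `OPENDAP:OPENDAP` and `download` are the protocol strings from the current output above, while `OGC:WMS` is an assumed protocol string for the WMS endpoint. Element names are unprefixed to match the expected DIF 9.7 output.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: ISO protocol string -> (GCMD Type, GCMD Subtype).
# "OGC:WMS" is an assumption; the other two keys appear in the current output.
PROTOCOL_TO_GCMD = {
    "OPENDAP:OPENDAP": ("GET DATA", "OPENDAP DATA (DODS)"),
    "OGC:WMS": ("GET SERVICE", "GET WEB MAP SERVICE (WMS)"),
    "download": ("GET DATA", None),
}


def related_url(protocol, url):
    """Build a DIF Related_URL element for one ISO online resource."""
    # Unknown protocols fall back to the raw ISO string, as happens today.
    gcmd_type, subtype = PROTOCOL_TO_GCMD.get(protocol, (protocol, None))
    node = ET.Element("Related_URL")
    content_type = ET.SubElement(node, "URL_Content_Type")
    ET.SubElement(content_type, "Type").text = gcmd_type
    if subtype:
        # Subtype is only emitted when the vocabulary defines one.
        ET.SubElement(content_type, "Subtype").text = subtype
    ET.SubElement(node, "URL").text = url
    return node
```

Calling `ET.tostring(related_url("OPENDAP:OPENDAP", "opendap url"), encoding="unicode")` serializes the element for inspection. Keeping the vocabulary in a table means new protocols can be added without touching the element-building code.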

Dataset landing page

Current ISO output

<gmd:dataSetURI>
   <gco:CharacterString>Dataset landing page</gco:CharacterString>
</gmd:dataSetURI>

  • In DIF the landing page is exposed in two ways. The first is as a Related_URL of type DATASET LANDING PAGE.

Expected DIF output

<Related_URL>
  <URL_Content_Type>
    <Type>DATASET LANDING PAGE</Type>
  </URL_Content_Type>
  <URL>dataset landing page url</URL>
</Related_URL>

  • The second is the Online_Resource in the Data_Set_Citation element. This is currently missing from the DIF output, and the landing page should also be added to our reference ISO record.

Current DIF output:

<dif:Data_Set_Citation>
   <dif:Dataset_Creator/>
   <dif:Dataset_Release_Date/>
   <dif:Dataset_Publisher/>
   <dif:Data_Presentation_Form/>
</dif:Data_Set_Citation>

Expected DIF output

<Data_Set_Citation>
   <Dataset_Creator>xx</Dataset_Creator>
   <Dataset_Title>xx</Dataset_Title>
   <Dataset_Release_Date>2017-02-23T00:00:00Z</Dataset_Release_Date>
   <Dataset_Publisher>xx</Dataset_Publisher>
...
   <Online_Resource>Dataset landing page URI</Online_Resource>
</Data_Set_Citation>
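
A rough sketch of how the landing page could be written to both places, assuming the value of `gmd:dataSetURI` is already available as a string. The function `add_landing_page` is a hypothetical helper, not part of pycsw, and it uses unprefixed element names like the expected output above (pycsw's `dif.py` uses the `dif:` namespace).

```python
import xml.etree.ElementTree as ET


def add_landing_page(dif_root, landing_page_url):
    """Expose the ISO dataSetURI value in both places DIF expects it."""
    # 1. As a Related_URL of type DATASET LANDING PAGE.
    related = ET.SubElement(dif_root, "Related_URL")
    content_type = ET.SubElement(related, "URL_Content_Type")
    ET.SubElement(content_type, "Type").text = "DATASET LANDING PAGE"
    ET.SubElement(related, "URL").text = landing_page_url

    # 2. As Online_Resource inside Data_Set_Citation (created if absent).
    citation = dif_root.find("Data_Set_Citation")
    if citation is None:
        citation = ET.SubElement(dif_root, "Data_Set_Citation")
    ET.SubElement(citation, "Online_Resource").text = landing_page_url
    return dif_root
```

For example, `add_landing_page(ET.Element("DIF"), "https://example.org/dataset")` returns a root carrying both elements; the URL here is a placeholder.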

Additional Information

There are other issues with how the ISO keywords are mapped to DIF, in particular the GCMD Science Keywords.

In ISO we have:

<?xml version="1.0"?>
<gmd:descriptiveKeywords>
  <gmd:MD_Keywords>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Temperature &gt; Surface Temperature &gt; Air Temperature
</gco:CharacterString>
    </gmd:keyword>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Winds &gt; Surface Winds
</gco:CharacterString>
    </gmd:keyword>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Water Vapor
</gco:CharacterString>
    </gmd:keyword>
    <gmd:thesaurusName>
      <gmd:CI_Citation>
        <gmd:title>
          <gco:CharacterString>gcmd</gco:CharacterString>
        </gmd:title>
      </gmd:CI_Citation>
    </gmd:thesaurusName>
  </gmd:MD_Keywords>
</gmd:descriptiveKeywords>

see reference ISO

  • This is mapped from apiso:Subject into csw:Keywords, which is then mapped to dif:Keyword in dif.py.

  • In principle, DIF 9 expects these keywords as Parameters (with sub-elements) when the thesaurus name is GCMD, and as Keyword (a plain string) for any other thesaurus name.

As handling both cases is too complicated for now, I would only handle the GCMD thesaurus, so I need to map all ISO keyword entries to Parameters with this structure:

<Parameters>
<Category>EARTH SCIENCE</Category>
<Topic>SPECTRAL/ENGINEERING</Topic>
<Term>RADAR</Term>
<Variable_Level_1>RADAR BACKSCATTER</Variable_Level_1>
</Parameters>

See http://metadata.nersc.no/oai?verb=ListRecords&metadataPrefix=dif for example
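
The thesaurus-based rule described above could be sketched roughly as follows. This is a standalone illustration using the standard library, not pycsw code: `split_gcmd_keyword` and `map_keywords` are hypothetical helpers, the sketch assumes `>` is always the level separator, and it does not cap the number of `Variable_Level_N` entries.

```python
import xml.etree.ElementTree as ET

# ISO 19139 namespaces used by the gmd:/gco: prefixes in the record above.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def split_gcmd_keyword(keyword):
    """Split 'A > B > C > D' into DIF Parameters fields, or None if too short."""
    parts = [p.strip() for p in keyword.split(">")]
    # Category, Topic and Term are all needed to build a Parameters element.
    if len(parts) < 3 or not all(parts):
        return None
    fields = dict(zip(("Category", "Topic", "Term"), parts))
    for i, level in enumerate(parts[3:], start=1):
        fields[f"Variable_Level_{i}"] = level
    return fields


def map_keywords(md_keywords):
    """Return (parameters, plain_keywords) for one gmd:MD_Keywords element."""
    title = md_keywords.findtext(
        "gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString",
        default="", namespaces=NS)
    is_gcmd = title.strip().lower() == "gcmd"
    parameters, plain = [], []
    for node in md_keywords.findall("gmd:keyword/gco:CharacterString", NS):
        text = (node.text or "").strip()
        # Only keywords from the GCMD thesaurus are split into levels.
        fields = split_gcmd_keyword(text) if is_gcmd else None
        if fields:
            parameters.append(fields)
        elif text:
            plain.append(text)
    return parameters, plain
```

Each dictionary in `parameters` would become one Parameters element, and each string in `plain` one Keyword element. Deciding on the thesaurus title rather than on the presence of `>` avoids misreading free-text keywords that happen to contain that character.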

epifanio, Feb 15 '21 22:02

Regarding the last part of the issue, the keywords: to distinguish between ISO keywords with and without a thesaurus_name, would it make sense to have a column (which can be empty) to specify the 'dialect'/'flavour' of the ISO record, in my case GCMD? We could then add some logic in the core code to distinguish between keywords with and without a thesaurus_name, which would affect the transformation into a specific output profile.

epifanio, Feb 17 '21 20:02

I may have found a little hack to tune the output the way I needed, by modifying 'dif.py':

    # keywords
    val = util.getqattr(result, context.md_core_model['mappings']['pycsw:Keywords'])

    if val:
        for kw in val.split(','):
            # GCMD science keywords are hierarchical: Category > Topic > Term > Variable_Level_N
            values = [v.strip() for v in kw.split('>')]
            # Category, Topic and Term are all required, so at least three levels are needed
            if len(values) >= 3:
                parameters = etree.SubElement(node, util.nspath_eval('dif:Parameters', NAMESPACES))
                etree.SubElement(parameters, util.nspath_eval('dif:Category', NAMESPACES)).text = values[0]
                etree.SubElement(parameters, util.nspath_eval('dif:Topic', NAMESPACES)).text = values[1]
                etree.SubElement(parameters, util.nspath_eval('dif:Term', NAMESPACES)).text = values[2]
                # any remaining levels become Variable_Level_1, Variable_Level_2, ...
                for i, v in enumerate(values[3:], start=1):
                    etree.SubElement(parameters, util.nspath_eval(f'dif:Variable_Level_{i}', NAMESPACES)).text = v
            else:
                # anything else is emitted as a plain keyword
                etree.SubElement(node, util.nspath_eval('dif:Keywords', NAMESPACES)).text = kw

Note that this only works for my specific case, where I am sure that all the GCMD keywords I need to parse use the > symbol as the separator.

The code above will return:

<dif:Parameters>
    <dif:Category>Earth Science</dif:Category>
    <dif:Topic>Atmosphere</dif:Topic>
    <dif:Term>Atmospheric radiation</dif:Term>
    <dif:Variable_Level_1>Reflectance</dif:Variable_Level_1>
</dif:Parameters>

From an ISO keyword structured like:

<gmd:keyword>
    <gco:CharacterString>
        EARTH SCIENCE > Atmosphere > Atmospheric Winds > Surface Winds
    </gco:CharacterString>
</gmd:keyword>

epifanio, Feb 24 '21 15:02