pycsw
Issues with pycsw mapping ISO-DIF
Description
Problem: mapping of ISO records to DIF (using GCMD DIF type/subtype vocabulary).
Given an ISO-compliant metadata Record, I encountered some issues in the mapping to DIF at different levels. Listing two examples:
- Data Access
- Dataset Landing Page
Environment
- operating system:
- Linux - Ubuntu Server 20.04
- Python version:
- Python 3.8.x
- pycsw version:
- Git Master
- source/distribution
- [X] git clone
- [ ] DebianGIS/UbuntuGIS
- [ ] PyPI
- [ ] zip/tar.gz
- [ ] other (please specify):
- web server
- [X] Apache/mod_wsgi
- [ ] CGI
- [ ] other (please specify):
Steps to Reproduce
Indexing the following ISO Record:
Results in the following DIF profile
The DIF output doesn't match the information available in the original ISO source.
Data Access
- Problem: mapping of protocols between ISO and DIF (using GCMD DIF type/subtype vocabulary).
Currently the protocol strings are copied unchanged from the ISO record into the DIF output.
Current DIF output
<dif:Related_URL>
<dif:URL_Content_Type>
<dif:Type>OPENDAP:OPENDAP</dif:Type>
</dif:URL_Content_Type>
<dif:URL>opendap url</dif:URL>
<dif:Description>None</dif:Description>
</dif:Related_URL>
<dif:Related_URL>
<dif:URL_Content_Type>
<dif:Type>download</dif:Type>
</dif:URL_Content_Type>
<dif:URL>http download url</dif:URL>
<dif:Description>None</dif:Description>
</dif:Related_URL>
Expected DIF9.7 output
<Related_URL>
<URL_Content_Type>
<Type>GET DATA</Type>
<Subtype>OPENDAP DATA (DODS)</Subtype>
</URL_Content_Type>
<URL>opendapurl</URL>
</Related_URL>
<Related_URL>
<URL_Content_Type>
<Type>GET SERVICE</Type>
<Subtype>GET WEB MAP SERVICE (WMS)</Subtype>
</URL_Content_Type>
<URL>wmsurl</URL>
</Related_URL>
<Related_URL>
<URL_Content_Type>
<Type>GET DATA</Type>
</URL_Content_Type>
<URL>Http download url</URL>
</Related_URL>
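A lookup table is one way to express the mapping between ISO protocol strings and GCMD Type/Subtype pairs. The sketch below is not pycsw code: `UrlContentType`, `ISO_PROTOCOL_TO_GCMD` and `map_protocol` are hypothetical names, and the `OGC:WMS` key is an assumed ISO protocol string, since the output above only shows `OPENDAP:OPENDAP` and `download`.

```python
from typing import NamedTuple, Optional


class UrlContentType(NamedTuple):
    """GCMD URL content type; the subtype is optional."""
    url_type: str
    subtype: Optional[str] = None


# Hypothetical lookup table. Keys are ISO protocol strings normalised to
# upper case; values are the Type/Subtype pairs from the expected output.
# "OGC:WMS" is an assumption, the other two keys appear in the current output.
ISO_PROTOCOL_TO_GCMD = {
    "OPENDAP:OPENDAP": UrlContentType("GET DATA", "OPENDAP DATA (DODS)"),
    "OGC:WMS": UrlContentType("GET SERVICE", "GET WEB MAP SERVICE (WMS)"),
    "DOWNLOAD": UrlContentType("GET DATA"),
}


def map_protocol(iso_protocol: str) -> UrlContentType:
    """Return the GCMD content type for an ISO protocol string.

    Unknown protocols are passed through unchanged as the Type,
    which is what the current output does for every protocol.
    """
    cleaned = iso_protocol.strip()
    return ISO_PROTOCOL_TO_GCMD.get(cleaned.upper(), UrlContentType(cleaned))
```

Falling back to the original string keeps records with unmapped protocols valid while the table is being filled in.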
Dataset landing page
Current ISO output
<gmd:dataSetURI>
<gco:CharacterString>Dataset landing page</gco:CharacterString>
</gmd:dataSetURI>
In DIF the landing page is exposed in two ways:
- As a Related_URL with type DATASET LANDING PAGE.
Expected DIF output
<Related_URL>
<URL_Content_Type>
<Type>DATASET LANDING PAGE</Type>
</URL_Content_Type>
<URL>dataset landing page url</URL>
</Related_URL>
- As the Online_Resource in the Data_Set_Citation element. This is currently missing, and should be added to our reference ISO record.
Current DIF output:
<dif:Data_Set_Citation>
<dif:Dataset_Creator/>
<dif:Dataset_Release_Date/>
<dif:Dataset_Publisher/>
<dif:Data_Presentation_Form/>
</dif:Data_Set_Citation>
Expected DIF output
<Data_Set_Citation>
<Dataset_Creator>xx</Dataset_Creator>
<Dataset_Title>xx</Dataset_Title>
<Dataset_Release_Date>2017-02-23T00:00:00Z</Dataset_Release_Date>
<Dataset_Publisher>xx</Dataset_Publisher>
...
<Online_Resource>Dataset landing page URI</Online_Resource>
</Data_Set_Citation>
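To make the two landing page targets concrete, here is a minimal sketch that builds both fragments from a single URL. It uses the standard library `xml.etree.ElementTree` rather than pycsw's own helpers, omits the `dif:` namespace for brevity, and `landing_page_elements` and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET


def landing_page_elements(landing_page_url: str):
    """Build the two DIF fragments that expose a dataset landing page."""
    # 1. Related_URL with the type DATASET LANDING PAGE
    related_url = ET.Element("Related_URL")
    content_type = ET.SubElement(related_url, "URL_Content_Type")
    ET.SubElement(content_type, "Type").text = "DATASET LANDING PAGE"
    ET.SubElement(related_url, "URL").text = landing_page_url

    # 2. Online_Resource inside Data_Set_Citation (other citation
    #    fields such as Dataset_Creator are left out of this sketch)
    citation = ET.Element("Data_Set_Citation")
    ET.SubElement(citation, "Online_Resource").text = landing_page_url

    return related_url, citation


# Example with a placeholder URL standing in for the gmd:dataSetURI value
related_url, citation = landing_page_elements("https://example.org/dataset/1")
print(ET.tostring(related_url, encoding="unicode"))
print(ET.tostring(citation, encoding="unicode"))
```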
Additional Information
There are other issues related to how the ISO keywords are mapped to DIF, in particular the GCMD Science Keywords.
In ISO we have:
<?xml version="1.0"?>
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gco:CharacterString>
EARTH SCIENCE > Atmosphere > Atmospheric Temperature > Surface Temperature > Air Temperature
</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>
EARTH SCIENCE > Atmosphere > Atmospheric Winds > Surface Winds
</gco:CharacterString>
</gmd:keyword>
<gmd:keyword>
<gco:CharacterString>
EARTH SCIENCE > Atmosphere > Atmospheric Water Vapor
</gco:CharacterString>
</gmd:keyword>
<gmd:thesaurusName>
<gmd:CI_Citation>
<gmd:title>
<gco:CharacterString>gcmd</gco:CharacterString>
</gmd:title>
</gmd:CI_Citation>
</gmd:thesaurusName>
</gmd:MD_Keywords>
</gmd:descriptiveKeywords>
see reference ISO
- This is mapped from apiso:Subject into csw:Keywords, which is then mapped to dif:Keyword in dif.py.
In principle, in DIF 9 it should instead be mapped to Parameters (with sub-elements) when the thesaurus name is GCMD, and to Keyword (a plain string) for any other thesaurus name.
As this is too complicated, I would handle only the GCMD thesaurus for now, so I need to map all ISO entries to Parameters with this structure:
<Parameters>
<Category>EARTH SCIENCE</Category>
<Topic>SPECTRAL/ENGINEERING</Topic>
<Term>RADAR</Term>
<Variable_Level_1>RADAR BACKSCATTER</Variable_Level_1>
</Parameters>
See http://metadata.nersc.no/oai?verb=ListRecords&metadataPrefix=dif for example
Regarding the last part of the issue, the one about keywords: to distinguish between ISO keywords with and without a thesaurus name, would it make sense to have a column (which can be empty) to specify the 'dialect'/'flavour' of the ISO record, in my case GCMD? Some logic could then be added to the core code to distinguish between keywords with and without a thesaurus name, which would affect the transformation into a specific output profile.
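One way to make that distinction at parse time, sketched under assumptions: read the thesaurus title of each gmd:MD_Keywords block and pick the DIF target from it. This is not how pycsw currently stores keywords; `split_keywords_by_thesaurus` and `dif_target` are hypothetical helpers, and the comparison against the title `gcmd` assumes records are tagged as in the ISO sample above.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespaces
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def split_keywords_by_thesaurus(md_keywords: ET.Element):
    """Return (thesaurus_title, keywords) for one gmd:MD_Keywords element."""
    title = md_keywords.findtext(
        "gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString",
        default="",
        namespaces=NS,
    ).strip()
    keywords = [
        (el.text or "").strip()
        for el in md_keywords.findall("gmd:keyword/gco:CharacterString", NS)
    ]
    # Drop empty entries; a missing thesaurus yields an empty title
    return title, [kw for kw in keywords if kw]


def dif_target(thesaurus_title: str) -> str:
    """GCMD keywords go to Parameters, everything else to Keyword."""
    return "Parameters" if thesaurus_title.lower() == "gcmd" else "Keyword"


SAMPLE = """
<gmd:MD_Keywords xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:keyword>
    <gco:CharacterString>
      EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Winds &gt; Surface Winds
    </gco:CharacterString>
  </gmd:keyword>
  <gmd:thesaurusName>
    <gmd:CI_Citation>
      <gmd:title><gco:CharacterString>gcmd</gco:CharacterString></gmd:title>
    </gmd:CI_Citation>
  </gmd:thesaurusName>
</gmd:MD_Keywords>
"""

title, keywords = split_keywords_by_thesaurus(ET.fromstring(SAMPLE.strip()))
print(title, dif_target(title), keywords)
```

Keeping the thesaurus title next to each keyword group would avoid guessing from the `>` separator alone.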
I may have found a little hack to tune the output the way I needed, by modifying 'dif.py':
# keywords
val = util.getqattr(result, context.md_core_model['mappings']['pycsw:Keywords'])
if val:
    for kw in val.split(','):
        # GCMD science keywords are '>'-separated paths; strip padding around each level
        values = [v.strip() for v in kw.split('>')]
        # values[2] is read below, so at least three levels are required
        if len(values) >= 3:
            parameters = etree.SubElement(node, util.nspath_eval('dif:Parameters', NAMESPACES))
            etree.SubElement(parameters, util.nspath_eval('dif:Category', NAMESPACES)).text = values[0]
            etree.SubElement(parameters, util.nspath_eval('dif:Topic', NAMESPACES)).text = values[1]
            etree.SubElement(parameters, util.nspath_eval('dif:Term', NAMESPACES)).text = values[2]
            # any remaining levels become Variable_Level_1, Variable_Level_2, ...
            for i, v in enumerate(values[3:], start=1):
                etree.SubElement(parameters, util.nspath_eval(f'dif:Variable_Level_{i}', NAMESPACES)).text = v
        else:
            # anything else stays a plain keyword string
            etree.SubElement(node, util.nspath_eval('dif:Keywords', NAMESPACES)).text = kw
Note that this works only for my specific case, where I am sure that all the GCMD keywords I need to parse use the > symbol as separator.
The code above produces output of this form:
<dif:Parameters>
<dif:Category>Earth Science</dif:Category>
<dif:Topic>Atmosphere</dif:Topic>
<dif:Term>Atmospheric radiation</dif:Term>
<dif:Variable_Level_1>Reflectance</dif:Variable_Level_1>
</dif:Parameters>
From an ISO keyword such as:
<gmd:keyword>
<gco:CharacterString>
EARTH SCIENCE > Atmosphere > Atmospheric Winds > Surface Winds
</gco:CharacterString>
</gmd:keyword>