
CdmToMods metadata parser: Repeated Wrapper Element code inadequately handles top-level elements that are also children of top-level elements in the same MODS file.

Open MarcusBarnes opened this issue 8 years ago • 12 comments

Issue #232 identified a problem with the code that consolidates repeated wrapper elements. If the CdmToMods metadata parser produces MODS like the following during metadata parsing:

<titleInfo>
    <title>some title</title>
</titleInfo>

<relatedItem type="series">
    <titleInfo>
      <title>related_item_title</title>
    </titleInfo>
</relatedItem>

The result will be

<titleInfo>
    <title>some title</title>
    <title>related_item_title</title>
</titleInfo>
<relatedItem type="series"/>

within the created MODS.xml, contrary to what is desired. This can be confirmed by adding repeatable_wrapper_elements[] = titleInfo under the [METADATA_PARSER] section of the config file for such a situation.
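To make the desired behaviour concrete, here is a minimal Python sketch (not MIK's PHP code) of consolidation restricted to direct children of the root element. The function name `consolidate_top_level` and the second top-level `<titleInfo>` in the sample are made up for illustration; namespaces are omitted and attributes are ignored to keep it short.

```python
import xml.etree.ElementTree as ET


def consolidate_top_level(root, tag):
    """Merge repeated direct children of `root` named `tag` into the first one.

    Elements with the same name nested deeper (for example inside
    <relatedItem>) are never touched, because only root's own children
    are examined.
    """
    matches = [child for child in root if child.tag == tag]
    if len(matches) < 2:
        return root
    first = matches[0]
    for duplicate in matches[1:]:
        first.extend(list(duplicate))  # carry the children over to the first wrapper
        root.remove(duplicate)         # drop the now-redundant wrapper
    return root


# Illustrative input: two top-level <titleInfo> wrappers plus one nested
# inside <relatedItem>, which must stay where it is.
SAMPLE = """<mods>
  <titleInfo><title>some title</title></titleInfo>
  <titleInfo><subTitle>a subtitle</subTitle></titleInfo>
  <relatedItem type="series">
    <titleInfo><title>related_item_title</title></titleInfo>
  </relatedItem>
</mods>"""

root = consolidate_top_level(ET.fromstring(SAMPLE), "titleInfo")
top_level_titles = [t.text for t in root.findall("./titleInfo/title")]
nested_titles = [t.text for t in root.findall("./relatedItem/titleInfo/title")]
```

With this input, `top_level_titles` contains only "some title" and `nested_titles` still contains "related_item_title", which is the outcome the issue asks for.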

MarcusBarnes avatar Jul 28 '16 22:07 MarcusBarnes

Related to https://github.com/MarcusBarnes/mik/issues/29

MarcusBarnes avatar Jul 28 '16 22:07 MarcusBarnes

Makes sense, but I was under the impression (wrong obviously) that wrapper elements controlled by that config option were restricted to be top-level (i.e., direct children of mods:mods) elements.

mjordan avatar Jul 28 '16 23:07 mjordan

@mjordan You are correct - it's a bug with the code that deals with consolidating the content of top-level elements.

MarcusBarnes avatar Jul 28 '16 23:07 MarcusBarnes

Related to this issue: difficulty with parsing more complex uses of repeatable elements and the relatedItem set...

Take a mapping file with the following in the top level:

"Publisher","<originInfo><publisher>%value%</publisher></originInfo>"
"Frequency","<originInfo><frequency>%value%</frequency></originInfo>"
"Place","<originInfo><place><placeTerm>%value%</placeTerm></place></originInfo>"

Because each value belongs under the originInfo parent, no repeatable_wrapper_elements[] for originInfo is used -- each will get stuck into the same originInfo element as desired.

But then, under the relatedItem element, you have the same set of elements under the relatedItem parent:

"null11","<relatedItem type=""succeeding""><titleInfo><title>Continues the Abbotsford Post</title><subTitle/></titleInfo><originInfo><publisher>Abbotsford Post</publisher><place><placeTerm type=""text"">Abbotsford</placeTerm></place><issuance>continuing</issuance><dateIssued point=""start"">1910</dateIssued><dateIssued point=""end"">1924</dateIssued><frequency authority=""marcfrequency"">Weekly</frequency></originInfo><identifier type=""issn"">POSTISSN</identifier></relatedItem>"

Without repeatable_wrapper_elements[] = originInfo, all the relatedItem's originInfo child elements get shunted into the first originInfo set.

Any way to deal with this?

bondjimbond avatar Feb 06 '17 18:02 bondjimbond

It's going to be challenging to offer control over when the wrapper element consolidation happens and when it doesn't. For example, what if you wanted consolidation in the main MODS document but not in subdocuments that exist within <relatedItem> or <extension>? I'm wondering if we can modify oneParentWrapperElement() to accept a list of XPaths that identify elements not to be consolidated, e.g., by adding an option like ['METADATA_PARSER']['repeatable_wrapper_elements_xpath']. Using an XPath expression would allow very specific control.
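A rough Python sketch of that idea, purely as an assumption about how an exclusion list could behave (it is not how oneParentWrapperElement() works today, and the option above is only proposed). The function `consolidate_except` and the sample values are hypothetical. The standard library's ElementTree supports only a limited XPath subset, which is enough for a path like `.//relatedItem//originInfo`.

```python
import xml.etree.ElementTree as ET


def consolidate_except(root, tag, exclude_paths=()):
    """Merge every <tag> element in the document into the first one found,
    skipping any element matched by one of `exclude_paths`."""
    excluded = set()
    for path in exclude_paths:
        excluded.update(root.findall(path))
    # ElementTree has no getparent(), so build a child -> parent lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    candidates = [el for el in root.iter(tag) if el not in excluded]
    for duplicate in candidates[1:]:
        candidates[0].extend(list(duplicate))
        parent_of[duplicate].remove(duplicate)
    return root


# Illustrative document: two top-level <originInfo> wrappers and one that
# belongs to <relatedItem>.
SAMPLE = """<mods>
  <originInfo><publisher>Example Publisher</publisher></originInfo>
  <originInfo><frequency>Weekly</frequency></originInfo>
  <relatedItem type="succeeding">
    <originInfo><publisher>Abbotsford Post</publisher></originInfo>
  </relatedItem>
</mods>"""

# With the exclusion, the relatedItem's originInfo is left alone.
kept = consolidate_except(
    ET.fromstring(SAMPLE), "originInfo", [".//relatedItem//originInfo"]
)
kept_tags = [child.tag for child in kept.find("./originInfo")]

# Without it, the nested children are pulled into the first originInfo,
# which is the behaviour reported in this thread.
shunted = consolidate_except(ET.fromstring(SAMPLE), "originInfo")
shunted_tags = [child.tag for child in shunted.find("./originInfo")]
```

Here `kept_tags` is `["publisher", "frequency"]`, while `shunted_tags` also picks up the related item's publisher and leaves `<relatedItem>` without its `<originInfo>`.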

mjordan avatar Feb 06 '17 19:02 mjordan

After reading http://www.loc.gov/standards/mods/userguide/relateditem.html#guidelines and seeing how many valid variations are possible, I'm inclined to remove the oneParentWrapperElement() method from the CdmToMods metadataparser class and instead create a metadataparser or postwritehook to deal with situations like this. That way, rather than having one configuration setting that deals with all variations, we can separate these cases out into specific metadataparsers or postwritehooks which use XPath expressions as parameters, to provide the fine grained control. Thoughts?

MarcusBarnes avatar Feb 06 '17 20:02 MarcusBarnes

XML cleanup is a solid use case for post-write hooks for sure. So the idea would be to let MIK just add duplicate wrapper elements and then let the post-write hook script clean up the resulting MODS as necessary?
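As a sketch of what such a cleanup script might look like if written in Python: it reads a finished MODS file, merges repeated top-level wrappers, and writes the file back. The function name `clean_mods_file`, the list of wrapper names, and the way the file path reaches the script are all assumptions; how MIK actually invokes post-write hooks and which arguments it passes is not shown here.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
# Serialize MODS elements with a default namespace instead of an ns0: prefix.
ET.register_namespace("", MODS_NS)


def clean_mods_file(path, wrappers):
    """Merge repeated top-level wrapper elements in the MODS file at `path`.

    `wrappers` is a list of local element names, e.g. ["originInfo"].
    Only direct children of the root are merged; wrappers nested inside
    <relatedItem> or <extension> are left as written.
    """
    tree = ET.parse(path)
    root = tree.getroot()
    for name in wrappers:
        qualified = "{%s}%s" % (MODS_NS, name)
        matches = [child for child in root if child.tag == qualified]
        for duplicate in matches[1:]:
            matches[0].extend(list(duplicate))
            root.remove(duplicate)
    tree.write(path, encoding="utf-8", xml_declaration=True)


# Hypothetical use from a hook script:
#   clean_mods_file("/path/to/output/MODS.xml", ["originInfo", "titleInfo"])
```

The point of this split is that MIK itself would write duplicate wrappers without trying to be clever, and each site could decide in its own hook which wrappers to merge.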

mjordan avatar Feb 06 '17 20:02 mjordan

This would seem like the better approach. When we first built in the oneParentWrapperElement method, we had not yet started leveraging the concept of post-write hooks. In fact, I think LSU used post-write hooks for the majority of their XML cleanup via XSLTs - see https://github.com/MarcusBarnes/mik/tree/master/extras/lsu/xsl for some examples.

MarcusBarnes avatar Feb 06 '17 20:02 MarcusBarnes

Yes. @bondjimbond what are your thoughts?

mjordan avatar Feb 06 '17 20:02 mjordan

So long as it's simple to use and well-documented, any solution that works looks good to me.

bondjimbond avatar Feb 06 '17 21:02 bondjimbond

Just adding a note that this issue also applies to CsvToMods metadata parser, which is what @bondjimbond was using.

mjordan avatar Feb 07 '17 00:02 mjordan

If this were done after the MODS document is created, then an analog to Python's lxml etree library could do this. For example:

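def merge_same_fields(orig_etree):
    """Merge later siblings into the first sibling with the same tag and attributes.

    orig_etree is the parent lxml element whose children are compared.
    """
    for elem in list(orig_etree):
        if elem.getparent() is None:
            continue  # already merged into an earlier sibling and removed
        for following_elem in list(elem.itersiblings()):
            if elem.tag == following_elem.tag and dict(elem.attrib) == dict(following_elem.attrib):
                # append at the end (insert(-1, ...) would land before the last child)
                elem.extend(list(following_elem))
                orig_etree.remove(following_elem)  # drop the emptied duplicate
    return orig_etree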

GarrettArm avatar Mar 16 '17 22:03 GarrettArm