OAI-PMH Filegetter only works for DC

Open bondjimbond opened this issue 5 years ago • 11 comments

I want to extract objects from a repository using the MODS metadataPrefix, but I'm finding that I can't get the files.

It turns out that src/filegetters/OaipmhXpath.php only works when the metadataPrefix is DC:

        // Parse out the dc:identifier whose value starts with 'http'.
        $dom = new \DOMDocument;
        $xml = file_get_contents($raw_metadata_path);
        $dom->loadXML($xml);
        $xpath = new \DOMXPath($dom);
        $xpath->registerNamespace('oai_dc', 'http://www.openarchives.org/OAI/2.0/oai_dc/');
        $xpath->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');
        $download_url_elements = $xpath->query($this->xpathExpression);

We need to either make this file work for multiple metadataPrefix choices, or have a separate fileGetter for MODS.

bondjimbond avatar Jul 02 '19 19:07 bondjimbond

@bondjimbond there is https://github.com/MarcusBarnes/mik/blob/master/src/metadataparsers/mods/OaiToMods.php. If you use

[METADATA_PARSER]
class = mods\OaiToMods

in conjunction with

[FETCHER]
metadata_prefix = mods

(or whatever the correct metadataPrefix value is) what happens?

mjordan avatar Jul 02 '19 19:07 mjordan

[2019-07-02 18:47:56] ErrorException.ERROR: ErrorException {"message":"DOMXPath::query(): Undefined namespace prefix","code":{"record_key":"oai%3Amruir.mtroyal.ca%3A11205%2F98","raw_metadata_path":"/Volumes/Arca/tmp/oaitest_temp/oai%3Amruir.mtroyal.ca%3A11205%2F98.metadata","dom":"[object] (DOMDocument: {})","xml":"<record xmlns=\"http://www.openarchives.org/OAI/2.0/\">\n            <header>\n                <identifier>oai:mruir.mtroyal.ca:11205/98</identifier>\n                <datestamp>2015-06-08T16:02:09Z</datestamp>\n                <setSpec>com_11205_20</setSpec>\n                <setSpec>com_11205_12</setSpec>\n                <setSpec>col_11205_43</setSpec>\n            </header>\n            <metadata><mods:mods xmlns:mods=\"http://www.loc.gov/mods/v3\" xmlns:doc=\"http://www.lyncode.com/xoai\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-1.xsd\">\n<mods:name>\n<mods:role>\n<mods:roleTerm type=\"text\">author</mods:roleTerm>\n</mods:role>\n<mods:namePart>Hayman, Richard</mods:namePart>\n</mods:name>\n<mods:extension>\n<mods:dateAccessioned encoding=\"iso8601\">2014-02-13T20:44:02Z</mods:dateAccessioned>\n</mods:extension>\n<mods:extension>\n<mods:dateAvailable encoding=\"iso8601\"/>\n</mods:extension>\n<mods:originInfo>\n<mods:dateIssued encoding=\"iso8601\">2009</mods:dateIssued>\n</mods:originInfo>\n<mods:identifier type=\"citation\">Hayman, R. (2009). Human rights software: Information support solutions for social justice. Information for Social Change, 29, 44-67.</mods:identifier>\n<mods:identifier type=\"issn\">1756-901X</mods:identifier>\n<mods:identifier type=\"uri\">http://hdl.handle.net/11205/98</mods:identifier>\n<mods:abstract>Human rights centres and non-governmental organizations (NGOs) have crucial information support needs, many of which can be met by the existing and ongoing development of information technology software applications. 
For communication and Internet use, the psiphon program allows for secure and anonymous information exchange and distribution, including firewall circumvention. For data collection, organization, encryption, and storage, Martus software can be deployed to help protect sensitive information and identities. Based on documented projects and websites, the following research examines these emancipatory tools to determine: the technologies in use, emergent, and under development; their possible usage in the critical arenas under discussion; and, the greater effects of these technologies as they relate to social justice and information access in the global information society. The purpose is to raise awareness within human rights communities and information centres about the existence and availability of these tools, so that these groups may find appropriate and accessible solutions that match their information support needs. Further, it is hoped that the information presented here will generate open, intercultural, and international discussions of human rights policy development, strategic planning, and implementation.</mods:abstract>\n<mods:language>\n<mods:languageTerm authority=\"rfc3066\">en</mods:languageTerm>\n</mods:language>\n<mods:accessCondition type=\"useAndReproduction\">Attribution-NonCommercial-NoDerivs 2.5 Canada</mods:accessCondition>\n<mods:subject>\n<mods:topic>Human rights</mods:topic>\n</mods:subject>\n<mods:subject>\n<mods:topic>Social justice</mods:topic>\n</mods:subject>\n<mods:subject>\n<mods:topic>Librarianship</mods:topic>\n</mods:subject>\n<mods:titleInfo>\n<mods:title>Human Rights Software: Information Support Solutions For Social Justice</mods:title>\n</mods:titleInfo>\n<mods:genre>Article</mods:genre>\n<mods:objectIdentifierValue>http://mruir.mtroyal.ca/xmlui/bitstream/11205/98/1/Human+Rights+Software.pdf</mods:objectIdentifierValue>\n</mods:mods>\n</metadata>\n        </record>","xpath":"[object] (DOMXPath: 
{})"},"severity":2,"file":"/Users/brandon/sfuvault/mik/src/filegetters/OaipmhXpath.php","line":61} []
[2019-07-02 18:47:56] ErrorException.ERROR: ErrorException {"message":"problem writing package","record_key":"oai%3Amruir.mtroyal.ca%3A11205%2F98","details":"[object] (mik\\exceptions\\MikErrorException(code: 0):  at /Users/brandon/sfuvault/mik/mik:105)"} []

And if I leave the METADATA_PARSER section at dc\OaiToDc it's the same...

[2019-07-02 18:47:56] ErrorException.ERROR: ErrorException {"message":"DOMXPath::query(): Undefined namespace prefix","code":{"record_key":"oai%3Amruir.mtroyal.ca%3A11205%2F98","raw_metadata_path":"/Volumes/Arca/tmp/oaitest_temp/oai%3Amruir.mtroyal.ca%3A11205%2F98.metadata","dom":"[object] (DOMDocument: {})","xml":"<record xmlns=\"http://www.openarchives.org/OAI/2.0/\">\n            <header>\n                <identifier>oai:mruir.mtroyal.ca:11205/98</identifier>\n                <datestamp>2015-06-08T16:02:09Z</datestamp>\n                <setSpec>com_11205_20</setSpec>\n                <setSpec>com_11205_12</setSpec>\n                <setSpec>col_11205_43</setSpec>\n            </header>\n            <metadata><mods:mods xmlns:mods=\"http://www.loc.gov/mods/v3\" xmlns:doc=\"http://www.lyncode.com/xoai\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-1.xsd\">\n<mods:name>\n<mods:role>\n<mods:roleTerm type=\"text\">author</mods:roleTerm>\n</mods:role>\n<mods:namePart>Hayman, Richard</mods:namePart>\n</mods:name>\n<mods:extension>\n<mods:dateAccessioned encoding=\"iso8601\">2014-02-13T20:44:02Z</mods:dateAccessioned>\n</mods:extension>\n<mods:extension>\n<mods:dateAvailable encoding=\"iso8601\"/>\n</mods:extension>\n<mods:originInfo>\n<mods:dateIssued encoding=\"iso8601\">2009</mods:dateIssued>\n</mods:originInfo>\n<mods:identifier type=\"citation\">Hayman, R. (2009). Human rights software: Information support solutions for social justice. Information for Social Change, 29, 44-67.</mods:identifier>\n<mods:identifier type=\"issn\">1756-901X</mods:identifier>\n<mods:identifier type=\"uri\">http://hdl.handle.net/11205/98</mods:identifier>\n<mods:abstract>Human rights centres and non-governmental organizations (NGOs) have crucial information support needs, many of which can be met by the existing and ongoing development of information technology software applications. 
For communication and Internet use, the psiphon program allows for secure and anonymous information exchange and distribution, including firewall circumvention. For data collection, organization, encryption, and storage, Martus software can be deployed to help protect sensitive information and identities. Based on documented projects and websites, the following research examines these emancipatory tools to determine: the technologies in use, emergent, and under development; their possible usage in the critical arenas under discussion; and, the greater effects of these technologies as they relate to social justice and information access in the global information society. The purpose is to raise awareness within human rights communities and information centres about the existence and availability of these tools, so that these groups may find appropriate and accessible solutions that match their information support needs. Further, it is hoped that the information presented here will generate open, intercultural, and international discussions of human rights policy development, strategic planning, and implementation.</mods:abstract>\n<mods:language>\n<mods:languageTerm authority=\"rfc3066\">en</mods:languageTerm>\n</mods:language>\n<mods:accessCondition type=\"useAndReproduction\">Attribution-NonCommercial-NoDerivs 2.5 Canada</mods:accessCondition>\n<mods:subject>\n<mods:topic>Human rights</mods:topic>\n</mods:subject>\n<mods:subject>\n<mods:topic>Social justice</mods:topic>\n</mods:subject>\n<mods:subject>\n<mods:topic>Librarianship</mods:topic>\n</mods:subject>\n<mods:titleInfo>\n<mods:title>Human Rights Software: Information Support Solutions For Social Justice</mods:title>\n</mods:titleInfo>\n<mods:genre>Article</mods:genre>\n<mods:objectIdentifierValue>http://mruir.mtroyal.ca/xmlui/bitstream/11205/98/1/Human+Rights+Software.pdf</mods:objectIdentifierValue>\n</mods:mods>\n</metadata>\n        </record>","xpath":"[object] (DOMXPath: 
{})"},"severity":2,"file":"/Users/brandon/sfuvault/mik/src/filegetters/OaipmhXpath.php","line":61} []
[2019-07-02 18:47:56] ErrorException.ERROR: ErrorException {"message":"problem writing package","record_key":"oai%3Amruir.mtroyal.ca%3A11205%2F98","details":"[object] (mik\\exceptions\\MikErrorException(code: 0):  at /Users/brandon/sfuvault/mik/mik:105)"} []

bondjimbond avatar Jul 02 '19 19:07 bondjimbond

@mjordan Really this is about the FileGetter and not the MetadataParser, isn't it? The problem is that I use an XPath to find the link to download, but XPath can't recognize it because the FileGetter defines a Dublin Core namespace and not a MODS namespace.
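To illustrate the namespace point, here is a minimal standalone sketch, not the actual MIK code. It assumes the XPath expression in the .ini file is something like `//mods:objectIdentifierValue` (a guess based on the record in the error log above) and simply registers the MODS namespace alongside the two DC ones. The record is trimmed down from the one in the log.

```php
<?php
// Trimmed-down copy of the record from the error log above.
$xml = <<<'XML'
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <metadata>
    <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
      <mods:identifier type="uri">http://hdl.handle.net/11205/98</mods:identifier>
      <mods:objectIdentifierValue>http://mruir.mtroyal.ca/xmlui/bitstream/11205/98/1/Human+Rights+Software.pdf</mods:objectIdentifierValue>
    </mods:mods>
  </metadata>
</record>
XML;

$dom = new \DOMDocument;
$dom->loadXML($xml);
$xpath = new \DOMXPath($dom);

// The two namespaces OaipmhXpath.php already registers.
$xpath->registerNamespace('oai_dc', 'http://www.openarchives.org/OAI/2.0/oai_dc/');
$xpath->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');
// The extra registration a MODS-aware filegetter would need.
$xpath->registerNamespace('mods', 'http://www.loc.gov/mods/v3');

// Hypothetical XPath expression; in MIK this would come from the .ini file.
$expression = '//mods:objectIdentifierValue';
$nodes = $xpath->query($expression);

// query() returns false on an invalid expression, so check before using it.
$download_url = ($nodes !== false && $nodes->length > 0)
    ? trim($nodes->item(0)->nodeValue)
    : null;

echo $download_url, PHP_EOL;
```

With the `mods` prefix registered, the query can resolve the prefix instead of failing with "Undefined namespace prefix". Whether it then matches anything still depends on the XPath expression configured for the filegetter.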

bondjimbond avatar Jul 04 '19 18:07 bondjimbond

I added a new filegetter to #504 to address this. No longer saying "undefined namespace prefix" -- now it's just saying "No content file found in oai-pmh record".

bondjimbond avatar Jul 04 '19 19:07 bondjimbond

@bondjimbond Since you're requesting MODS over OAI, is it safe to assume that your source repository is Islandora? If so, then yes, I think we should just be grabbing the MODS datastream as a file and not get tangled up in metadata parsers. In that case, we can just fetch the MODS datastream using the (working?) DC metadata parser and then throw away the resulting DC XML files.

I was sure that we already had the ability to fetch any datastream we wanted using the https://github.com/MarcusBarnes/mik/wiki/Toolchain:-OAI-PMH-for-Islandora-repositories toolchain, but I need to confirm that. If not, it won't be difficult to make that happen.

mjordan avatar Jul 04 '19 21:07 mjordan

@mjordan Nope, it's actually a DSpace repository. They've got decent MODS, though, so it's nice to be able to pull that down and tweak it instead of extracting DC and then trying to reverse engineer roleTerms etc.

bondjimbond avatar Jul 05 '19 14:07 bondjimbond

Does DSpace's MODS have a predictable URL where you can download it (as per my last comment) or do you need to get it via OAI as metadata?

mjordan avatar Jul 05 '19 16:07 mjordan

You need to get it via OAI, unfortunately. The filename is made of some mix of parts of the title and some seemingly arbitrary numbers.

bondjimbond avatar Jul 05 '19 16:07 bondjimbond

Can you send me the OAI endpoint via email?

mjordan avatar Jul 05 '19 16:07 mjordan

The .ini file (which includes the endpoint) is attached to #502

bondjimbond avatar Jul 05 '19 16:07 bondjimbond

Also, here's an example file link: http://mruir.mtroyal.ca/xmlui/bitstream/11205/98/1/Human+Rights+Software.pdf

I think the 11205/98/1 is the handle, but the filename is not really predictable.

bondjimbond avatar Jul 05 '19 16:07 bondjimbond