jhove
XML Module: White spaces are required between publicId and systemId.: Line = 1, Column = 50
Dev Effort
1D
Description
Attached are two versions of the same XML, and the corresponding JHOVE output. The example comes from Roland at ethz.
I've recreated what he was seeing, and can't quite understand the reason for the error.
White spaces are required between publicId and systemId.: Line = 1, Column = 50
This doesn't seem to be about the XML itself: the reported line number does not change when I add the XML declaration to the original document. So I wonder whether it has something to do with the external dependencies.
Extracting the XSD, it seems to only refer to:
http://www.loc.gov/standards/mets/version18/mets.xsd
http://www.loc.gov/standards/mods/v3/mods-3-5.xsd
http://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd
http://www.loc.gov/standards/mix/mix20/mix20.xsd
I can't find systemId or publicId in any of them, so I am not sure what else to check at this point.
jhove-export_mets_2017.no-declaration.xml.txt export_mets_2017_no_declaration.xml.txt export_mets_2017_with_declaration.xml.txt jhove-export_mets_2017.declaration.xml.txt
N.B. There do seem to be three instances each of hard-coded literals with these values (publicid, systemid) in the XML Module, e.g.
https://github.com/openpreserve/jhove/blob/08baeef92fff3c15551c17c71f614089f1bed4bc/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/XmlModule.java#L757
https://github.com/openpreserve/jhove/blob/08baeef92fff3c15551c17c71f614089f1bed4bc/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/XmlModule.java#L754
The error is a SAX parser error, and yes, it is bubbling up from one of the dependencies. Specifically:
$ curl http://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd">here</a>.</p>
<hr>
<address>Apache/2.4.10 (Linux/SUSE) Server at www.danrw.de Port 80</address>
</body></html>
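i.e. the danrw-contract-1.xsd schema has moved, so the parser is handed the HTML redirect page instead of the schema. The DOCTYPE on the first line of that page has a public identifier but no system identifier, which is consistent with the error being reported at Line = 1, Column = 50. If I update the schema declaration to use https://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd it validates. It's surprising that the SAX parser does not follow redirects when fetching XSDs.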
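To check that the redirect page alone is enough to trigger the error, here is a minimal offline sketch that feeds a shortened copy of the 301 body to a stock JAXP SAX parser. The class and method names (RedirectPageRepro, parseError) are made up for illustration and are not part of JHOVE. On a JDK with the bundled Xerces-based parser I would expect the same "White spaces are required between publicId and systemId." message, but the exact wording depends on the parser implementation and locale.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not JHOVE code: parses a string as XML and
// returns the fatal parse error, if any.
final class RedirectPageRepro {

    // Shortened copy of the HTML body returned with the 301 response.
    static final String REDIRECT_BODY =
            "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n"
            + "<html><head>\n"
            + "<title>301 Moved Permanently</title>\n"
            + "</head><body>\n"
            + "<h1>Moved Permanently</h1>\n"
            + "</body></html>\n";

    // Returns the SAXParseException raised while parsing, or null if
    // the input is well-formed XML.
    static SAXParseException parseError(String body) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(body)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SAXParseException e = parseError(REDIRECT_BODY);
        if (e == null) {
            System.out.println("No parse error");
        } else {
            System.out.println(e.getMessage()
                    + ": Line = " + e.getLineNumber()
                    + ", Column = " + e.getColumnNumber());
        }
    }
}
```

The DOCTYPE uses PUBLIC with only a public identifier, which is legal in HTML 2.0 (SGML) but not in XML, where a system identifier must follow.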
Okay, here's a relevant StackOverflow Q that has a solution: http://stackoverflow.com/questions/29696638/how-to-validate-xml-with-schema-urls-that-return-http-301
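For discussion, a rough sketch of what a redirect-following resolver could look like. This is not the code from that answer and not what JHOVE currently does. RedirectFollowingResolver, isRedirect, resolveLocation and MAX_REDIRECTS are names invented here, and it assumes the parser consults the registered EntityResolver when it fetches schema documents. HttpURLConnection does not follow redirects that switch protocol (http to https), so the sketch follows the Location header by hand.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical resolver, not JHOVE code: follows HTTP redirects,
// including http -> https, before handing the stream to the parser.
final class RedirectFollowingResolver implements EntityResolver {

    // Assumed upper bound to avoid redirect loops.
    private static final int MAX_REDIRECTS = 5;

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolves a Location header (absolute or relative) against the current URL.
    static String resolveLocation(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        if (systemId == null
                || !(systemId.startsWith("http://") || systemId.startsWith("https://"))) {
            return null; // let the parser use its default handling
        }
        String url = systemId;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // follow manually instead
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                InputSource source = new InputSource(conn.getInputStream());
                source.setPublicId(publicId);
                source.setSystemId(url);
                return source;
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header: " + url);
            }
            url = resolveLocation(url, location);
            if (!(url.startsWith("http://") || url.startsWith("https://"))) {
                throw new IOException("Redirect to unsupported scheme: " + url);
            }
        }
        throw new IOException("Too many redirects for " + systemId);
    }

    public static void main(String[] args) {
        // Offline demonstration of the Location handling only.
        System.out.println(resolveLocation(
                "http://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd",
                "https://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd"));
    }
}
```

It would be registered with something like reader.setEntityResolver(new RedirectFollowingResolver()) on the XMLReader. Whether that fits with the entity handling the XML module already has would need checking.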
At the ETH Data Archive we got two additional files that may cause the same issue in JHOVE as the one described above. As in the previously attached file, JHOVE considers these two files as not well-formed because the path to the xsd schema is automatically redirected by the browser from http to https.
The attached file Dia_002-034_10776.xml is considered by JHOVE to be not well-formed (Dia_002-034_10776.xml.txt). Again the JHOVE error message is “space required between publicId and systemID“ (Dia_002-034_10776_JHOVEreport.xml.txt). The URI to the schema http://www.e-pics.ethz.ch/index/rosetta/schema/epics_rosetta_schema.xsd is redirected in my browser to the corresponding https location. To avoid the redirect, I replaced the path http://www.e-pics.ethz.ch in all its five instances with https://www.e-pics.ethz.ch (Dia_002-034_10776_httpReplacedByHttpsForAllePicsPathes.xml.txt). This file is valid and well-formed.
The file 10539670.xml is considered by JHOVE to be not well-formed (10539670.xml.txt). The error message is “premature end of file” (10539670_JHOVEreport.xml.txt). The file contains an invalid URL to an XSD file: http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml . I guess the file should be reported as well-formed even if the path to the schema is invalid? The URL is redirected in my browser from http to https. If I adapt the XML file by replacing http with https in that URL (10539670_httpReplacedByHttpsInwwwAbbyyCom.xml.txt), JHOVE reports the file (after some minutes of computing time) as well-formed but not valid, with the error message “cannot find declaration of element ‘document’” (10539670_JHOVEreport.xml.txt for the adapted file: 10539670_httpReplacedByHttpsInwwwAbbyyCom_JHOVEreport.xml.txt).
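On the question of whether a broken schema URL should affect well-formedness: a small sketch of a well-formedness-only check that never fetches external resources may help separate the two outcomes. This is only an illustration with invented names (WellFormedCheck, isWellFormed), not how the JHOVE XML module is implemented. It relies on a plain non-validating JAXP SAX parser not loading schemas named in xsi:schemaLocation, and it answers every external entity request with an empty stream. The example.invalid URLs are placeholders.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not JHOVE code: checks well-formedness only,
// without validating and without touching the network.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false); // no DTD validation; schema validation is not enabled either

            DefaultHandler handler = new DefaultHandler() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Answer every external entity (e.g. an external DTD)
                    // with an empty document instead of fetching it.
                    return new InputSource(new StringReader(""));
                }
            };
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return true;
        } catch (SAXException e) {
            return false; // fatal parse error: not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String withBrokenSchemaUrl =
                "<document xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" "
                + "xsi:schemaLocation=\"http://example.invalid/ns http://example.invalid/missing.xsd\">"
                + "<page/></document>";
        System.out.println(isWellFormed(withBrokenSchemaUrl));
        System.out.println(isWellFormed("<document><page></document>"));
    }
}
```

With a check like this, an unreachable or redirected schema would only show up at the validation stage, not as a well-formedness failure.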