XML Module: White spaces are required between publicId and systemId.: Line = 1, Column = 50

Open ross-spencer opened this issue 7 years ago • 5 comments

Dev Effort

1D

Description

Attached are two versions of the same XML, and the corresponding JHOVE output. The example comes from Roland at ethz.

I've recreated what he was seeing, and can't quite understand the reason for the error.

White spaces are required between publicId and systemId.: Line = 1, Column = 50

The reported line and column do not change when I add the XML declaration to the original document, so the error doesn't seem to be about the XML itself. I wonder if it has something to do with the external dependencies.

Extracting the XSD references, the document seems to refer only to:

http://www.loc.gov/standards/mets/version18/mets.xsd
http://www.loc.gov/standards/mods/v3/mods-3-5.xsd
http://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd
http://www.loc.gov/standards/mix/mix20/mix20.xsd

I can't find systemid or publicid in any of them, so I'm not sure what else to check at this point.

jhove-export_mets_2017.no-declaration.xml.txt export_mets_2017_no_declaration.xml.txt export_mets_2017_with_declaration.xml.txt jhove-export_mets_2017.declaration.xml.txt

ross-spencer avatar Apr 27 '17 04:04 ross-spencer

N.B. There do seem to be three instances each of hard-coded literals with these values (publicid, systemid) in the XML Module, e.g.

https://github.com/openpreserve/jhove/blob/08baeef92fff3c15551c17c71f614089f1bed4bc/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/XmlModule.java#L757

https://github.com/openpreserve/jhove/blob/08baeef92fff3c15551c17c71f614089f1bed4bc/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/XmlModule.java#L754

ross-spencer avatar Apr 27 '17 04:04 ross-spencer

The error is a SAX parser error, and yes, it is bubbling up from one of the dependencies. Specifically:

$ curl http://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd">here</a>.</p>
<hr>
<address>Apache/2.4.10 (Linux/SUSE) Server at www.danrw.de Port 80</address>
</body></html>

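i.e. the danrw-contract-1.xsd schema has moved, and the parser receives this HTML page instead of a schema. The page's DOCTYPE declares a public identifier with no system identifier, and its closing > is the 50th character of line 1, which matches the reported Line = 1, Column = 50. If I update the schema declaration to use https://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd it validates. It's surprising that the SAX parser does not follow redirects when fetching XSDs.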
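
The error can be reproduced without any network access by feeding the start of that 301 page straight to a SAX parser. This is a minimal sketch, assuming the JDK's built-in JAXP parser; the class and method names are made up for illustration, and the exact message text depends on the parser implementation and locale.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class RedirectPageParseDemo {
    // Start of the 301 response body from the curl output (shortened).
    static final String REDIRECT_PAGE =
        "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n"
        + "<html><head>\n"
        + "<title>301 Moved Permanently</title>\n"
        + "</head><body>\n"
        + "</body></html>\n";

    /** Parses the text as XML; returns the parse error, or null if it parsed cleanly. */
    static SAXParseException parseError(String text) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(text)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e; // carries the line and column of the failure
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SAXParseException e = parseError(REDIRECT_PAGE);
        if (e == null) {
            System.out.println("Parsed without error");
        } else {
            System.out.println(e.getMessage()
                + " Line = " + e.getLineNumber()
                + ", Column = " + e.getColumnNumber());
        }
    }
}
```

The parser fails inside the DOCTYPE, before it reaches any element, which is why the position stays the same no matter what the original XML document looks like.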

anjackson avatar Apr 27 '17 08:04 anjackson

Okay, here's a relevant StackOverflow Q that has a solution: http://stackoverflow.com/questions/29696638/how-to-validate-xml-with-schema-urls-that-return-http-301
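
For illustration, one way to handle such redirects is a custom EntityResolver that follows them by hand. Java's HttpURLConnection follows redirects by default, but not when they switch protocol from http to https, which is the case here. The sketch below is not necessarily what the linked answer does, and it is not JHOVE code: the class name, helper methods, and redirect limit are assumptions. Whether a given parser consults the EntityResolver when loading schemas should also be verified for the parser in use.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

final class RedirectFollowingResolver implements EntityResolver {
    private static final int MAX_REDIRECTS = 5; // arbitrary cap to avoid redirect loops

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    /** Resolves a Location header value (absolute or relative) against the URL that sent it. */
    static String resolveLocation(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        if (systemId == null
                || !(systemId.startsWith("http://") || systemId.startsWith("https://"))) {
            return null; // fall back to the parser's default resolution
        }
        String current = systemId;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // redirects are handled below
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                InputSource source = new InputSource(conn.getInputStream());
                source.setSystemId(current);
                return source;
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolveLocation(current, location);
        }
        throw new IOException("Too many redirects for " + systemId);
    }
}
```

Such a resolver would be registered with XMLReader.setEntityResolver before parsing. Following redirects automatically also means fetching from hosts the document never named, so a caller may prefer to restrict it to same-host http to https upgrades.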

anjackson avatar Apr 27 '17 15:04 anjackson

At the ETH Data Archive we have two additional files that may cause the same issue in JHOVE as the one described above. As with the previously attached file, JHOVE considers these two files not well-formed because the URL of the XSD schema is redirected from http to https, something a browser follows automatically.

The attached file Dia_002-034_10776.xml is considered by JHOVE to be not well-formed (Dia_002-034_10776.xml.txt). Again the JHOVE error message is “space required between publicId and systemID” (Dia_002-034_10776_JHOVEreport.xml.txt). The URI to the schema, http://www.e-pics.ethz.ch/index/rosetta/schema/epics_rosetta_schema.xsd, is redirected in my browser to the corresponding https location. To avoid the redirect, I replaced http://www.e-pics.ethz.ch with https://www.e-pics.ethz.ch in all five places where it occurs (Dia_002-034_10776_httpReplacedByHttpsForAllePicsPathes.xml.txt). The modified file is reported as well-formed and valid.

The file 10539670.xml is considered by JHOVE to be not well-formed (10539670.xml.txt). The error message is “premature end of file” (10539670_JHOVEreport.xml.txt). The file contains an invalid URL to an XSD file: http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml . I guess even with an invalid path to a schema, the file should still be well-formed? The URL is redirected in my browser from http to https. If I replace http with https in that URL (10539670_httpReplacedByHttpsInwwwAbbyyCom.xml.txt), JHOVE reports the file (after some minutes of computing time) as well-formed but not valid, with the error message “cannot find declaration of element ‘document’” (10539670_httpReplacedByHttpsInwwwAbbyyCom_JHOVEreport.xml.txt).
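
To illustrate the difference between the two checks: with plain JAXP, well-formedness can be tested on its own, because a non-validating parser with no schema configured does not fetch the locations named in xsi:schemaLocation. This is a minimal sketch, assuming the JDK's built-in parser. It says nothing about how JHOVE itself decides, and the class name and the unreachable.invalid URL in the usage note are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormednessCheck {
    /** Returns true if the text parses as XML; no DTD or schema validation is performed. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false); // no DTD validation; no schema is set either
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // any parse error means the text is not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, a document whose xsi:schemaLocation points at http://unreachable.invalid/schema.xsd still counts as well-formed, while a document with an unclosed root element does not.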

rolandsuri avatar May 23 '17 15:05 rolandsuri