XML that is well-formed according to xmllint reported as "Not well-formed" by JHOVE

Open bitsgalore opened this issue 7 years ago • 12 comments

Dev Effort

1D - investigation

Description

Look at the following METS file:

mets-test.xml

According to xmllint (v. 20904) it contains well-formed XML:

xmllint --noout mets-test.xml

Result: no error messages (which means it is well-formed)

Next I try to validate the same file with JHOVE (v. 1.18.1):

jhove -m XML-hul mets-test.xml

Result:

Jhove (Rel. 1.18.1, 2017-11-30)
 Date: 2018-02-22 12:53:55 CET
 RepresentationInformation: mets-test.xml
  ReportingModule: XML-hul, Rel. 1.4 (2007-01-08)
  LastModified: 2018-02-21 17:31:20 CET
  Size: 106387
  Format: XML
  Status: Not well-formed
  SignatureMatches:
   XML-hul
  ErrorMessage: XML document structures must start and end within the same entity.: Line = 1, Column = 97
  MIMEtype: text/xml

So according to JHOVE the file is not well-formed! The error message is even more puzzling, as the position corresponds to a namespace definition.

The strangest thing of all is that some XML documents that JHOVE reported as well-formed only yesterday suddenly give me the above error today! I initially suspected something weird going on in my JHOVE configuration, but after uninstalling and reinstalling JHOVE and checking on 2 different machines (one Windows, one Linux) I keep getting the above error for several XML documents that somehow passed well-formedness checks earlier on. Or am I overlooking something obvious here?

bitsgalore avatar Feb 22 '18 12:02 bitsgalore

Just to make things more confusing, I downloaded it and it works fine for me!

Jhove (Rel. 1.14.0, 2016-10-06)
 Date: 2018-02-22 12:49:27 GMT
 RepresentationInformation: /Users/andy/Downloads/mets-test.xml
  ReportingModule: UTF8-hul, Rel. 1.6 (2014-07-18)
  LastModified: 2018-02-22 12:49:17 GMT
  Size: 106387
  Format: UTF-8
  Status: Well-Formed and valid
  MIMEtype: text/plain; charset=UTF-8
  UTF8Metadata: 
   Characters: 106386
   UnicodeBlocks: Basic Latin, Latin-1 Supplement
   LineEndings: LF

But note that's JHOVE 1.14.6 (from Homebrew).

Is the real problem the fact that it's failing to download the XSDs? e.g. because it's not picked up the proxy?

anjackson avatar Feb 22 '18 12:02 anjackson

@anjackson I think you're on to something: I just re-ran JHOVE on that file (and some other ones that were giving me this problem) and it's now working for me as well (same JHOVE versions, same files, on both the Windows and the Linux machine)! So yes, the cause might well be JHOVE failing to download the XSDs. However, if this is so, I'd really expect that JHOVE would tell me this instead of marking the file as "Not well-formed" (for one thing, the XSDs are not needed to check for well-formedness).
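To illustrate the point that no XSD is needed for a well-formedness check, here is a minimal Python sketch using only the standard library. It is not what JHOVE does internally; the function name `is_well_formed` and the sample documents are made up for illustration. The parser treats `xsi:schemaLocation` as an ordinary attribute and never tries to download anything.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> tuple[bool, str]:
    """Return (True, "") if xml_text parses, else (False, parser message).

    Only well-formedness is checked; no schema is fetched or applied.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


# Hypothetical sample: the schema URL is never contacted by the parser.
GOOD = (
    '<mets xmlns="http://www.loc.gov/METS/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/METS/ '
    'http://unreachable.invalid/mets.xsd"><metsHdr/></mets>'
)
# Hypothetical sample with an unclosed element.
BAD = "<mets><metsHdr></mets>"

if __name__ == "__main__":
    print(is_well_formed(GOOD))
    print(is_well_formed(BAD))
```

With a check like this, a schema download problem could only ever affect the validity verdict, not the well-formedness verdict.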

bitsgalore avatar Feb 22 '18 13:02 bitsgalore

@bitsgalore I can't remember the details, but I've hit problems before with JHOVE giving really weird errors when remote XSDs have not been available. One of those times when I wondered what the advantage of running JHOVE is, compared to xmllint/etc.

anjackson avatar Feb 22 '18 13:02 anjackson

Update to the above: as an additional test I re-ran JHOVE after disabling my network connection. I expected that this would reproduce my original error. Instead, JHOVE parsed the document correctly and reported it as "Well-formed, but not valid", indicating in an InfoMessage that the schema could not be read. Which only makes things even more puzzling ...

@anjackson As for the added value of JHOVE over xmllint: xmllint doesn't automatically fetch the XSDs, so you have to specify an XSD on the command-line (I think you even need to download a local copy of the XSD, but I'm not 100% sure; also I don't remember how/if xmllint handles multiple XSD definitions). This makes xmllint a massive pain in the ass with things like METS files, and JHOVE makes handling these a lot easier. That is, until you end up running into weird problems like this one!

bitsgalore avatar Feb 22 '18 13:02 bitsgalore

@bitsgalore in that case, xmlstarlet val FTW!

anjackson avatar Feb 22 '18 13:02 anjackson

@anjackson may I point out that you accidentally used the UTF8-hul, not XML-hul? So nothing about XML validity here.

But anyway, I also validated the file with JHOVE 1.18.1 without web access i.e., no schema files available. It works as expected: JHOVE reports the file to be well-formed, but not valid.

marhop avatar Feb 22 '18 14:02 marhop

Hah! Thanks @marhop - JHOVE's behaviour is still confusing me after all these years. You'd think I'd know by now.

EDIT: Note that the times I've had trouble with JHOVE downloading XSDs were not when they were simply unavailable (that's fine), but when the server returned something that was not XML.
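A rough Python sketch of guarding against that case, assuming the schema bytes have already been downloaded by some other means. The helper name `looks_like_xsd` and the sample bodies are hypothetical, and this is not a description of how JHOVE handles schema downloads; it only shows how a non-XML response (for example an HTML error page) could be told apart from a real schema before it reaches the validator.

```python
import xml.etree.ElementTree as ET

XSD_ROOT_TAG = "{http://www.w3.org/2001/XMLSchema}schema"


def looks_like_xsd(body: bytes) -> bool:
    """Return True only if body is well-formed XML whose root is xs:schema."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not XML at all, e.g. an HTML error page with unclosed tags.
        return False
    return root.tag == XSD_ROOT_TAG


# Hypothetical response bodies for illustration.
ERROR_PAGE = b"<html><body><p>Error 500<br></body></html>"
REAL_SCHEMA = (
    b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    b'<xs:element name="example" type="xs:string"/></xs:schema>'
)

if __name__ == "__main__":
    print(looks_like_xsd(ERROR_PAGE))
    print(looks_like_xsd(REAL_SCHEMA))
```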

anjackson avatar Feb 22 '18 14:02 anjackson

@marhop good call, I had overlooked that in @anjackson's answer as well!

@anjackson just had a look at xmlstarlet, but just like xmllint it needs a reference to the schema as a command line arg. Also it's not clear to me how it handles multiple schemas (if at all?).

bitsgalore avatar Feb 22 '18 14:02 bitsgalore

@bitsgalore Really!? Sorry, I thought I'd checked that, although it was a long time ago. My apologies for misremembering.

anjackson avatar Feb 22 '18 14:02 anjackson

@anjackson no prob. Incidentally it does handle remote XSDs (and so does xmllint, I now see); looks like both tools are really similar.

bitsgalore avatar Feb 22 '18 14:02 bitsgalore

Some more comments. Indeed, in my experience, xmllint and xmlstarlet don't cope very well with multiple schemas. One way to make them work is to create a wrapper.xsd which imports all the namespaces you need and then call xmllint with the schema option.
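As a sketch of what such a wrapper could look like, the following Python snippet generates one from a mapping of namespace URI to schema location. The function name `build_wrapper_xsd` and the local paths under `schemas/` are assumptions for illustration only; adapt them to wherever the XSD copies actually live.

```python
from xml.sax.saxutils import quoteattr


def build_wrapper_xsd(imports: dict[str, str]) -> str:
    """Build a wrapper schema with one xs:import per namespace.

    `imports` maps a namespace URI to the location of its XSD.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">',
    ]
    for namespace, location in imports.items():
        # quoteattr escapes the value and adds the surrounding quotes.
        lines.append(
            f"  <xs:import namespace={quoteattr(namespace)} "
            f"schemaLocation={quoteattr(location)}/>"
        )
    lines.append("</xs:schema>")
    return "\n".join(lines) + "\n"


# Hypothetical local copies of the schemas.
EXAMPLE_IMPORTS = {
    "http://www.loc.gov/METS/": "schemas/mets.xsd",
    "http://www.w3.org/1999/xlink": "schemas/xlink.xsd",
}

if __name__ == "__main__":
    with open("wrapper.xsd", "w", encoding="utf-8") as handle:
        handle.write(build_wrapper_xsd(EXAMPLE_IMPORTS))
```

The generated file would then be passed to xmllint along the lines of `xmllint --noout --schema wrapper.xsd mets-test.xml`.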

Moreover, relying on external URLs to validate XML files is not very safe: currently, the loc site is having problems (I get error 500 when asking for http://www.loc.gov/standards/mets/mets.xsd). The best way to handle schemas is to have a local copy of every XSD and implement a catalog. In JHOVE, you can configure that in jhove.conf:

 <module>
  <class>edu.harvard.hul.ois.jhove.module.XmlModule</class>
  <param>withTextMD=true</param>
  <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param>
 </module>

This is roughly documented here and probably more should be done here.

FWIW, for XML files, we use JHOVE to extract information (the textMD structure) and then we validate the XML with Xerces coupled with a catalog resolver, fed with the schemas we have decided to import locally. An XML file using an unknown schema is just checked for well-formedness.
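The decision logic of that workflow can be sketched in Python roughly as follows. This is only an illustration of the idea, not our actual Xerces setup: the catalog contents, the paths, and the names `declared_namespaces` and `plan_validation` are all hypothetical, and the real validation step is left out.

```python
import xml.etree.ElementTree as ET

XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"

# Hypothetical catalog: namespace URI -> local copy of the schema.
LOCAL_CATALOG = {
    "http://www.loc.gov/METS/": "/home/schemas/mets.xsd",
    "http://www.w3.org/1999/xlink": "/home/schemas/xlink.xsd",
}


def declared_namespaces(xml_text: str) -> list[str]:
    """List namespace URIs named in any xsi:schemaLocation attribute."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    found: list[str] = []
    for element in root.iter():
        # schemaLocation holds whitespace-separated namespace/location pairs.
        tokens = element.get(XSI_SCHEMA_LOCATION, "").split()
        for namespace in tokens[0::2]:
            if namespace not in found:
                found.append(namespace)
    return found


def plan_validation(xml_text: str, catalog: dict[str, str] = LOCAL_CATALOG):
    """Return ("validate", {namespace: local path}) if every declared schema
    is in the catalog, otherwise ("well-formed-only", {})."""
    namespaces = declared_namespaces(xml_text)
    if namespaces and all(ns in catalog for ns in namespaces):
        return "validate", {ns: catalog[ns] for ns in namespaces}
    return "well-formed-only", {}


# Hypothetical sample documents.
KNOWN = (
    '<mets xmlns="http://www.loc.gov/METS/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/METS/ '
    'http://www.loc.gov/standards/mets/mets.xsd"/>'
)
UNKNOWN = (
    '<doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.example.com/schema '
    'http://www.example.com/schema.xsd"/>'
)

if __name__ == "__main__":
    print(plan_validation(KNOWN))
    print(plan_validation(UNKNOWN))
```

Because the lookup never leaves the local catalog, an outage or a broken response on a remote schema server cannot change the result.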

tledoux avatar Feb 22 '18 18:02 tledoux