Improve XMP metadata handling
This PR changes how XMP metadata is validated in TIFF, GIF, JPEG, and PDF.
Prior to this PR, an error with respect to XMP (e.g., TIFF-HUL-14) would be raised if and only if the XMP metadata was enclosed in a so-called packet wrapper that contained an encoding attribute to declare the encoding of the XMP data. This was problematic for several reasons:
- A packet wrapper is a pair of XML processing instructions that is intended to facilitate scanning a byte stream of unknown format for XMP metadata by enclosing the actual XML data with very specific marker strings, similar to magic numbers used for file format identification. However, a packet wrapper is not recommended (albeit not illegal) if the location of XMP metadata in a file is well-defined. This is the case in all of TIFF, GIF, JPEG, and PDF, so JHOVE could just as well ignore the packet wrapper. (Adobe XMP Specification Part 1 (2012), pages 10-11)
- The encoding attribute has been deprecated since at least 2004. (Adobe XMP Specification (2004), page 30)
- Moreover, XMP metadata in all of TIFF, GIF, JPEG, and PDF is explicitly required to use UTF-8 since at least 2010, so JHOVE does not need the encoding attribute anyway. (Adobe XMP Specification Part 3 (2010), this hasn't changed in the current version)
- There was a bug in the code that handled the encoding string, and that code had been copy/pasted wherever XMP was processed, so XMP validation seems never to have worked as intended. (see for example 7d540205c736d9847f85b7c926b7042c20ad7114)
- As a result, the only way to get an XMP-related error was a (not recommended) packet wrapper with a (deprecated) encoding attribute. The actual XML that is the XMP metadata was never checked at all.
With this PR, XMP metadata is checked as follows:
- A packet wrapper, and in particular its encoding attribute, is ignored because it is irrelevant for these formats, though not strictly illegal.
- XMP metadata is expected to be encoded in UTF-8 because this is prescribed for the file formats JHOVE is dealing with. Other encodings will raise an error.
- XMP metadata is checked for well-formedness, so fundamentally broken XML will raise an error. However, no validation against an XML schema is performed: XMP is extensible, so schema validation would produce a lot of failures whenever custom schema files are not available.
- Files containing broken XMP metadata are rated as "well-formed, but not valid". I'm not sure whether this conforms to JHOVE's wider policy but it seemed sensible to me because I cannot imagine broken XMP leading to serious issues. IMHO, even a mere warning/info would be enough.
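To illustrate the checks listed above, here is a minimal sketch. It is not the code from this PR; the class and method names are made up for illustration. It assumes the raw XMP bytes have already been extracted from the file, decodes them strictly as UTF-8, and then runs a non-validating SAX parse. Note that a packet wrapper needs no special handling here because the parser simply sees it as ordinary processing instructions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not part of JHOVE).
final class XmpCheckSketch {

    /** Returns true if the bytes are strict UTF-8 and parse as well-formed XML. */
    static boolean isWellFormedUtf8Xml(byte[] xmp) {
        try {
            // Strict decoding: malformed byte sequences raise
            // CharacterCodingException (an IOException) instead of
            // being silently replaced.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(xmp));

            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false); // well-formedness only, no schema or DTD validation
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

            // DefaultHandler rethrows fatal (well-formedness) errors as SAXParseException.
            factory.newSAXParser().parse(new ByteArrayInputStream(xmp), new DefaultHandler());
            return true;
        } catch (IOException | SAXException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

In a module, a `false` result would map to the format's XMP error ID (e.g., TIFF-HUL-14) and to the "well-formed, but not valid" rating described above.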
Note that I added two new error IDs to account for invalid XMP metadata, GIF-HUL-11 and JPEG-HUL-15. I will add them to the Wiki if/when this PR is accepted.
Cheers, Martin
Ah, I see. CI does more than just mvn test. Let me look into this and get back to you ...
OK, sorry it took so long - priorities ... Now I finally found some time to investigate the bbt-jhove CI failures. Not to sound presumptuous but I think none of the errors indicate a problem with the changes in this PR.
First, several JPEG2000 files lead to an unhandled EOFException ...
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1911.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1951.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-2021.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1971.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-2011.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1984.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1961.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1920.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1937.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/meth_is_2_no_icc.jp2
- test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1999.jp2
... or an unhandled NullPointerException
- test-root/corpora/errors/modules/JPEG2000-hul/openJPEG15.jp2
However, since I haven't messed with JPEG2000 in this PR and since these exceptions are thrown by the much older JHOVE v1.28 as well I deny any responsibility for them. ;-)
Second, two PDF files do indeed fail the tests because validating them doesn't lead to the expected results.
- test-root/targets/1.34/errors/modules/PDF-hul/pdf-hul-43-govdocs-486355.pdf.jhove.xml
  - Expected child nodelist length '23' but was '17' - comparing <repInfo...> at /jhove[1]/repInfo[1] to <repInfo...> at /jhove[1]/repInfo[1]
- test-root/targets/1.34/errors/modules/PDF-hul/pdf-hul-22-govdocs-000187.pdf.jhove.xml
  - Expected child nodelist length '23' but was '17' - comparing <repInfo...> at /jhove[1]/repInfo[1] to <repInfo...> at /jhove[1]/repInfo[1]
One might expect this to happen because JHOVE now detects erroneous XMP that it previously wasn't able to find, which naturally changes the validation results. And in fact, JHOVE now reports PDF-HUL-101 errors for these files. However, in these specific cases this is not actually caused by invalid XMP data but by the fact that the XMP data is stored in encrypted content streams that JHOVE cannot handle. JHOVE doesn't decrypt the content streams but happily feeds the encrypted data to its XML parser, which of course throws an exception that ultimately leads to the PDF-HUL-101 errors ...
From the XMP validation perspective this behaviour appears to be correct - the encrypted data isn't valid XMP but a seemingly random bitstream. This is of course far from perfect; I would rather know that JHOVE cannot read the content stream instead of being told that the XMP is invalid. However, dealing with encrypted content streams in general is very much out of scope of this PR, so I suggest you accept it as it is.
PS: We could of course stop validating XMP altogether if the content stream is encrypted. But as far as I can tell from the source code this cannot be reliably determined at the moment because the PdfModule._streamsEncrypted attribute is only set to true if the encryption dictionary contains the (optional) StmF key. This is probably another story that should not make this PR even larger ...
Hi @marhop, I'm starting to look at these test errors. Once I have some confidence in the results I'll patch the tests and look to merge this.