JHOVE module selection issues
Dev Effort
4D
Description
The current process to determine which module is selected to validate a given file suffers from issues which can easily result in a misleading report of the file's validity and the hiding of serious and otherwise discoverable format problems.
Problem
Currently, JHOVE validates a file using one module after another, discarding each report until a module declares the file to be well-formed, after which no other modules are checked and that single successful report is delivered to the user. If no other module reports the file as well-formed, JHOVE eventually reaches the built-in, higher-level BYTESTREAM module, which always declares the file to be a well-formed bytestream. The effect is that a malformed file is never actually reported as malformed.
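To make the selection behaviour concrete, here is a minimal sketch of the "first well-formed report wins" loop described above. It is not JHOVE's actual code: `ToyModule`, `FirstWellFormedWins` and the toy XML check are hypothetical stand-ins, and only the module names are borrowed from JHOVE.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a loaded module: a name plus a toy well-formedness check. */
final class ToyModule {
    final String name;
    final Predicate<String> wellFormed;

    ToyModule(String name, Predicate<String> wellFormed) {
        this.name = name;
        this.wellFormed = wellFormed;
    }
}

final class FirstWellFormedWins {
    /** Returns the name of the first module, in load order, that reports the content as well-formed. */
    static String select(List<ToyModule> modules, String content) {
        for (ToyModule module : modules) {
            if (module.wellFormed.test(content)) {
                return module.name; // reports from earlier modules are discarded
            }
        }
        return "(no report)"; // only reached if no catch-all module is loaded
    }

    /** Toy configuration: an XML-like check followed by a catch-all byte stream check. */
    static List<ToyModule> sampleModules() {
        return List.of(
            // Assumption: "well-formed XML" is reduced to "starts with '<' and ends with '>'".
            new ToyModule("XML-hul", s -> s.startsWith("<") && s.endsWith(">")),
            new ToyModule("BYTESTREAM", s -> true));
    }

    public static void main(String[] args) {
        // Truncated XML: the XML check fails, so the catch-all report is returned instead.
        System.out.println(select(sampleModules(), "<a><b></a")); // prints "BYTESTREAM"
    }
}
```

The truncated XML input never surfaces as "malformed XML"; the only thing the caller sees is the catch-all module's name.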
The main problems here are that JHOVE assumes that a) if a module declares a file as malformed, that equates to the format being incompatible with the module; and b) only one module could be applicable for a given format. However, both are false assumptions given that a file could be both compatible and malformed, and any number of modules could be loaded into JHOVE which overlap in their format support (e.g. UTF-8's and XML's, or BYTESTREAM's and most other modules).
Additionally, users unfamiliar with JHOVE's underlying architecture can easily be unaware that a change of module could occur (say mid-batch), since they are guaranteed to receive a validity report anyway, which is often their (or their script's) primary concern. Little attention is paid, in our experience, to which module generated the report or even to whether the format field may have changed from what they expected it to be.
To avoid the module unpredictability, and files being reported as well-formed bytestreams instead of malformed files of the correct types, one must sidestep the described logic by forcing JHOVE to only use the expected formats' relevant modules (via the `-m` flag). Determining which modules should apply to which formats then becomes a process the user needs to implement instead of being able to rely on any of JHOVE's internal knowledge of the modules and their formats.
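As an illustration of the routing a user currently has to build themselves, here is a rough sketch that maps a file extension to a module name and assembles the corresponding `jhove -m` invocation. The `ModuleRouter` class and its table are hypothetical, and extension matching is only an assumption made for brevity; a real workflow would more likely route on the output of a format identification tool.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class ModuleRouter {
    // Hypothetical, user-maintained routing table from file extension to JHOVE module name.
    static final Map<String, String> MODULE_BY_EXTENSION = Map.of(
        "pdf", "PDF-hul",
        "jpg", "JPEG-hul",
        "gif", "GIF-hul",
        "wav", "WAVE-hul",
        "xml", "XML-hul");

    /** Builds the argument list for one file, or returns empty if no module is mapped. */
    static Optional<List<String>> commandFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty(); // no extension, so this table cannot route the file
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String module = MODULE_BY_EXTENSION.get(extension);
        if (module == null) {
            return Optional.empty(); // unmapped format: the user must decide what to do
        }
        return Optional.of(List.of("jhove", "-m", module, fileName));
    }

    public static void main(String[] args) {
        System.out.println(commandFor("report.PDF"));
        System.out.println(commandFor("notes.txt"));
    }
}
```

Returning an empty result for unmapped files, rather than silently falling back to JHOVE's own selection, keeps the user's routing decision explicit.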
While I think users should ultimately be making conscious decisions about which modules they trust and use for particular formats, I also think the current logic could be greatly improved for both the naive default configuration and the customized use-case.
Proposed Solution
I suggest we change the logic to something like the following:
If a file can be validated by multiple loaded modules, then JHOVE collects a report from each and returns them all as part of the file's representation information. Whether a module can recognize a format or not would be discovered via a query to each module's internal format identification logic (similar to what already exists in some modules' `checkSignatures` methods). Modules currently lacking identification logic could still return an answer in line with the old heuristic (if the module considers the file to be well-formed then it is compatible with the format) until they can be refactored to include it.
The above logic would prevent higher-level format reports (such as the BYTESTREAM module's) from hiding lower-level ones, and more predictably utilize all the modules a user has chosen to load via JHOVE's configuration file, instead of trying to guess a single best-fit module on a per-file basis. I think it would also make users more aware of their need to decide upon which module(s) they choose to use when more than one can apply. An added benefit would be the ability to easily extract information from the output of multiple complementary modules, should some provide details lacking in others.
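The proposed logic could be sketched roughly as follows. Everything here is hypothetical: `SketchModule` is not JHOVE's `Module` interface, the checks are toys, and the default `identifies` method simply encodes the fallback heuristic for modules that have no separate identification logic yet.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical module contract; not JHOVE's actual Module interface. */
interface SketchModule {
    String name();

    boolean isWellFormed(String content);

    /** Default mirrors the old heuristic: compatible only if the content is well-formed. */
    default boolean identifies(String content) {
        return isWellFormed(content);
    }
}

final class XmlSketch implements SketchModule {
    public String name() { return "XML-hul"; }
    public boolean isWellFormed(String c) { return c.startsWith("<") && c.endsWith(">"); }
    @Override public boolean identifies(String c) { return c.startsWith("<"); } // toy signature check
}

final class AsciiSketch implements SketchModule {
    public String name() { return "ASCII-hul"; }
    public boolean isWellFormed(String c) { return c.chars().allMatch(ch -> ch < 128); }
    // No identifies() override: this module falls back to the old heuristic.
}

final class ByteStreamSketch implements SketchModule {
    public String name() { return "BYTESTREAM"; }
    public boolean isWellFormed(String c) { return true; }
    @Override public boolean identifies(String c) { return true; }
}

final class CollectAllReports {
    /** Returns one entry per applicable module (name to well-formed flag), in load order. */
    static Map<String, Boolean> validate(List<SketchModule> modules, String content) {
        Map<String, Boolean> reports = new LinkedHashMap<>();
        for (SketchModule module : modules) {
            if (module.identifies(content)) {
                reports.put(module.name(), module.isWellFormed(content));
            }
        }
        return reports;
    }

    static List<SketchModule> sampleModules() {
        return List.of(new XmlSketch(), new AsciiSketch(), new ByteStreamSketch());
    }

    public static void main(String[] args) {
        // Prints {XML-hul=false, ASCII-hul=true, BYTESTREAM=true}
        System.out.println(validate(sampleModules(), "<a><b></a"));
    }
}
```

With the truncated XML input, the malformed XML result is reported alongside the higher-level reports instead of being hidden by them. Note that a module relying on the fallback heuristic runs its validation twice in this sketch, which a real implementation would want to avoid.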
Considerations
- The proposed solution would be a breaking change to the structure of JHOVE's current set of reports.
- Modules which currently lack separate format identification logic would need to have it created in order to be found applicable for files which they would consider compatible but malformed.
Related to #476. I was just experimenting with this. Depending on how loosely coupled a workflow is, i.e. if there isn't a format identification step which then calls a specific validation module in JHOVE, relying on JHOVE to do the matching can sometimes result in a BYTESTREAM match, which is ambiguous and, well, doesn't say much.
Acknowledging their differences, the JHOVE2 project in the past incorporated DROID, seemingly to counter exactly this problem. Could that still be a possible solution, e.g. by bringing in Nanite?
For my testing, I was using the most recent Skeleton Suite. I guess it's an extreme approach to fuzzing! :sweat_smile: I wasn't expecting much to come out of it. If the formats were to be parsed by JHOVE properly they would be all but guaranteed to fail validation in some way, so no criticism of the tools here. I need to investigate the parse functions of each of the modules more. But yep, while each of the files in the JHOVE specific collection below will identify positively in DROID, they're pretty much all parsed as BYTESTREAM when JHOVE is left to its own devices. Again this makes sense: it leaves it up to the developer pulling JHOVE + other tools together to route the objects through the correct module. Some method of bringing PRONOM-like tools and JHOVE together would however be really cool.
JHOVE specific skeleton-files
We might not need to add an external dependency for identification purposes since each module already contains the logic necessary to validate the signatures of the formats they support. I believe JHOVE merely needs a standard way to trigger those checks in each module to discover which modules are compatible.
That's interesting David. Reading back, I see what you mean more clearly now. I'm interested to delve into the identification component so will do that and try and learn a bit more.
@david-russo am I correct that an example of signature matching might be a JPEG matching these three bytes? There seems to be some optional logic to match against extension too?
If the suggestion is to match a file with a signature to aid in routing it through a certain module, that seems to be the purest/most standard approach to this. I'd suggest, though, that using modules in an overlapping way is a different solution to a different problem than the one #476 describes. It would be great to see the matching issue solved in a release first before exploring the output using multiple complementary modules. It would lend itself to a kinder integration I think, as you allude to, noting it would be a breaking change.
Yes, another example of signature matching can be seen here in the Wave module. And more compartmentalized in the GIF module here and, apparently, here.
But more importantly for how JHOVE currently works: as long as a module calls `RepInfo.setSigMatch()` at some point, it is considered to match the file, however it decides to determine that match. (Some modules, such as UTF-8, need to parse the entire file just to see if they match, for instance.) So in this way JHOVE already supports signature matching, it's just that it currently requires running each module's complete validation process over a file. My earlier suggestions around signature matching were just to optimize that process, because my solution to the larger problem of how JHOVE chooses between multiple matching/overlapping modules (which it's already doing, but poorly) would otherwise require running the entire validation process of every module over every file to gather those `setSigMatch` calls.
Anyway, apologies for the logorrhoea, but what I'm trying to say is that signature matching isn't really the issue causing #476. The issue is that even when JHOVE knows which modules it can use through its existing identification logic, it currently only delivers one report, so it has to make a decision on which report to return, and it currently decides based purely on which is the first to return a status of "well-formed". In the case of #476, this heuristic leads to the BYTESTREAM module's report being returned, even though a more applicable module may have been known, just because BYTESTREAM reported the file as well-formed.
This problem is often seen with the BYTESTREAM module because it matches all file types, but it isn't unique to it. The same problem could occur if you have an XML file, and both the XML module and UTF-8 module were enabled: if the XML file was of the correct UTF encoding, but had a malformed XML structure, JHOVE would only report the file as a well-formed UTF-8 file and not mention the malformed XML. (And if both of those modules found errors in the file, then all the user would be told is that it's a well-formed byte stream.) So that's the problem this issue is trying to address.
An improvement to the situation which wouldn't require breaking changes might be to hard-code a special case for the BYTESTREAM module so that it's only ever used when no other modules return a signature match. This would greatly diminish the number of occurrences, but wouldn't solve the problem for any other overlapping modules, such as XML and UTF-8. Alternatively, removing the BYTESTREAM module from validation entirely would also give the same results, have other benefits (as explained in #396), and could also be done without introducing any breaking changes or special cases.
Do either of those sound any better as a first step to a more complete solution?
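For discussion, the first option (the BYTESTREAM special case) might look something like the sketch below. `SigReport` and `ByteStreamLastResort` are hypothetical, not JHOVE classes, and one detail is my own assumption rather than something stated above: when other modules signature-match but none reports the file as well-formed, the sketch returns the first matching module's malformed report.

```java
import java.util.List;

/** Hypothetical per-module result; JHOVE's real RepInfo carries far more detail. */
final class SigReport {
    final String module;
    final boolean sigMatch;
    final boolean wellFormed;

    SigReport(String module, boolean sigMatch, boolean wellFormed) {
        this.module = module;
        this.sigMatch = sigMatch;
        this.wellFormed = wellFormed;
    }
}

final class ByteStreamLastResort {
    static final String BYTESTREAM = "BYTESTREAM";

    /** Picks one report, consulting BYTESTREAM only when no other module signature-matched. */
    static SigReport select(List<SigReport> reports) {
        SigReport firstMatch = null;
        for (SigReport report : reports) {
            if (report.module.equals(BYTESTREAM) || !report.sigMatch) {
                continue; // skip the catch-all and any module that did not match
            }
            if (report.wellFormed) {
                return report; // existing first-well-formed-wins heuristic, unchanged
            }
            if (firstMatch == null) {
                firstMatch = report;
            }
        }
        if (firstMatch != null) {
            return firstMatch; // assumption: surface the first matching module's malformed report
        }
        return new SigReport(BYTESTREAM, true, true); // nothing else matched: last resort
    }

    public static void main(String[] args) {
        SigReport xml = new SigReport("XML-hul", true, false);
        SigReport utf8 = new SigReport("UTF8-hul", true, true);
        SigReport bytes = new SigReport(BYTESTREAM, true, true);
        System.out.println(select(List.of(xml, bytes)).module);       // prints "XML-hul"
        System.out.println(select(List.of(xml, utf8, bytes)).module); // prints "UTF8-hul"
    }
}
```

The second call shows the limitation mentioned above: the malformed XML report is still hidden once an overlapping module such as UTF-8 reports the file as well-formed.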