droid
sigtool behaviour with inverted ranges
If you look at PRONOM fmt/142 the raw signature is 52494646{4}57415645666D7420[!10]{3}[!FEFF]{16-*}64617461
This is decomposed within the PRONOM binary signature file as:
<ByteSequence Endianness="Big-endian" Reference="BOFoffset">
<SubSequence Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
<Sequence>57415645666D7420</Sequence>
<LeftFragment MaxOffset="4" MinOffset="4" Position="1">52494646</LeftFragment>
<RightFragment MaxOffset="0" MinOffset="0" Position="1">[!10]</RightFragment>
<RightFragment MaxOffset="3" MinOffset="3" Position="2">[!FEFF]</RightFragment>
</SubSequence>
<SubSequence Position="2" SubSeqMinOffset="16">
<Sequence>64617461</Sequence>
</SubSequence>
</ByteSequence>
If I use sigtool to generate the XML instead, I get this...
<ByteSequence Reference="BOFoffset">
<SubSequence Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
<Sequence>57415645666D7420</Sequence>
<LeftFragment MaxOffset="4" MinOffset="4" Position="1">52494646</LeftFragment>
<RightFragment MaxOffset="0" MinOffset="0" Position="1">10</RightFragment>
<RightFragment MaxOffset="3" MinOffset="3" Position="2">[00:FD]</RightFragment>
</SubSequence>
<SubSequence Position="2" SubSeqMaxOffset="16" SubSeqMinOffset="16">
<Sequence>64617461</Sequence>
</SubSequence>
</ByteSequence>
Note that for the first RightFragment, the value has become '10' rather than [!10] thereby inverting the logic.
This can be reproduced more simply by running commands like:
sigtool [!10]
sigtool [!10:12]
which give the results:
10
[10:12]
...without the necessary exclamation mark, and thereby inverting the logic.
Although I mention sigtool here, I believe this is calling core DROID code.
Note that fmt/142 includes the troublesome and ambiguous [!FEFF] string, but this issue isn't related to that.
It was suggested that @nishihatapalmer might be interested in this behaviour.
Thanks - I'll have a look at it.
Bug is confirmed, and exists in the ByteSequenceSerializer in droid-core, in the method toPRONOMExpression().
This is failing to add inverted syntax when bytes or ranges are inverted. Fix should be fairly simple hopefully. The test suite could also be improved to ensure the standard syntax is serialized correctly in all cases.
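As a rough illustration of the behaviour the fix needs (a minimal sketch only, not DROID's ByteSequenceSerializer; the function name and signature are hypothetical), a serializer for a single-position byte or range has to carry the inversion flag through to the output:

```python
def serialize_byte_class(low: int, high: int, inverted: bool = False) -> str:
    """Render a single-position byte or range in PRONOM-style syntax.

    Hypothetical helper for illustration; it is not part of droid-core.
    """
    if not 0 <= low <= high <= 0xFF:
        raise ValueError("expected 0 <= low <= high <= 0xFF")
    body = f"{low:02X}" if low == high else f"{low:02X}:{high:02X}"
    if inverted:
        # The step reported as missing: keep the '!' marker.
        return f"[!{body}]"
    # A plain single byte needs no brackets; a plain range does.
    return body if low == high else f"[{body}]"


print(serialize_byte_class(0x10, 0x10, inverted=True))
print(serialize_byte_class(0x10, 0x12, inverted=True))
```

With the flag honoured, the two sigtool inputs from the report round-trip unchanged instead of losing the exclamation mark.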
As a slight aside, I notice in the second RightFragment in the example given we have:
<RightFragment MaxOffset="3" MinOffset="3" Position="2">[!FEFF]</RightFragment>
which is converted into:
<RightFragment MaxOffset="3" MinOffset="3" Position="2">[00:FD]</RightFragment>
But this is in fact correct, since not having the bytes FF and FE is the same as having the range 00:FD. The compiler generally attempts to find the most efficient construction for patterns, and it prefers ranges to inverted sets of bytes.
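Reading [!FEFF] as a set of single bytes (which is the interpretation described here), the equivalence with the range 00:FD can be checked directly. This is just set arithmetic, not DROID code:

```python
ALL_BYTES = set(range(0x100))

# "Any byte except FE and FF" ...
not_fe_or_ff = ALL_BYTES - {0xFE, 0xFF}
# ... versus the inclusive range 00:FD.
range_00_fd = set(range(0x00, 0xFD + 1))

print(not_fe_or_ff == range_00_fd)
```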
Just to flag, although it was a little before my time as researcher, I interpret the intent of [!FEFF] to be 'a 16-bit value that does not equal 0xFEFF'
On page 8-9 here: https://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf
• [!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte sequence containing no wildcards).
and : • [!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically between a and b, inclusive (where a and b are both byte sequences of the same length, containing no wildcards, and where a is less than b).
So for me, if the intent is 'neither 0xFE nor 0xFF' then this would be correctly expressed as [!FE:FF].
If [!FEFF] (as I believe it is trying to express) cannot be produced as such by PRONOM/handled as such by DROID, then I would instead look to express that sequence as ...FE[!FF]... which would have the same effect.
fmt/142 signature was added in v58 of PRONOM, March 2012. It would be useful to seek clarity on exactly what was meant here...
Actually, on further reflection, 'FE[!FF]' wouldn't work, and neither would '??[!FF]'. Hmm.
honestly I think [!FEFF] is the best and correct way to express that
Interesting - you are correct that the spec says a sequence of bytes can be inverted, not just a single byte.
I don't think DROID has ever implemented this though. The reason lies in how DROID searches for matches - and this has been true since the earliest versions of DROID, which used a search technique called Boyer-Moore-Horspool. It's always possible to search for "not a byte" - because this is the same as searching for all the other bytes in that position.
However, searching for "not a sequence" isn't possible with these search algorithms. You can't search for "not the first byte" followed by "not the second byte", because it's also valid that the first byte genuinely matches the first byte and only the second byte doesn't match, or vice versa. So you'd have to use a completely different form of searching, and DROID has never done that.
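To make the "not a byte" point concrete, here is a minimal Horspool-style search in which each pattern position is a set of allowed bytes. It is a sketch under my own assumptions (names such as `horspool_classes` and `not_byte` are made up), not the code DROID uses:

```python
from typing import FrozenSet, List

ANY_BYTE = frozenset(range(256))


def not_byte(b: int) -> FrozenSet[int]:
    """'Not this byte' is simply the set of the other 255 byte values."""
    return ANY_BYTE - {b}


def horspool_classes(text: bytes, pattern: List[FrozenSet[int]]) -> int:
    """Return the first offset where every position matches its byte set, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # Shift table keyed on the byte under the last window position.
    shift = [m] * 256
    for i in range(m - 1):
        for b in pattern[i]:
            shift[b] = m - 1 - i
    pos = 0
    while pos + m <= len(text):
        if all(text[pos + i] in pattern[i] for i in range(m)):
            return pos
        pos += shift[text[pos + m - 1]]
    return -1


# 57 followed by "anything but 10"
pattern = [frozenset({0x57}), not_byte(0x10)]
print(horspool_classes(bytes.fromhex("0057105720"), pattern))
```

Every position still answers a yes/no question about one byte, which is what the shift table relies on. A "not this sequence" condition spans several positions at once and has no such per-position set, which is the difficulty described above.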
As another observation, the syntax used by DROID here is the standard square bracket notation used by regular expressions to indicate a set of bytes at a single position, not a sequence of bytes. For that syntax to also encompass sequences as well as sets would be extremely non standard, and certainly not what anyone used to regular expressions would expect.
I could be wrong however. Are there signatures that assume that we are matching a sequence here rather than a set?
If so, the signature will be technically broken (but will still work I guess as it's been tested to match and it will match one of the bytes at that position correctly).
very cool, thanks for the background Matt.
The signature description tells us the intent: 'wformat Tag not equal to decimal 65534' - https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=785&strPageToDisplay=signatures
Of course 65534 is 0xFFFE so I assume endianness is in play here too...
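A quick check of that arithmetic, assuming the tag is stored as a little-endian 16-bit field (as RIFF fields are):

```python
tag = 65534

print(f"{tag:#06x}")                     # 0xfffe
print(tag.to_bytes(2, "little").hex())   # feff
print(tag.to_bytes(2, "big").hex())      # fffe
```

So the FEFF in the signature is the on-disk, little-endian byte order of 0xFFFE.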
Hmmmm..... The one above also talks about sequences and endianness:
[a:b]: wildcard matching any sequence of bytes which lies lexicographically between a and b, inclusive (where both a and b are byte sequences of the same length, containing no wildcards, and where a is less than b). The endian-ness of a and b are the same as the endian-ness of the signature as a whole.
DROID does not currently support anything other than byte values. The spec says we should support 16 bit values (or even 24 bit, or 32 bit... and so on).
If it was determined that DROID does need to support those features as the spec says, that could be a bit tricky. At the very least, the ways that some sequences are searched for and interpreted would need different code.
Ironically, the signatures which are using those features as the spec says are clearly actually mostly working! In the example you give, this is clearly because it will not accept an FF in the position (which is correct), and then there is a variable gap after it, which covers the additional byte in the 16 bit value. It does mean that if you had a file with FE in the first position it wouldn't match it correctly, even though that should match.
Just created some test files, as attached. DROID confirms its behaviour as equivalent to !FE && !FF
Should_be_nope_FEFF.txt Should_be_yep_AA00.txt Should_be_yep_AAFF.txt Should_be_yep_FE00.txt Should_be_yep_FEAA.txt Should_be_yep_FF00.txt
NB these are just skeleton files built to conform to the expected byte sequences, not actual WAV files
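The two readings of [!FEFF] can be compared on the tag bytes named in those files. This is a model of the behaviour described in this thread, not output captured from DROID; the single-byte test is assumed to apply to the first byte of the pair, with the following gap absorbing the second:

```python
TAGS = ["FEFF", "AA00", "AAFF", "FE00", "FEAA", "FF00"]


def intended(tag: bytes) -> bool:
    """Spec reading: the two-byte value is anything except FE FF."""
    return tag != bytes.fromhex("FEFF")


def single_byte_set(tag: bytes) -> bool:
    """Set reading (!FE && !FF): the first byte is neither FE nor FF."""
    return tag[0] not in (0xFE, 0xFF)


for name in TAGS:
    tag = bytes.fromhex(name)
    print(name, intended(tag), single_byte_set(tag))

differing = [n for n in TAGS
             if intended(bytes.fromhex(n)) != single_byte_set(bytes.fromhex(n))]
print(differing)
```

Under these assumptions the two readings disagree on FE00, FEAA and FF00, which are the cases where the set reading rejects a tag the description would accept.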
At this point, I have to say my preference would be to stick with how DROID currently functions and to update the spec!
First, because the signatures we have clearly work, and have worked for over a decade at least. They could be adjusted in simple ways to make them better given this understanding, and I suspect not many signatures actually use that construction.
Second, because adding these features now would actually be a fairly large and tricky piece of work.
Third, because there are no good efficient algorithms to search for "not sequences". You'd literally have to search in every position to determine that at least one byte didn't match in a sequence, even if most of them did.
There may of course be very good reasons to stick with the spec as it is written - I leave those decisions to you!
Not me I don't work for them any more :)
But it's clear fmt/142 isn't working as intended and that there'll be false negatives that are ID'ing as fmt/6 as well as false positives that shouldn't be ID'ing as fmt/142, so this specifically needs a bit of thought. (although from a digital preservation perspective you're unlikely to treat fmt/6 and fmt/142 differently).
The spec itself should be updated in one regard - it still talks about the data model that versions of DROID before v5 used (the positive-specific / positive-tentative) identification of signatures.
That was discarded after v4, as it was felt that we couldn't necessarily assign a degree of confidence to the matches. Yes, container signatures will usually be more accurate than binary signatures, which are probably more accurate than file extensions. But this is still a subjective judgement which may not be true, so instead DROID reported the types of matches it had, and left any value judgements on how good those matches were to the user.
Adjusting those signatures so they don't have false negatives wouldn't be hard. Just take [!XY] and replace it with [!X]??.
The first byte can't be X. The second byte can be any value as long as the first isn't X. This would of course not be as specific as we'd like, but it wouldn't be wrong.
Although... since the existing signatures actually work, you might just get rid of the second byte entirely, or you'd have to adjust any gaps after it (as the current signature is only matching a single byte). So even simpler, replace [!XY] with [!X].
Cool - worth exploring for sure, thank you. Sorry for derailing the [!10] inversion issue!
Perhaps the skeleton suite generator can be updated at some point to provide inverse/negative testing. This seems like a good use-case to create a sample of a format that shouldn't match. It may require reorganizing the output, but should be pretty easy to start writing. I think this is a weakness that was discussed in the original paper, but I think this thread is helpful in seeing how it can be tested.
@steve-daly The signature development utility http://ffdev.info outputs the correct sequence. Trying to follow this thread, your original issue is simply the signature file is generated incorrectly and affects processing of the file? Is the behavior as expected if you use ffdev.info and process your files in DROID?
Looking at fmt/142 in the context of related formats:
fmt/141: 52494646{4}57415645666D7420100000000100{14-*}64617461
(chunk length == 10 (exactly 16 bytes) and wformat tag == 01)
fmt/142: 52494646{4}57415645666D7420[!10]{3}[!FEFF]{16-*}64617461
(chunk length != 10 and wformat tag != FEFF (65534))
fmt/143: 52494646{4}57415645666D7420{4}FEFF{38-*}64617461
(chunk length not given, but wformat tag == FEFF)
So I believe that if fmt/142 was changed to
52494646{4}57415645666D7420[!10]{21-*}64617461
and fmt/143 is given priority over fmt/142, then this would provide the intended identification outcome. In effect, it then doesn't matter what the wformat tag says for fmt/142: if it is FEFF, the file gets the fmt/143 outcome.
I'll test this a bit, but feel free to sanity check my logic...
Thanks for the useful discussion here. As some background, I work for TNA and we're needing to retire the SQL code that generates PRONOM signatures soon. This is being replaced with some Java code, hopefully to achieve the same output, but this casts light on legacy discrepancies and undefined behaviours.
The developers working on this are currently trialling using some DROID code to generate the new PRONOM signature file and it's generally working well but a few differences arise.
I raised this issue specifically about the missing exclamation marks inverting the sense of the range, as it seemed a simple and clear case.
I had spotted the [!FEFF] point and it caused some internal discussion. It's effectively undefined behaviour, and we'll probably want to adjust the signature to use legal syntax, though I can't see how we could get the intended behaviour that way. That said, I'm quite familiar with the WAV format and if fmt/142 is trying to exclude the WAVE_FORMAT_EXTENSIBLE variant, then just doing ??[!FF] would be enough, as there are no other valid options for that particular tag that begin with FF (note the endianness means we're better off checking the second byte). It looks like someone made fmt/143 first (which should detect the Extensible variant) and then tried to invert it to make fmt/142 for all other variants, but the inversion hasn't worked as intended.
If DROID is interpreting as [!FE:FF] then this would accidentally work fine, and we're lucky that it's before a {*} otherwise it would slip by a byte.
My comment above crossed over with David's. That principle of matching the Extensible variant first (fmt/143) and then falling back to all other variants would work fine I think. I'd like to double check the various offset numbers in those signatures as people may have been variably assuming that [!FEFF] represents either one byte or two. It might all be fine, but I'd like to check that {16-*} against the spec and count out that it was previously right, and not assuming two bytes in [!FEFF]
Thanks Steve, from the signature descriptions on PRONOM itself, fmt/141 is assuming chunk size of 16, fmt/142 is assuming 18+, fmt/143 is assuming 40+ so I believe those assumptions hold with wformat assumed as a pair of bytes
I'm in the process of creating a minimal signature file and skeleton test files - shouldn't take long
Glad to hear that the SQL code is being replaced with Java in PRONOM. Generating text expressions using SQL was never a very nice way to do it!
I'm working on extending the test suite for ByteSequenceSerializer, and will provide a PR with fixes and new tests.
Proposed alterations seem to work - new sig for fmt/142 created with http://ffdev.info/ and used existing sigs for fmt/141 and fmt/143 in minimal sig file as attached, adding fmt/143 priority > fmt/142:
wav_test_file.xml.txt (note that Github didn't want me to upload XML file here so appended .txt to the file - do correct that if downloading for testing purposes)
0100_Should_be_fmt-141.txt AA00_Should_be_fmt-142.txt AAFF_Should_be_fmt-142.txt FE00_Should_be_fmt-142.txt FEFF_Should_be_fmt-143.txt FF00_Should_be_fmt-142.txt FFAA_Should_be_fmt-142.txt
Unless anybody finds any other issue with the above, I'll submit a request to PRONOM team to update sig for fmt/142 and priority for fmt/143 this afternoon
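For anyone wanting to sanity check the logic without DROID, the proposal can be modelled roughly with byte regexes. This is only a sketch: the regex translations, the `skeleton` builder and its chunk length and padding values are my own assumptions, not the contents of the attached files, and priority is modelled as simply suppressing fmt/142 when fmt/143 also hits.

```python
import re

HEAD = rb"RIFF.{4}WAVEfmt "
SIGS = {
    # 52494646{4}57415645666D7420100000000100{14-*}64617461
    "fmt/141": re.compile(HEAD + rb"\x10\x00\x00\x00\x01\x00.{14,}?data", re.DOTALL),
    # proposed: 52494646{4}57415645666D7420[!10]{21-*}64617461
    "fmt/142": re.compile(HEAD + rb"[^\x10].{21,}?data", re.DOTALL),
    # 52494646{4}57415645666D7420{4}FEFF{38-*}64617461
    "fmt/143": re.compile(HEAD + rb".{4}\xFE\xFF.{38,}?data", re.DOTALL),
}
PRIORITY = {"fmt/143": {"fmt/142"}}  # winner -> suppressed PUIDs


def identify(data: bytes) -> set:
    """Match every signature at BOF, then apply the priority rule."""
    hits = {puid for puid, rx in SIGS.items() if rx.match(data)}
    for winner, losers in PRIORITY.items():
        if winner in hits:
            hits -= losers
    return hits


def skeleton(chunk_len: int, tag_hex: str, pad: int) -> bytes:
    """Hypothetical skeleton: header, chunk length, tag, zero padding, 'data'."""
    return (b"RIFF" + bytes(4) + b"WAVEfmt " + bytes([chunk_len, 0, 0, 0])
            + bytes.fromhex(tag_hex) + bytes(pad) + b"data")


print(identify(skeleton(0x10, "0100", 14)))
print(identify(skeleton(0x28, "FEFF", 38)))
for tag in ["AA00", "AAFF", "FE00", "FF00", "FFAA"]:
    print(tag, identify(skeleton(0x12, tag, 22)))
```

In this model the FEFF skeleton matches both fmt/142 and fmt/143 before the priority rule is applied, which is why the priority change has to accompany the new fmt/142 signature.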
Oh crikey, check out the signature for x-fmt/223 - https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=315&strPageToDisplay=signatures
Sig description: Identifier (0x1991), width of image not equal 320, height of image not equal to 200, X offset not equal to zero, Y offset of image not equal to zero.
Signature:
1991[!4001C80000000000]
Intent here is also clearly !byte-sequence rather than !byte-range. I'll see what else is affected
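Decoding the excluded bytes supports that reading. Assuming four little-endian 16-bit fields (my inference from the description, not something stated in the signature), they unpack to exactly the values the description names:

```python
import struct

excluded = bytes.fromhex("4001C80000000000")
width, height, x_offset, y_offset = struct.unpack("<4H", excluded)
print(width, height, x_offset, y_offset)  # 320 200 0 0
```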
full list of use of !byte-sequence patterns, accurate to v107:
x-fmt/223 - 1991[!4001C80000000000]
fmt/142 - 52494646{4}57415645666D7420[!10]{3}[!FEFF]{16-*}64617461
x-fmt/12 - 31BE000000AB0000000000000000{82}[!0000]
x-fmt/4 - 32BE000000AB0000000000000000{82}[!0000]
and then these formats use [!00] (for which I believe DROID is behaving as desired, even if sigtool isn't): fmt/363 x-fmt/342 fmt/375 x-fmt/150 fmt/386 fmt/1092
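A list like this could be regenerated for later signature file versions with a simple pattern scan. The sketch below is one possible approach, not the method used to produce the list above: it flags any `[!...]` whose body is four or more hex digits, so single-byte forms such as [!10] and ranges such as [!10:12] are ignored.

```python
import re

# An inverted group holding two or more bytes, e.g. [!FEFF] or [!0000].
MULTI_BYTE_INVERSION = re.compile(r"\[!([0-9A-Fa-f]{4,})\]")


def multi_byte_inversions(signature: str) -> list:
    """Return the hex bodies of every multi-byte inverted group."""
    return MULTI_BYTE_INVERSION.findall(signature)


signatures = {
    "x-fmt/223": "1991[!4001C80000000000]",
    "fmt/142": "52494646{4}57415645666D7420[!10]{3}[!FEFF]{16-*}64617461",
    "x-fmt/12": "31BE000000AB0000000000000000{82}[!0000]",
}
for puid, sig in signatures.items():
    print(puid, multi_byte_inversions(sig))
```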
Note that Siegfried correctly handles the original intent with the original skeleton test files I created: each identifies as fmt/142 except for Should_be_nope_FEFF.txt, which falls back to fmt/6 because it is too short to ID as fmt/143. Ref. comment https://github.com/digital-preservation/droid/issues/805#issuecomment-1219951729
This is based on current DROID file, not the minimal one I created for testing an alternative fmt/142 sig
Update on progress.
I think I've fixed the inverted byte and inverted range syntax now. I've also written tests to cover all the other standard syntax, and this revealed some other minor issues (like spacing of some elements), so they're fixed too.
Along the way, I've come across a sequence that fails my tests, but it turns out I don't think PRONOM supports this anyway.
If you ask to match AA ?? ?? ?? ??, you will only get AA out. This is because PRONOM insists that fragments must contain unambiguous byte sequences, and those should be split up using ?? or {n-m}. So if you have a wildcard fragment to the right of AA, there is no other byte sequence to the right of the ??s. So they just get discarded. The specification at https://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf in section 2.2.4 seems fairly clear.
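The outcome described can be pictured as dropping wildcards that have no byte sequence to their right. The sketch below only models that end result on a token list (the tokenisation and function name are hypothetical); it is not how the serializer is implemented:

```python
def drop_trailing_wildcards(tokens):
    """Discard '??' tokens that are not followed by any concrete bytes."""
    kept = list(tokens)
    while kept and kept[-1] == "??":
        kept.pop()
    return kept


print(drop_trailing_wildcards(["AA", "??", "??", "??", "??"]))
print(drop_trailing_wildcards(["AA", "??", "BB"]))
```

Wildcards between two byte sequences survive, since there is something to the right of them to anchor a fragment.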
So I'm almost done. I'll do a final round of review and tidying up, then the PR should be ready.