ZIP container wrongly identified because of its content
A ZIP file was identified not only as ZIP (correct) but also as PDF (wrong). This appears to happen because there is a PDF file inside the ZIP container. DROID reports the PDF signature in addition to the ZIP signature.
I just downloaded the newest DROID version, 6.4, and am running it on Windows 7. The problem occurs with "Maximum Bytes to Scan" set to -1 and with the default value. It also happens with DROID in Rosetta.
I show a screenshot of Droid here:
The example file is called "SupplementaryMaterials.zip" (189 MB). You may download it from https://www.research-collection.ethz.ch/handle/20.500.11850/200243
We have a similar problem with a LaTeX file inside a ZIP file (fmt/280 (LaTeX) inside an x-fmt/263 (ZIP)), but the corresponding example file is not open access.
There used to be a similar issue with Rosetta-DROID: https://basecamp.com/2275980/projects/12621845/messages/64684092 . However, this issue should have been solved with the Format Library Update in March 2017 (v. 5.1088).
Thank you for this report, Roland. I have downloaded the file. The problem seems to occur where the content of the ZIP file is stored uncompressed and includes a file whose PRONOM signature allows a variably positioned BOF fragment. In this instance the PDF tag begins at offset 132, and the corresponding signature expects to see it within 144 bytes. I suspect it will be similar for the LaTeX file, whose identification signature also allows for a variably positioned BOF fragment.
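To illustrate why an offset like 132 can arise, here is a minimal Python sketch using the standard `zipfile` module. It is not DROID code, and the member name and payload are made up for the demonstration. A ZIP local file header has a 30-byte fixed part followed by the file name, so with a stored (uncompressed) member and no extra field, the member's first bytes land at offset 30 + name length. A 102-character name therefore puts the `%PDF-` tag at offset 132.

```python
import io
import zipfile

LOCAL_HEADER_SIZE = 30  # fixed part of a ZIP local file header, before the file name


def build_zip(member_name: str, payload: bytes, method: int) -> bytes:
    """Build an in-memory ZIP with a single member using the given compression method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr(member_name, payload)
    return buf.getvalue()


# Hypothetical test data: a fake PDF header plus filler, and a 102-character name.
payload = b"%PDF-1.4\n" + b"0" * 1000
name = "a" * 98 + ".pdf"

stored = build_zip(name, payload, zipfile.ZIP_STORED)
deflated = build_zip(name, payload, zipfile.ZIP_DEFLATED)

# Position of the PDF tag in the raw container bytes (-1 if it is not visible).
stored_offset = stored.find(b"%PDF-")
deflated_offset = deflated.find(b"%PDF-")
print(stored_offset, deflated_offset)
```

With the stored member the tag is visible in the raw bytes of the container, inside the window a variable-offset BOF signature would inspect. With the deflated member the tag is not present verbatim, which is consistent with the problem only showing up for uncompressed content.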
This specific instance could likely be solved with additional priority setting within PRONOM but I'll discuss it internally in case there's an additional more efficient solution we could apply.
David
We just got a zipped LaTeX file (fmt/280) with the same issue. The attached file is classified by DROID as ZIP and LaTeX, but should be classified as ZIP only. bg-8-1181-2011_LaTeX_aufMac_vonAna_11Jan_20018_entpacktAufPC.zip
In general, all archive formats which don't obscure their contents will have this issue. The binary signature will often match both the containing format and the contents (sometimes depending on how far into the file the contents appear).
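A rough sketch of that effect, in Python. The `Signature` type, the two entries in `SIGNATURES`, and the reading of "within 144 bytes" as a maximum start offset are all assumptions made for illustration; they are not PRONOM's actual signature definitions or DROID's matching engine.

```python
from typing import NamedTuple


class Signature(NamedTuple):
    name: str
    pattern: bytes
    max_offset: int  # assumed meaning: pattern may start anywhere in [0, max_offset]


# Hypothetical, simplified signatures (not the real PRONOM entries).
SIGNATURES = [
    Signature("ZIP", b"PK\x03\x04", 0),
    Signature("PDF", b"%PDF-", 144),
]


def naive_matches(data: bytes) -> list[str]:
    """Return every format whose pattern starts within its allowed BOF window."""
    hits = []
    for sig in SIGNATURES:
        # bytes.find only reports matches lying entirely inside data[0:end].
        end = sig.max_offset + len(sig.pattern)
        if data.find(sig.pattern, 0, end) != -1:
            hits.append(sig.name)
    return hits


# Contents start early in the container: both signatures match.
early = b"PK\x03\x04" + b"\x00" * 128 + b"%PDF-1.4\n"
# Contents start further in: only the container signature matches.
late = b"PK\x03\x04" + b"\x00" * 500 + b"%PDF-1.4\n"

print(naive_matches(early), naive_matches(late))
```

The second case shows the dependence on how far into the file the contents appear: the same embedded file stops matching once it falls outside the signature's window.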
It's not necessarily a good idea to make a ZIP file take priority over a PDF file, or a LaTeX file, or potentially any other kind of file. At least, it's not a good idea to do this on a format-by-format basis, gradually changing format priorities as different combinations are encountered in the wild.
One way might be to give archive file formats an automatically privileged status. Once you know something is an archive file format, and you can parse the file as that format, it's that file format (no matter what else you find in it - unless it's a container format of course).
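As a sketch of that idea (not how DROID resolves hits today), something like the following Python could work. `ARCHIVE_FORMATS`, `parses_as_zip`, and `resolve` are hypothetical names, the candidate list is assumed to come from an earlier signature-matching step, and the container-format exception is deliberately left out.

```python
import io
import zipfile

# Hypothetical set of formats that get privileged treatment.
ARCHIVE_FORMATS = {"ZIP"}


def parses_as_zip(data: bytes) -> bool:
    """True if the bytes open as a ZIP archive and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


def resolve(candidates: list[str], data: bytes) -> list[str]:
    """Keep only the archive hit when the file really parses as that archive."""
    archive_hits = [c for c in candidates if c in ARCHIVE_FORMATS]
    if archive_hits and parses_as_zip(data):
        return archive_hits
    # Otherwise leave the candidates alone. Container formats built on ZIP
    # would need their own handling here, which this sketch does not attempt.
    return candidates


# A real (stored) ZIP holding a fake PDF, and a byte string that only looks like one.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("paper.pdf", b"%PDF-1.4\n")
real_zip = buf.getvalue()
fake_zip = b"PK\x03\x04 not really an archive %PDF-1.4\n"

print(resolve(["ZIP", "PDF"], real_zip), resolve(["ZIP", "PDF"], fake_zip))
```

The point of parsing rather than trusting the signature alone is that the privilege is only granted when the archive structure checks out, so a file that merely starts with the ZIP magic bytes keeps all of its candidate identifications.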