siegfried
Misidentification / FIDO vs. Siegfried
Hi Richard, Here are two compared outputs of this attached Vector Image. In Archivematica, Siegfried is defaulting to identification of this .svg as Generic TXT, which is a problem mainly because the format normalization policies are different (and also it's just incorrect). FIDO, however, ID's this file correctly. See below:
SIEGFRIED OUTPUT -- Siegfried in Archivematica defaults to ID'ing as TXT:

archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ sf '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
---
siegfried   : 1.1.0
scandate    : 2016-03-29T13:10:23-04:00
signature   : archivematica.sig
created     : 2015-05-16T20:44:59+10:00
identifiers :
  - name    : 'archivematica'
    details : 'DROID_SignatureFile_V82.xml; container-signature-20150327.xml; extensions: archivematica-fmt2.xml, archivematica-fmt3.xml, archivematica-fmt4.xml, archivematica-fmt5.xml'
filename : '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
filesize : 1629
errors   :
matches  :
  - id      : archivematica
    puid    : UNKNOWN
    format  :
    version :
    mime    :
    basis   :
    warning : 'no match; possibilities based on extension are fmt/91, fmt/92, fmt/413'
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ ^C
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$
FIDO OUTPUT - in Archivematica, ID's as SVG:

archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ fido '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
FIDO v1.3.1 (formats-v81.xml, container-signature-20130501.xml, format_extensions.xml)
bad repeat interval
bad repeat interval
bad repeat interval
OK,95,fmt/92,"Scalable Vector Graphics","External",1629,"/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg","image/svg+xml","extension"
OK,95,fmt/413,"Scalable Vector Graphics Tiny","External",1629,"/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg","None","extension"
FIDO: Processed 1 files in 129.50 msec, 8 files/sec
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$
Archivematica Report:

IDCommand UUID: 8cc792b4-362d-4002-8981-a4e808c04b24
File: (17776d39-5796-4f37-8a1e-40706fd40e8a) /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/FPR_Test_SVG-487653f8-6df5-4835-b30a-90f45b65ff3e/objects/green-blue-70220cf9da9f0b6cff6086e78b69ddfb-2.svg
x-fmt/111
Command output: x-fmt/111
/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/FPR_Test_SVG-487653f8-6df5-4835-b30a-90f45b65ff3e/objects/green-blue-70220cf9da9f0b6cff6086e78b69ddfb-2.svg identified as a Generic TXT
Attached file: green-blue.svg.zip
Hi Genevieve, thanks very much for this detailed report.
The underlying problem here is the PRONOM byte signatures. The SVG signatures (for PUIDs fmt/91, fmt/92 and fmt/413) all take the form <?xml version="1.0"*<svg. Your sample file begins with <svg and is missing that <?xml declaration, so none of the byte signatures match.
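To make the mismatch concrete, here is a minimal Python sketch. The regular expression is only a simplified stand-in for the signature form quoted above (an XML declaration at the start of the file, any bytes, then <svg); it is an assumption for illustration, not the actual PRONOM signature definition or siegfried's matcher, and the two sample byte strings are made up.

```python
import re

# Simplified stand-in for the signature form <?xml version="1.0"*<svg.
# Assumption: the real PRONOM signatures carry more detail than this regex.
PRONOM_LIKE_SVG = re.compile(rb'<\?xml version="1\.0".*<svg', re.DOTALL)


def matches_pronom_like_svg(data: bytes) -> bool:
    """True if data starts with an XML declaration that is later followed by <svg."""
    # re.match anchors at the beginning of the data, like a start-of-file signature.
    return PRONOM_LIKE_SVG.match(data) is not None


# Hypothetical samples: one with an XML declaration, one starting directly with <svg.
with_decl = b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg"></svg>'
without_decl = b'<svg xmlns="http://www.w3.org/2000/svg"></svg>'

print(matches_pronom_like_svg(with_decl))     # True
print(matches_pronom_like_svg(without_decl))  # False
```

A file that starts directly with <svg fails this kind of match even though it is perfectly valid SVG.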
Fido reports two SVG results (fmt/92 and fmt/413), but not on the basis of a byte match, just on the extension matching (note the "extension" in the last field of each result line in your Fido report). Siegfried is more conservative than Fido when reporting extension matches: if an extension matches a format, but that format has a byte signature that hasn't matched, then siegfried won't return a result. Instead it gives UNKNOWN and lists the possible extension matches in the warning field. The rationale is that where the extension says one thing and the file contents say another, it is safest for users to inspect the file and verify.
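The rule described above can be sketched roughly as follows. This is an illustrative Python model only, not siegfried's actual code (siegfried is written in Go); the function name, its parameters and the "match on extension only" wording for formats without byte signatures are assumptions made for the example.

```python
from typing import List, Set, Tuple


def report(byte_matches: List[str],
           extension_matches: List[str],
           has_byte_signature: Set[str]) -> Tuple[str, str]:
    """Return (puid, warning) under the conservative extension rule (hypothetical model)."""
    # A byte match always wins.
    if byte_matches:
        return byte_matches[0], ""
    # Assumption: an extension hit is only trusted when the format has
    # no byte signature that could have contradicted it.
    trusted = [p for p in extension_matches if p not in has_byte_signature]
    if trusted:
        return trusted[0], "match on extension only"
    # Extension hits exist, but their byte signatures did not match:
    # report UNKNOWN and list the candidates in the warning.
    if extension_matches:
        return "UNKNOWN", ("no match; possibilities based on extension are "
                           + ", ".join(extension_matches))
    return "UNKNOWN", "no match"


svg_puids = ["fmt/91", "fmt/92", "fmt/413"]
print(report([], svg_puids, set(svg_puids)))
```

With the three SVG PUIDs as extension candidates and no byte match, this model returns UNKNOWN with the same warning text that appears in the siegfried output in the report.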
I think the best solution here is to request an update of the SVG signatures, so that they don't require an xml declaration to match. You can request changes to PRONOM using this form: https://apps.nationalarchives.gov.uk/PRONOM/submitinfo.htm. I've just made this request to the TNA and hopefully this will be amended in a future release of PRONOM.
Another (future) option might be to use a MIME-Info signature file with siegfried. The latest release of siegfried added this option so that you can now choose to use the signatures from the Apache Tika or Freedesktop.org projects, rather than PRONOM. These signatures have better XML detection than PRONOM (because they have signature types that look for the root tags and namespaces of XML files, rather than just treating them as byte streams) and so are more reliable for formats like SVG. The "Try Siegfried" demonstrator at http://www.itforarchivists.com/siegfried now gives both PRONOM and Tika results when it scans, and you can see for your sample file that while it is UNKNOWN for PRONOM, the Tika identifier gives a correct match.
This is a new feature of siegfried that hasn't found its way into archivematica yet, but I'm hopeful that in future releases of archivematica you'll get more options about how siegfried is configured.
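As a rough illustration of why root-tag matching is more robust for this kind of file, here is a small Python sketch that identifies SVG by its root element and namespace instead of by leading bytes. It uses the standard library XML parser and is not how Tika or siegfried implement their matching; the function names and the strict namespace check are assumptions, and the sample byte string is made up.

```python
import io
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

SVG_NAMESPACE = "http://www.w3.org/2000/svg"


def root_element(data: bytes) -> Optional[Tuple[str, str]]:
    """Return (namespace, local_name) of the root element, or None if data is not XML."""
    try:
        # The first "start" event is always the root element.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            tag = elem.tag
            if tag.startswith("{"):
                namespace, _, local = tag[1:].partition("}")
                return namespace, local
            return "", tag
    except ET.ParseError:
        return None
    return None


def looks_like_svg(data: bytes) -> bool:
    """True if the root element is svg in the SVG namespace (assumed rule)."""
    return root_element(data) == (SVG_NAMESPACE, "svg")


# Hypothetical sample with no XML declaration.
sample = b'<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"></svg>'
print(looks_like_svg(sample))  # True
```

Because the check looks at the parsed root element, it does not matter whether the file starts with an XML declaration.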
I hope this all makes sense, and thanks again for the report,
cheers Richard
Thanks for making that request with PRONOM - and for clarifying things!
Just to note - this is still on our backlog and I'll try to address it in our next release, probably around late-May
The latest siegfried now identifies this sample .svg as 'UNKNOWN'!
I did make a PRONOM request for this in 2016, but the PRONOM signatures all still require an xml declaration that is missing from this file.
So the ID remains unknown and this issue remains open :( I do have an idea to convert PRONOM xml signatures to proper XML signatures that can be matched by siegfried's XML matching algo (which is only used for mime-info signatures like tika at present). That's a piece of work I'm yet to get around to. But it would resolve this issue independent of a PRONOM change and would make PRONOM xml matching better across the board.
Apologies Richard, my fault entirely. I'm aiming for November for v95 and this update will be included.
@Dclipsham You are on FIRE!!!