Fido identifying some XLSX, PPTX, and DOCX as fido-fmt/{x}
Dev Effort
1D
Description
Via @sromkey: the MS-Office Open XML files in this Archivematica test data zip are being identified as fido-fmt/{x} in Fido:
ross-spencer@artefactual:~/git/artefactual-labs/am/src/archivematica-sampledata/SampleTransfers/OfficeDocsExtracted/objects$ fido *
FIDO v1.3.12 (formats-v94.xml, container-signature-20180920.xml, format_extensions.xml)
OK,14,fido-fmt/189.ppt,"Microsoft Office Open XML - Powerpoint","Microsoft Office Open XML - Powerpoint",47215,"MS-OfficeOpenXML-samples/samplepptx.pptx","None","signature"
OK,10,fido-fmt/189.word,"Microsoft Office Open XML - Word","Microsoft Office Open XML - Word",14860,"MS-OfficeOpenXML-samples/sampledocx.docx","None","signature"
OK,11,fido-fmt/189.xl,"Microsoft Office Open XML - Excel","Microsoft Office Open XML - Excel",12050,"MS-OfficeOpenXML-samples/samplexlsx.xlsx","None","signature"
FIDO: Processed 9 files in 343.28 msec, 26 files/sec
If the fido-fmt/{x} entries are removed as described here: https://github.com/openpreserve/fido/issues/36#issuecomment-23932419 then the closest match seems to be generic OOXML:
ross-spencer@artefactual:~/Desktop/temp/ndsa/office-samples-and-skeletons/samples$ fido *
FIDO v1.3.12 (formats-v94.xml, container-signature-20180920.xml, format_extensions.xml)
OK,150,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",14860,"sampledocx.docx","None","signature"
OK,8,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",47215,"samplepptx.pptx","None","signature"
OK,9,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",12050,"samplexlsx.xlsx","None","signature"
FIDO: Processed 3 files in 206.92 msec, 14 files/sec
Unfortunately, the Skeleton Suite looks like it won't help with debugging here, as the extracted skeleton files (three per PUID) all identify correctly.
I have extracted the samples and the skeleton files here for easy access.
NB. Also noted by Sarah is that Siegfried will identify the formats correctly:
ross-spencer@artefactual:~/git/artefactual-labs/am/src/archivematica-sampledata/SampleTransfers/OfficeDocsExtracted/objects$ sf *
---
siegfried : 1.7.11
scandate : 2019-02-24T12:22:11+01:00
signature : default.sig
created : 2019-02-16T11:10:03+01:00
identifiers :
- name : 'pronom'
details : 'DROID_SignatureFile_V94.xml; container-signature-20180917.xml'
---
filename : 'MS-OfficeOpenXML-samples/sampledocx.docx'
filesize : 14860
modified : 2007-08-14T23:29:00+02:00
errors :
matches :
- ns : 'pronom'
id : 'fmt/412'
format : 'Microsoft Word for Windows'
version : '2007 onwards'
mime : 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
basis : 'extension match docx; container name [Content_Types].xml with byte match at 460, 94 (signature 1/3)'
warning :
---
filename : 'MS-OfficeOpenXML-samples/samplepptx.pptx'
filesize : 47215
modified : 2007-08-14T23:51:16+02:00
errors :
matches :
- ns : 'pronom'
id : 'fmt/215'
format : 'Microsoft Powerpoint for Windows'
version : '2007 onwards'
mime : 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
basis : 'extension match pptx; container name [Content_Types].xml with byte match at 2326, 96 (signature 1/3)'
warning :
---
filename : 'MS-OfficeOpenXML-samples/samplexlsx.xlsx'
filesize : 12050
modified : 2007-08-14T23:50:24+02:00
errors :
matches :
- ns : 'pronom'
id : 'fmt/214'
format : 'Microsoft Excel for Windows'
version : '2007 onwards'
mime : 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
basis : 'extension match xlsx; container name [Content_Types].xml with byte match at 676, 88 (signature 1/3)'
warning :
---
If this error affects only this handful of files and not other files of the same types, would it be a better outcome for FIDO to return the generic Microsoft OOXML match with the standard PRONOM ID fmt/189, rather than the custom fido-fmt ID?
I'm asking because I get the same results on master but don't have the bandwidth to fully investigate, change, and test a larger solution for these Microsoft files. I can, however, remove the custom fido-fmt entries, which will produce fmt/189 results (better for preservation?).
@carlwilson I investigated this using commit https://github.com/openpreserve/fido/commit/6211d663fd933dcb5e14bada86ed40281ab816b8 of the rc/1.6 branch, with signature versions FIDO v1.4.1 (formats-v97.xml, container-signature-20200121.xml, format_extensions.xml), and the office-samples-and-skeletons.zip file shared by Ross.
From what I can see, for the files in the office-samples-and-skeletons/samples directory fido finds three signatures. For example, for the samplexlsx.xlsx file in it, the match_formats method initially gets a list similar to:
[('x-fmt/263', 'ZIP format'), ('fmt/189', 'Microsoft Office Open XML'), ('fido-fmt/189.xl', 'Microsoft Office Open XML - Excel')]
Then the priority logic determines that ('fido-fmt/189.xl', 'Microsoft Office Open XML - Excel') from the format_extensions.xml file is the best match.
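To make the effect of that priority step concrete, here is a minimal sketch of how a list of candidate matches could be narrowed by "has priority over" relations. It is not fido's actual code: the function name resolve_priority and the PRIORITY_OVER table are hypothetical, and the relations in the table are assumed purely so that the outcome matches the behaviour described above.

```python
# Hypothetical priority table: each PUID maps to the PUIDs it is assumed to outrank.
# These relations are an assumption for illustration, not copied from fido's signature files.
PRIORITY_OVER = {
    "fmt/189": {"x-fmt/263"},
    "fido-fmt/189.xl": {"fmt/189", "x-fmt/263"},
}


def resolve_priority(matches, priority_over):
    """Drop every match that is outranked by another match in the same list."""
    matched = {puid for puid, _ in matches}
    superseded = set()
    for puid in matched:
        # Only relations between formats that actually matched are relevant.
        superseded |= priority_over.get(puid, set()) & matched
    return [(puid, name) for puid, name in matches if puid not in superseded]


matches = [
    ("x-fmt/263", "ZIP format"),
    ("fmt/189", "Microsoft Office Open XML"),
    ("fido-fmt/189.xl", "Microsoft Office Open XML - Excel"),
]

# With the custom entry present, it outranks the other two candidates.
print(resolve_priority(matches, PRIORITY_OVER))

# Without the custom entry, the generic fmt/189 match is what remains.
print(resolve_priority(matches[:2], PRIORITY_OVER))
```

Under these assumed relations, removing the fido-fmt entry leaves fmt/189 as the only surviving candidate, which is consistent with the fmt/189 results shown earlier in the thread.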
The difference with the files in the office-samples-and-skeletons/skeleton directory is that only one signature is found, which makes fido detect the formats using the container signature file container-signature-20200121.xml instead. For example, for the fmt-214-container-signature-id-2030.xlsx file in it, the match_formats method gets a list similar to:
[('x-fmt/263', 'ZIP format')]
From it, a container type of ZIP is determined, and the format is then obtained from the [Content_Types].xml file contained in the xlsx file.
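The container route can be illustrated with a small standalone sketch that opens a ZIP, reads [Content_Types].xml, and looks for a content-type string. This is a simplification and not how fido or the PRONOM container signatures are implemented: the MARKERS table, identify_ooxml, and make_sample are hypothetical names, and the byte strings are shortened stand-ins for the real container signature sequences. The PUIDs are the ones Siegfried reported above.

```python
import io
import zipfile

# Hypothetical, simplified markers; the real container signatures differ.
MARKERS = {
    b"wordprocessingml.document.main+xml": "fmt/412",
    b"presentationml.presentation.main+xml": "fmt/215",
    b"spreadsheetml.sheet.main+xml": "fmt/214",
}


def identify_ooxml(data):
    """Return a PUID based on [Content_Types].xml, or None if nothing matches."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            if "[Content_Types].xml" not in zf.namelist():
                return None
            content_types = zf.read("[Content_Types].xml")
    except zipfile.BadZipFile:
        # Not a ZIP container at all.
        return None
    for marker, puid in MARKERS.items():
        if marker in content_types:
            return puid
    return None


def make_sample(content_type):
    """Build a tiny in-memory ZIP holding only a [Content_Types].xml entry."""
    xml = '<Types><Override PartName="/part.xml" ContentType="%s"/></Types>' % content_type
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", xml)
    return buf.getvalue()


sample = make_sample(
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"
)
print(identify_ooxml(sample))
```

The point of the sketch is only that this path never consults the byte signatures in format_extensions.xml, which would explain why the skeleton files avoid the fido-fmt results.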
Do you have any advice on how to proceed with this?
Hackathon 2023 Review: Selected for initial tasks. @replaceafill, sorry to do this again, but you're already here. I suggest prioritising this over #94, as it's likely a quicker win.
@carlwilson if we remove these custom fido-fmt/... entries from format_extensions.xml to get fmt/189 for all the mentioned sample files, as explained by Ross and Ashley above, what would be an appropriate way to write a test for that?
That's a good question @replaceafill, and one I'm a little too busy to think about right now. Feel free to have a think and suggest something; if not, I'll give this some serious thought in the week starting 31/7.
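One possible starting point, offered only as a sketch: parse fido's CSV output and assert that none of the sample files receive a fido-fmt PUID. The helper puids_by_file and the test name are hypothetical, the column positions are inferred from the output pasted earlier in this thread, and the sample output is hard-coded here; a real test would need to capture the output of running fido against the sample files instead.

```python
import csv

# Output copied from the run above with the custom fido-fmt entries removed.
SAMPLE_OUTPUT = '''\
OK,150,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",14860,"sampledocx.docx","None","signature"
OK,8,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",47215,"samplepptx.pptx","None","signature"
OK,9,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",12050,"samplexlsx.xlsx","None","signature"
'''


def puids_by_file(fido_output):
    """Map filename -> PUID for every OK row (columns inferred from the output above)."""
    results = {}
    for row in csv.reader(fido_output.splitlines()):
        if len(row) >= 7 and row[0] == "OK":
            results[row[6]] = row[2]
    return results


def test_ooxml_samples_get_pronom_puid():
    results = puids_by_file(SAMPLE_OUTPUT)
    assert set(results) == {"sampledocx.docx", "samplepptx.pptx", "samplexlsx.xlsx"}
    for filename, puid in results.items():
        # Custom identifiers should no longer appear for these files.
        assert not puid.startswith("fido-fmt/"), filename
        assert puid == "fmt/189", filename


test_ooxml_samples_get_pronom_puid()
```

Asserting on the PUID per file, rather than on the whole output line, keeps the test independent of timing values and of the signature file versions printed in the header.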