Fido identifying some XLSX, PPTX, and DOCX as fido-fmt/{x}

Open ross-spencer opened this issue 5 years ago • 5 comments

Dev Effort

1D

Description

Via @sromkey the MS-Office Open XML files in this Archivematica test data zip are being identified as fido-fmt/{x} in Fido:

ross-spencer@artefactual:~/git/artefactual-labs/am/src/archivematica-sampledata/SampleTransfers/OfficeDocsExtracted/objects$ fido *
FIDO v1.3.12 (formats-v94.xml, container-signature-20180920.xml, format_extensions.xml)
OK,14,fido-fmt/189.ppt,"Microsoft Office Open XML - Powerpoint","Microsoft Office Open XML - Powerpoint",47215,"MS-OfficeOpenXML-samples/samplepptx.pptx","None","signature"
OK,10,fido-fmt/189.word,"Microsoft Office Open XML - Word","Microsoft Office Open XML - Word",14860,"MS-OfficeOpenXML-samples/sampledocx.docx","None","signature"
OK,11,fido-fmt/189.xl,"Microsoft Office Open XML - Excel","Microsoft Office Open XML - Excel",12050,"MS-OfficeOpenXML-samples/samplexlsx.xlsx","None","signature"
FIDO: Processed      9 files in 343.28 msec, 26 files/sec

If the fido-fmt/{x} entries are removed as described in https://github.com/openpreserve/fido/issues/36#issuecomment-23932419, the closest match seems to be generic OOXML:

ross-spencer@artefactual:~/Desktop/temp/ndsa/office-samples-and-skeletons/samples$ fido *
FIDO v1.3.12 (formats-v94.xml, container-signature-20180920.xml, format_extensions.xml)
OK,150,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",14860,"sampledocx.docx","None","signature"
OK,8,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",47215,"samplepptx.pptx","None","signature"
OK,9,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",12050,"samplexlsx.xlsx","None","signature"
FIDO: Processed      3 files in 206.92 msec, 14 files/sec

Unfortunately, the Skeleton Suite looks like it won't help with debugging here, as the extracted samples (three per PUID) all identify correctly.

I have extracted the samples and the skeleton files here for easy access.

NB. Sarah also noted that Siegfried identifies the formats correctly:

ross-spencer@artefactual:~/git/artefactual-labs/am/src/archivematica-sampledata/SampleTransfers/OfficeDocsExtracted/objects$ sf *
---
siegfried   : 1.7.11
scandate    : 2019-02-24T12:22:11+01:00
signature   : default.sig
created     : 2019-02-16T11:10:03+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V94.xml; container-signature-20180917.xml'
---
filename : 'MS-OfficeOpenXML-samples/sampledocx.docx'
filesize : 14860
modified : 2007-08-14T23:29:00+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/412'
    format  : 'Microsoft Word for Windows'
    version : '2007 onwards'
    mime    : 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    basis   : 'extension match docx; container name [Content_Types].xml with byte match at 460, 94 (signature 1/3)'
    warning : 
---
filename : 'MS-OfficeOpenXML-samples/samplepptx.pptx'
filesize : 47215
modified : 2007-08-14T23:51:16+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/215'
    format  : 'Microsoft Powerpoint for Windows'
    version : '2007 onwards'
    mime    : 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
    basis   : 'extension match pptx; container name [Content_Types].xml with byte match at 2326, 96 (signature 1/3)'
    warning : 
---
filename : 'MS-OfficeOpenXML-samples/samplexlsx.xlsx'
filesize : 12050
modified : 2007-08-14T23:50:24+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/214'
    format  : 'Microsoft Excel for Windows'
    version : '2007 onwards'
    mime    : 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    basis   : 'extension match xlsx; container name [Content_Types].xml with byte match at 676, 88 (signature 1/3)'
    warning : 
---

ross-spencer avatar Feb 24 '19 12:02 ross-spencer

If this is an error for this handful of files but not for other files of these types, would it be a better outcome for FIDO to return the generic Microsoft OOXML with the standard PRONOM fmt/189 ID, rather than the custom fido-fmt ID?

Asking because I get the same results in master but don't necessarily have the bandwidth to fully investigate, change, and test a larger solution for these Microsoft files. I can, however, remove the custom fido-fmts, which will produce fmt/189 results (better for preservation..?)

ablwr avatar Oct 29 '19 17:10 ablwr

@carlwilson I investigated this using commit https://github.com/openpreserve/fido/commit/6211d663fd933dcb5e14bada86ed40281ab816b8 of the rc/1.6 branch using signature versions FIDO v1.4.1 (formats-v97.xml, container-signature-20200121.xml, format_extensions.xml) and the office-samples-and-skeletons.zip file shared by Ross.

From what I can see, fido finds three signatures for the files in the office-samples-and-skeletons/samples directory. For example, for the samplexlsx.xlsx file, the match_formats method initially gets a list similar to:

 [('x-fmt/263', 'ZIP format'), ('fmt/189', 'Microsoft Office Open XML'), ('fido-fmt/189.xl', 'Microsoft Office Open XML - Excel')]

Then the priority logic determines that ('fido-fmt/189.xl', 'Microsoft Office Open XML - Excel') from the format_extensions.xml file is the best match.
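
To illustrate the behaviour described above, here is a minimal sketch of priority-based filtering. It is not fido's actual code: the PRIORITY_OVER table and the resolve function are hypothetical names, and the priority relationships are assumed from the observed result rather than copied from the signature files.

```python
# Hypothetical priority table (an assumption, not copied from fido's
# signature files): each key is taken to outrank every PUID in its set.
PRIORITY_OVER = {
    "fmt/189": {"x-fmt/263"},
    "fido-fmt/189.xl": {"fmt/189", "x-fmt/263"},
}


def resolve(matches, priority_over=PRIORITY_OVER):
    """Drop every match that is outranked by another match in the same list."""
    matched_puids = {puid for puid, _ in matches}
    outranked = set()
    for puid in matched_puids:
        outranked |= priority_over.get(puid, set())
    return [m for m in matches if m[0] not in outranked]


matches = [
    ("x-fmt/263", "ZIP format"),
    ("fmt/189", "Microsoft Office Open XML"),
    ("fido-fmt/189.xl", "Microsoft Office Open XML - Excel"),
]

# With the custom entry present, it is the only survivor.
print(resolve(matches))
# [('fido-fmt/189.xl', 'Microsoft Office Open XML - Excel')]

# Without the custom entry, the generic OOXML match wins instead.
print(resolve(matches[:2]))
# [('fmt/189', 'Microsoft Office Open XML')]
```

Under these assumptions, dropping the custom entries leaves fmt/189 as the winner, which matches the output Ross posted after removing them.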

The difference with the files in the office-samples-and-skeletons/skeleton directory is that only one signature is found, which makes fido detect the formats using the container signature file container-signature-20200121.xml instead. For example, for the fmt-214-container-signature-id-2030.xlsx file, the match_formats method gets a list similar to:

[('x-fmt/263', 'ZIP format')]

From this, a container type of ZIP is determined, and the format is then taken from the [Content_Types].xml file contained in the xlsx file.
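
A rough sketch of what that container step amounts to, assuming the signatures key on the main-part content type inside [Content_Types].xml. This does not use fido's own code: identify_ooxml and make_skeleton are hypothetical helpers, the PUIDs are the ones Siegfried reported earlier in this issue, and the content-type strings are the standard OOXML ones rather than values read from the container signature file.

```python
import io
import zipfile

# PUIDs as reported by Siegfried earlier in this issue. The content-type
# fragments are standard OOXML main-part types, assumed (not verified) to
# be what the container signatures look for.
CONTENT_TYPE_TO_PUID = {
    b"spreadsheetml.sheet.main+xml": "fmt/214",
    b"presentationml.presentation.main+xml": "fmt/215",
    b"wordprocessingml.document.main+xml": "fmt/412",
}


def identify_ooxml(fileobj):
    """Return a PUID based on [Content_Types].xml, or None if nothing matches."""
    with zipfile.ZipFile(fileobj) as zf:
        if "[Content_Types].xml" not in zf.namelist():
            return None
        data = zf.read("[Content_Types].xml")
    for needle, puid in CONTENT_TYPE_TO_PUID.items():
        if needle in data:
            return puid
    return None


def make_skeleton(content_type):
    """Build a tiny in-memory ZIP that mimics a skeleton file (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "[Content_Types].xml",
            '<Types><Override ContentType="%s"/></Types>' % content_type,
        )
    buf.seek(0)
    return buf


skeleton = make_skeleton(
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"
)
print(identify_ooxml(skeleton))  # fmt/214
```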

Do you have any advice on how to proceed with this?

replaceafill avatar Apr 19 '21 18:04 replaceafill

Hackathon 2023 Review: Selected for initial tasks. @replaceafill, sorry to do this again, but you're already here. I suggest prioritising this over #94, as it's likely a quicker win.

carlwilson avatar Jul 17 '23 13:07 carlwilson

@carlwilson if we remove these custom fido-fmt/... entries from format_extensions.xml to get fmt/189 for all the mentioned sample files, as explained by Ross and Ashley above, what would be an appropriate way to write a test for that?
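
One possible shape for such a test, as a minimal sketch only: parse fido's CSV output and assert on the PUID column. The column positions are inferred from the output pasted earlier in this issue, puids_by_file is a hypothetical helper, and SAMPLE_OUTPUT is copied from Ross's run; in a real test it would presumably be captured from running fido against the sample files.

```python
import csv
import io

# Copied from the output earlier in this issue (run without the custom entries).
SAMPLE_OUTPUT = '''\
OK,150,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",14860,"sampledocx.docx","None","signature"
OK,8,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",47215,"samplepptx.pptx","None","signature"
OK,9,fmt/189,"Microsoft Office Open XML","Microsoft Office Open XML",12050,"samplexlsx.xlsx","None","signature"
FIDO: Processed      3 files in 206.92 msec, 14 files/sec
'''


def puids_by_file(output):
    """Map each file name to its PUID, skipping non-result lines.

    Assumes column 2 is the PUID and column 6 is the file name, as in the
    output pasted above.
    """
    result = {}
    for row in csv.reader(io.StringIO(output)):
        if len(row) >= 7 and row[0] == "OK":
            result[row[6]] = row[2]
    return result


def test_ooxml_samples_identify_as_generic_fmt_189():
    assert puids_by_file(SAMPLE_OUTPUT) == {
        "sampledocx.docx": "fmt/189",
        "samplepptx.pptx": "fmt/189",
        "samplexlsx.xlsx": "fmt/189",
    }
```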

replaceafill avatar Jul 20 '23 21:07 replaceafill

That's a good question @replaceafill and one I'm a little too busy to think about right now. Feel free to have a think and suggest something; if not, I'll give this some serious thought the week starting 31/7.

carlwilson avatar Jul 26 '23 08:07 carlwilson