Wrong mime types and extension for .xlsx type files

Open Aashutosh05 opened this issue 1 year ago • 1 comments

I want to detect the extension and MIME type of files such as .xlsx and .xls. However, whenever I call from_string() on the contents of an .xlsx file, it returns .docx as the extension and 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' as the MIME type. from_file() returns the correct result, but my use case requires from_string(). I also tried writing the .xlsx data to a temp file, but from_file() still returns .docx unless I explicitly set the temp file's suffix to .xlsx.


In [24]: xl_file = "/Users/aashutosh.chaubey/Desktop/static_data/font.xlsx"

In [26]: da = open(xl_file, "rb").read()

In [27]: from_string(da)
Out[27]: '.docx'

In [28]: from_string(da, mime=True)
Out[28]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

In [35]: tmp_fl = tempfile.NamedTemporaryFile(delete=False, suffix='')

In [36]: tmp_fl.write(da)
Out[36]: 6109

In [37]: tmp_fl.name
Out[37]: '/var/folders/fd/_pnhhl3n4d9bnxngcxg8y0xw0000gr/T/tmps9uejuxr'

In [39]: from_file(tmp_fl.name)
Out[39]: '.docx'

In [40]: tmp_fl = tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx')

In [41]: tmp_fl.write(da)
Out[41]: 6109

In [42]: tmp_fl.name
Out[42]: '/var/folders/fd/_pnhhl3n4d9bnxngcxg8y0xw0000gr/T/tmpqa54ado8.xlsx'

In [43]: from_file(tmp_fl.name)
Out[43]: '.xlsx'

In [44]: from_file(tmp_fl.name, mime=True)
Out[44]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

In [45]: xl_file = "/Users/aashutosh.chaubey/Desktop/static_data/font.xlsx"

In [46]: da = open(xl_file, "rb").read()

In [47]: from_string(da)
Out[47]: '.docx'

In [48]: from_file(xl_file)
Out[48]: '.xlsx'

In [49]: from_file(xl_file, mime=True)
Out[49]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

In [50]: from_string(da, mime=True)
Out[50]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

I also ran magic_stream() on the file and received the following output:

In [6]: da = open(xl_file, "rb")

In [7]: magic_stream(da)
Out[7]:
[PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.docx', mime_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document', name='MS Office Open XML Format Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.pptx', mime_type='application/vnd.openxmlformats-officedocument.presentationml.presentation', name='MS Office Open XML Format Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlsx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', name='MS Office Open XML Format Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlsb', mime_type='application/vnd.ms-excel.sheet.binary.macroenabled.12', name='Microsoft Excel - Binary Workbook', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xltm', mime_type='application/vnd.ms-excel.template.macroenabled.12', name='Microsoft Excel - Macro-Enabled Template File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xltx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.template', name='Microsoft Office - OOXML - Spreadsheet Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlam', mime_type='application/vnd.ms-excel.addin.macroenabled.12', name='Microsoft Excel - Add-In File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.docm', mime_type='application/vnd.ms-word.document.macroEnabled.12', name='Microsoft Word - Macro-Enabled Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.dotx', mime_type='application/vnd.openxmlformats-officedocument.wordprocessingml.template', name='Microsoft Office - OOXML - Word Document Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.dotm', mime_type='application/vnd.ms-word.template.macroenabled.12', name='Microsoft Word - Macro-Enabled Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.pptm', mime_type='application/vnd.ms-powerpoint.presentation.macroEnabled.12', name='Microsoft PowerPoint - Macro-Enabled Presentation File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.potx', mime_type='application/vnd.openxmlformats-officedocument.presentationml.template', name='Microsoft Office - OOXML - Presentation Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.potm', mime_type='application/vnd.ms-powerpoint.template.macroenabled.12', name='Microsoft PowerPoint - Macro-Enabled Template File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlsm', mime_type='application/vnd.ms-excel.sheet.macroEnabled.12', name='Microsoft Excel - Macro-Enabled Workbook', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.zip', mime_type='application/zip', name='PKZIP Archive file', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xpi', mime_type='', name='Mozilla Browser Archive', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.wmz', mime_type='', name='Windows Media compressed skin file', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xpt', mime_type='', name='eXact Packager Models', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.kwd', mime_type='', name='KWord document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xps', mime_type='', name='XML paper specification file', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.jar', mime_type='application/java-archive', name='Java archive', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.odt', mime_type='application/vnd.oasis.opendocument.text', name='OpenDocument template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.odp', mime_type='application/vnd.oasis.opendocument.presentation', name='OpenDocument template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.ott', mime_type='application/vnd.oasis.opendocument.text-template', name='OpenDocument template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.sxd', mime_type='', name='OpenOffice documents', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.sxi', mime_type='', name='OpenOffice documents', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.sxw', mime_type='', name='OpenOffice documents', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.apk', mime_type='', name='Android Application Package', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.cbz', mime_type='application/vnd.comicbook+zip', name='Comic Book Archive (ZIP compression)', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.fb2.zip', mime_type='application/fictionbook2+zip', name='FictionBook 2 eBook file (Zip compressed)', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.fbz', mime_type='application/fictionbook2+zip', name='FictionBook 2 eBook file (Zip compressed)', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.fb3', mime_type='application/fictionbook3+zip', name='FictionBook 3 eBook file', confidence=0.4)]

Aashutosh05, Sep 27 '24 11:09

puremagic detects file types using only two pieces of information: the magic number, which, as you can see in the output above, is the same PK\x03\x04 for all of those file types, and the file extension.

Unfortunately, Microsoft decided that XLSX and other Office files should just be ZIP archives (you can take any XLSX file and unzip it to see), so they match the same magic number as every other ZIP-based format.

Determining the actual file type would require more advanced file scanning techniques; see https://github.com/cdgriffith/puremagic/issues/3
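In the meantime, one possible workaround for in-memory data is to look inside the ZIP container yourself. The snippet below is only a rough sketch, not something puremagic provides: guess_ooxml_extension is a hypothetical helper name, and it assumes the usual Office Open XML layout, where the package contains a [Content_Types].xml member and the main parts live under xl/, word/ or ppt/. It cannot tell .xlsx apart from macro-enabled or template variants such as .xlsm or .xltx, since those share the same folder layout.

```python
import io
import zipfile
from typing import Optional

# Assumed mapping from top-level folder to the most common extension.
# Macro-enabled and template variants (.xlsm, .dotx, ...) use the same folders,
# so this is only a coarse guess.
_OOXML_FOLDERS = {
    "xl/": ".xlsx",
    "word/": ".docx",
    "ppt/": ".pptx",
}


def guess_ooxml_extension(data: bytes) -> Optional[str]:
    """Hypothetical helper: guess .xlsx/.docx/.pptx from ZIP member names."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return None  # not a ZIP container at all

    # OOXML packages carry a content-types manifest at the archive root.
    if "[Content_Types].xml" not in names:
        return None  # plain ZIP, JAR, ODT, etc.

    for folder, extension in _OOXML_FOLDERS.items():
        if any(name.startswith(folder) for name in names):
            return extension
    return None
```

With the bytes from the report above, this could be used as guess_ooxml_extension(da), falling back to from_string(da) whenever it returns None.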

cdgriffith, Sep 28 '24 15:09

Should work a lot better in version 2, starting the beta now! https://github.com/cdgriffith/puremagic/releases/tag/2.0.0b1

cdgriffith, May 04 '25 22:05