python-magic
Trouble decoding non-UTF-8 file
  File "/Users/tzeppy/repo/breadsticks/env/lib/python3.7/site-packages/magic.py", line 148, in from_buffer
    return m.from_buffer(buffer)
  File "/Users/tzeppy/repo/breadsticks/env/lib/python3.7/site-packages/magic.py", line 80, in from_buffer
    return maybe_decode(magic_buffer(self.cookie, buf))
  File "/Users/tzeppy/repo/breadsticks/env/lib/python3.7/site-packages/magic.py", line 206, in maybe_decode
    return s.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 162: invalid continuation byte
This is for a Windows CDF Doc:
file f7f565b00ae52c0328c634bba21b48bcf43430535d9edd8c95e86a9d0cdfa624
f7f565b00ae52c0328c634bba21b48bcf43430535d9edd8c95e86a9d0cdfa624: Composite Document File V2 Document, Little Endian, Os: Windows, Version 10.0, Code page: 1252, Author: XXX, Template: Normal, Last Saved By: XXX, Revision Number: 9, Name of Creating Application: Microsoft Office Word, Total Editing Time: 2d+17:37:00, Last Printed: Fri Jan 23 11:43:00 2015, Create Time/Date: Mon Jun 27 11:04:00 2016, Last Saved Time/Date: Mon Jan 20 08:53:00 2020, Number of Pages: 1, Number of Words: 102, Number of Characters: 561, Security: 0
Author name redacted for privacy.
When I replace the UTF-8 decoding in maybe_decode() with 'cp1252', the error disappears. I think the problem is the author name, which contains an accented character.
I suggest replacing: s.decode('utf-8') with s.decode('utf-8', errors='replace')
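A minimal sketch of the failure, using a made-up byte string rather than real libmagic output (the sample text and the variable names are assumptions for illustration). The byte 0xE9 is "é" in cp1252, but in UTF-8 it announces a multi-byte sequence, so a strict UTF-8 decode fails when an ASCII character follows it:

```python
# Hypothetical sample resembling libmagic output for a cp1252 document.
raw = b"Code page: 1252, Author: Ren\xe9, Template: Normal"

# Strict UTF-8 decoding raises, as in the traceback above.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Decoding with the code page named in the output recovers the name.
as_cp1252 = raw.decode("cp1252")

# errors='replace' avoids the exception but substitutes U+FFFD for the byte.
as_replaced = raw.decode("utf-8", errors="replace")

print(strict_error)
print(as_cp1252)
print(as_replaced)
```

Note that the 'replace' variant no longer raises, but the accented character is gone from the result.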
Any chance you can share the file? I'd like a test case and am not sure how to create one.
Yes, I can give you one. I recommend you don't open it with MS Word, as it includes some macros that may or may not be malicious. But for python-magic purposes, it should do fine. You'll want to unzip it. It's a different file from the one in my post above, but it produces the same exception.
The attached file produces this output from from_file:
"""Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1, Code page: 1251, Author: user, Template: Normal.dotm, Last Saved By: Windows, Revision Number: 2, Name of Creating Application: Microsoft Office Word, Create Time/Date: Tue Jan 28 15:26:00 2020, Last Saved Time/Date: Tue Jan 28 15:26:00 2020, Number of Pages: 1, Number of Words: 0, Number of Characters: 1, Security: 0"""
Do you see the issue with this file?
Actually, it works fine when I try this on Linux (Ubuntu 18). But I do get the UnicodeDecodeError on my OS X laptop.
I can confirm this is a problem. It appears to happen with many other code pages. For the analysis that I do right now, the python-magic library is not usable. At the moment, I run the command-line file tool via subprocess, check the output for the code page, and then use that code page as the encoding of the output. Here is an example code snippet:
import pathlib
import subprocess

# Assumed sample directory; adjust to wherever the files live.
targets = pathlib.Path.home() / 'Desktop' / 'Samples'
target = targets / 'afec938fe1fb66750511d2e0de717c49731329df9483ed042d122feb96a3d1fa'

# Run the command-line file tool and keep its output as raw bytes.
process = subprocess.run(['file', '-b', str(target)], capture_output=True)
raw_output = process.stdout.strip()
print(raw_output)

# Pick a codec based on the code page reported in the output.
if b'Code page: 1251' in raw_output:
    encoding = 'cp1251'
elif b'Code page: 936' in raw_output:
    encoding = 'cp936'
elif b'Code page: 950' in raw_output:
    encoding = 'cp950'
elif b'Code page: 1252' in raw_output:
    encoding = 'cp1252'
elif b'Code page: 1254' in raw_output:
    encoding = 'cp1254'
elif b'Code page: 1256' in raw_output:
    encoding = 'cp1256'
elif b'Code page: 1250' in raw_output:
    encoding = 'cp1250'
elif b'Code page: -535' in raw_output:
    # Odd code page value seen in the dataset; cp950 decodes it.
    encoding = 'cp950'
elif b'Code page: 949' in raw_output:
    encoding = 'cp949'
else:
    encoding = 'utf-8'

magic = raw_output.decode(encoding)
print(magic)
I'd like to help find a fix. I can provide many samples of files that meet the criteria above, but they're all malicious, so I'll try to find nonmalicious files to share as test cases.
I suggest replacing: s.decode('utf-8') with s.decode('utf-8', errors='replace')
This would make the exception disappear, but it would not solve the problem.
That code snippet is a work in progress. I was originally using a regex and grabbing the numeric code page and then using it like so:
'cp{}'.format(encoding)
or
f'cp{encoding}'
That was before I encountered the -535 code page. I have been using this large if statement for the moment and will eventually convert it to three choices, as long as I don't find any more strange code pages in my dataset.
Here's a little bit better code snippet:
import pathlib
import re
import subprocess

samples = pathlib.Path.home() / 'Desktop' / 'Samples'
target = samples / 'afec938fe1fb66750511d2e0de717c49731329df9483ed042d122feb96a3d1fa'

# Capture the numeric code page, including negative values such as -535.
pattern = re.compile(b', Code page: (?P<codepage>-?[0-9]{1,5}), ')

process = subprocess.run(['file', '-b', str(target)], capture_output=True)
raw_output = process.stdout.strip()
print(raw_output)

match = pattern.search(raw_output)
if match:
    print(match.group('codepage'))
    codepage = int(match.group('codepage'))
else:
    codepage = None

if codepage == -535:
    # Special case: this code page decodes correctly as cp950.
    encoding = 'cp950'
elif not codepage:
    # No code page reported (or 0): fall back to UTF-8.
    encoding = 'utf-8'
else:
    encoding = f'cp{codepage}'

magic = raw_output.decode(encoding)
print(magic)
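The same idea can be sketched as a small helper that is easier to test without running file at all. This is only an illustration under assumptions: the function names guess_encoding and decode_file_output, the SPECIAL_CASES table, and the fallback to UTF-8 for code pages Python has no codec for are not part of python-magic or of the snippets above.

```python
import codecs
import re

# Same capture as the snippet above, without requiring surrounding commas.
CODE_PAGE_RE = re.compile(rb"Code page: (?P<codepage>-?[0-9]{1,5})")

# Code pages that do not map to a Python codec named "cp<number>".
# -535 -> cp950 is the one case reported in this thread.
SPECIAL_CASES = {-535: "cp950"}


def guess_encoding(raw_output: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on the 'Code page' field, or the default."""
    match = CODE_PAGE_RE.search(raw_output)
    if match is None:
        return default
    codepage = int(match.group("codepage"))
    name = SPECIAL_CASES.get(codepage, f"cp{codepage}")
    try:
        # Check that Python actually ships a codec with this name.
        codecs.lookup(name)
    except LookupError:
        return default
    return name


def decode_file_output(raw_output: bytes) -> str:
    """Decode file(1) output, escaping any bytes the codec cannot handle."""
    encoding = guess_encoding(raw_output)
    return raw_output.decode(encoding, errors="backslashreplace")


print(decode_file_output(b"Code page: 1252, Author: Ren\xe9"))
```

The codecs.lookup check keeps an unexpected code page number from raising LookupError at decode time, and backslashreplace keeps a wrong "Code page" field from raising UnicodeDecodeError.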
I figured out how to fix the problem. Rather than trying to do encoding or decoding or anything else, just gut all the Python 2/3 handling code. When the from_buffer() method looks like this, everything works as expected:
def from_buffer(self, buf):
    """
    Identify the contents of `buf`
    """
    with self.lock:
        try:
            return magic_buffer(self.cookie, buf)
        except MagicException as e:
            return self._handle509Bug(e)
The input is bytes and the output is bytes. No encoding or decoding decisions need to be made by python-magic at all. I would update all the code to do away with any Python 2 handling. Python 2 reached EOL 6 months ago and should not be supported in any libraries.
I have a monkey patch that fixes this issue on the from_buffer() method. I don't use any other methods, so if someone needs it, just do the same sort of thing to the method you need to use.
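import magic


def from_buffer(self, buf):
    # Return the raw bytes from libmagic without decoding them.
    with self.lock:
        try:
            return magic.magic_buffer(self.cookie, buf)
        except magic.MagicException as e:
            return self._handle509Bug(e)


# Replace the library method with the bytes-returning version.
magic.Magic.from_buffer = from_buffer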
I just tested this monkey patch on ~2000 files with many different encodings and it all works flawlessly.
@utkonos "I would update all the code to do away with any Python 2 handling."
This isn't python 2 handling specifically; when I first added support for python 3 it seemed most useful to return a str since these values are logically text. To be honest I didn't even realize at the time that libmagic could return non-ascii values here.
If you got bytes back, what would you do with them? Are you inferring the true encoding and then doing the decode yourself?
My inclination is to just make this .decode('utf-8', 'replace'), to silence the error. Thoughts?
I would recommend using .decode('utf-8', 'backslashreplace') since it causes less data loss. The core problem is the way that magic parses and emits data about Microsoft compound files. I think there may be other file types with a similar problem, but I just have not encountered them. Magic emits a few fields in whatever encoding they have in the file, and also emits a "Code page" field that is sometimes incorrect. If you choose UTF-8, then (mostly) any non-English Microsoft Office file will cause magic to emit encoded data that is not UTF-8.
There are two choices, both suboptimal:
- Run everything through UTF-8 and backslash encode anything that doesn't decode.
- Let the end user decide how to decode the output by leaving it as bytes.
https://docs.python.org/3/library/codecs.html#codec-base-classes
backslashreplace looks like a good choice, will go with that.
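To make the "less data loss" point concrete, here is a small comparison of the two error handlers on a made-up byte string (the sample and variable names are assumptions, not real libmagic output). With 'replace' the offending byte becomes U+FFFD and its value is gone; with 'backslashreplace' the value stays visible as an escape sequence.

```python
# Hypothetical sample: "é" encoded as cp1252 inside otherwise ASCII output.
raw = b"Author: Ren\xe9, Template: Normal"

replaced = raw.decode("utf-8", "replace")
escaped = raw.decode("utf-8", "backslashreplace")

print(replaced)  # the byte is replaced by U+FFFD
print(escaped)   # the byte appears as the four characters \xe9

# For this sample only, the original bytes can be rebuilt from the escaped
# text. This does not hold in general: it assumes the text is otherwise
# ASCII and contains no literal backslashes of its own.
recovered = escaped.encode("ascii").decode("unicode_escape").encode("latin-1")
print(recovered.decode("cp1252"))
```

A caller who knows the real code page can therefore still recover the name from the escaped form in simple cases, which is not possible once the byte has been replaced by U+FFFD.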
I'll add a from_buffer_bytes/from_file_bytes to Magic as well for anyone that wants the behavior like your monkeypatch.
Great! Thanks!!
I've changed to backslashreplace in a74c994b704d3476e2054cc6332c0a4c49ea1c69.
This was fixed in 0.4.27