tika-python
unpack() crashes on emails with UTF-8 characters in the headers
(I can provide a PR with files and pytest cases that correct this behavior.)
I've noticed that tika-python fails when I call tika.unpack() with a file or buffer containing an email that has UTF-8 characters in its headers.
Apache Tika unpacks the email without trouble, but tika-python breaks while reading the 'METADATA' file from the tarfile we retrieve from Tika.
This is the exception I've got:
======================================================================
ERROR: test_unpack_email_with_specialchars (__main__.CreateTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests_unpack.py", line 75, in test_unpack_email_with_specialchars
    unpacked = unpack.from_file(pfile)
  File "c:\users\xxx\documents\devel\tika-python\tika\unpack.py", line 44, in from_file
    return _parse(tarOutput)
  File "c:\users\xxx\documents\devel\tika-python\tika\unpack.py", line 81, in _parse
    for metadataLine in metadataReader:
  File "c:\users\xxx\documents\devel\tika-python\tika\unpack.py", line 123, in _truncate_nulls
    for line in s:
  File "c:\users\xxx\appdata\local\continuum\miniconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 195: character maps to <undefined>
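The traceback ends in the cp1252 codec, which suggests the metadata is being decoded as text with the Windows locale encoding rather than as UTF-8. The sketch below is not the tika-python code. It only illustrates that reading of the failure and one possible way around it, assuming the metadata member is UTF-8 encoded CSV. The member name `__METADATA__`, the two-column layout, and the helpers `build_tar` and `read_metadata` are hypothetical and exist only for this example.

```python
import csv
import io
import tarfile

# Hypothetical member name and CSV layout, used only for illustration.
METADATA_NAME = "__METADATA__"


def build_tar(metadata_text: str) -> bytes:
    """Build an in-memory tar holding one UTF-8 encoded metadata member."""
    payload = metadata_text.encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=METADATA_NAME)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_metadata(tar_bytes: bytes, encoding: str = "utf-8") -> dict:
    """Read the metadata member as CSV, decoding with an explicit encoding."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        member = tar.extractfile(METADATA_NAME)
        # Wrap the binary member so decoding does not depend on the locale.
        text = io.TextIOWrapper(member, encoding=encoding, newline="")
        return {row[0]: row[1] for row in csv.reader(text) if len(row) >= 2}


if __name__ == "__main__":
    # "Á" is encoded in UTF-8 as the bytes 0xC3 0x81.
    tar_bytes = build_tar('subject,"Ángel"\r\n')
    print(read_metadata(tar_bytes))
    try:
        read_metadata(tar_bytes, encoding="cp1252")
    except UnicodeDecodeError as exc:
        print("cp1252 fails:", exc.reason)
```

Byte 0x81 has no mapping in Python's cp1252 codec, so any UTF-8 sequence containing it (such as "Á") raises the same kind of error when decoded with the locale default on a Western European Windows setup. Passing the encoding explicitly keeps the behavior the same across platforms.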
Dear @igponce, yes, please provide a test case and a PR that fixes the behavior. Thank you.
No PR was provided, so closing. FWIW, I think this is an issue in the upstream library or environment, not in tika-python.