textract UTF-8 encoded files not properly decoded

Open workflowsguy opened this issue 7 years ago • 4 comments

I have a text file encoded in UTF-8 containing

This is a Text with Umlauts: äöüßÄÖÜ

Running print(textract.process(commandlineArguments.filename)) on this file under Python 3 gives

b'This is a Text with Umlauts: \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\xc3\x84\xc3\x96\xc3\x9c'

The same happens with a PDF file containing umlaut characters.

Adding an encoding='utf-8' parameter has no effect.

Feb 18 '18 16:02 workflowsguy

Is it related to textract? What happens when you decode your string? https://stackoverflow.com/a/37016987

Jul 05 '19 14:07 ignacy130

Is it related to textract?

Judging from the following code and its output, I'd say "yes":

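import textract

# textract.process() returns a bytes object, so print() shows a b'...' literal
text = textract.process('Umlauttest.txt')
print(text)
print('==================')
# Reading the same file in text mode yields a decoded str
with open('Umlauttest.txt', 'r') as file:
    text = file.read()
print(text)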

b'This is a text with Umlauts: \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\xc3\x84\xc3\x96\xc3\x9c\nDies ist ein Text mit Umlauten: \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\xc3\x84\xc3\x96\xc3\x9c\n'
==================
This is a text with Umlauts: äöüßÄÖÜ
Dies ist ein Text mit Umlauten: äöüßÄÖÜ

Jul 12 '19 19:07 workflowsguy

@workflowsguy I need to look into why textract is returning a bytes object and not a str object. In the meantime, you can do the following:

import textract as txt
text = txt.process("Umlauttest.txt")
text = text.decode("utf8")
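
To see the workaround without installing textract, here is a minimal sketch. It uses the bytes literal from the original report as a stand-in for the return value of textract.process(). The helper name to_text is hypothetical and not part of textract. The sketch also assumes the returned bytes are UTF-8 encoded, which matches the output shown in this thread.

```python
# Bytes literal copied from the output in the original report; it stands in
# for the value returned by textract.process() so this runs without textract.
raw = b'This is a Text with Umlauts: \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\xc3\x84\xc3\x96\xc3\x9c'


def to_text(data, encoding="utf-8"):
    """Hypothetical helper: decode bytes to str, pass str through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


text = to_text(raw)
print(type(raw).__name__, "->", type(text).__name__)
print(text)
```

Printing the decoded value shows the umlauts as characters instead of \x escape sequences, because print() on a bytes object shows its repr while print() on a str shows the text itself.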

Jul 25 '19 14:07 jpweytjens

Is this... on hiatus?

Mar 17 '20 00:03 filipopo
