textract
PDF extract failed!
When I extract text from a PDF, it outputs:
Traceback (most recent call last):
File "/usr/bin/textract", line 32, in
Could this be the same as https://github.com/deanmalmgren/textract/issues/107? It looks like pip install chardet==2.1.1 can solve the problem for Python 2.
Same error here, but NOT for all my PDF files.
Python 3.6.5 textract==1.6.1 chardet==2.3.0
"chardet.detect(text)" (utils.py, 64) returns {'encoding': None, 'confidence': 0.0}
text = textract.process(file, method='pdfminer')
Error:
UnboundLocalError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/textract/parsers/__init__.py in process(filename, encoding, extension, **kwargs)
     75
     76     parser = filetype_module.Parser()
---> 77     return parser.process(filename, encoding, **kwargs)
     78
     79

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in process(self, filename, encoding, **kwargs)
     44         # output encoding
     45         # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46         byte_string = self.extract(filename, **kwargs)
     47         unicode_string = self.decode(byte_string)
     48         return self.encode(unicode_string, encoding)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract(self, filename, method, **kwargs)
     29
     30         elif method == 'pdfminer':
---> 31             return self.extract_pdfminer(filename, **kwargs)
     32         elif method == 'tesseract':
     33             return self.extract_tesseract(filename, **kwargs)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
     46     def extract_pdfminer(self, filename, **kwargs):
     47         """Extract text from pdfs using pdfminer."""
---> 48         stdout, _ = self.run(['pdf2txt.py', filename])
     49         return stdout
     50

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in run(self, args)
     94         # pipe.wait() ends up hanging on large files. using
     95         # pipe.communicate appears to avoid this issue
---> 96         stdout, stderr = pipe.communicate()
     97
     98         # if pipe is busted, raise an error (unlike Fabric)
UnboundLocalError: local variable 'pipe' referenced before assignment
Looks similar to #261 and #256. Could you try again with textract 1.6.2? This version updates chardet to 3.0.4.
Hi, I am hitting this error when I try to textract this PDF: Acta_Acustica_High_frequency_mistuning_2018.pdf
I am using textract 1.6.1 (latest version available via pip install) and chardet 3.0.4.
The output of chardet on the same file is "no result":
$ chardet Acta_Acustica_High_frequency_mistuning_2018.pdf
Acta_Acustica_High_frequency_mistuning_2018.pdf: no result
UPDATE: @jpweytjens, I just saw your instructions in #261 on how to install a more recent textract, so I tried again after installing textract 1.6.3. The error is exactly the same:
$ textract Acta_Acustica_High_frequency_mistuning_2018.pdf
Traceback (most recent call last):
File "/home/asartori/Dropbox/OSC/manuscript_version_detection/venv/bin/textract", line 33, in
UPDATE 2: Just for completeness, textract does not run into errors if I use the pdfminer method, but it returns a bytes object rather than a string:
>>> text = textract.process("Acta_Acustica_High_frequency_mistuning_2018.pdf", method="pdfminer")
>>> text[0:100]
b'(cid:1)(cid:3)(cid:14)(cid:1) (cid:1)(cid:3)(cid:15)(cid:13)(cid:14)(cid:9)(cid:3)(cid:1) (cid:15)(c'
>>> type(text)
<class 'bytes'>
@afs25 I'm aware that textract returns bytes objects where it should be returning strings instead. In the meantime, you can decode the textract output with the appropriate encoding:
text = textract.process("Acta.pdf", method="pdfminer").decode("utf8")
As for the failure with chardet, I'm currently far away from any computer. Feel free to ping me again in 2 weeks if I haven't fixed these issues by then.
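A minimal sketch of a more defensive decode step, for cases like the one reported above where the encoding guess comes back as None. The helper name decode_output and its parameters are hypothetical and not part of textract; the guessed encoding would come from whatever detector you use, and UTF-8 is assumed as the fallback.

```python
def decode_output(raw, guessed_encoding=None, fallback="utf-8"):
    """Decode bytes with a guessed encoding, falling back if the guess is missing or wrong.

    Hypothetical helper: `guessed_encoding` may be None (e.g. when a detector
    reports no result), in which case `fallback` is used instead.
    """
    encoding = guessed_encoding or fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown encoding: decode with the fallback and mark
        # undecodable bytes with U+FFFD instead of raising.
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(decode_output(b"caf\xc3\xa9"))          # no guess available
    print(decode_output(b"abc", "not-a-codec"))   # unknown codec name
```

Replacing undecodable bytes hides problems in the output, so this is only reasonable as a stopgap while the detection issue is open.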
Hello, Sir. Any solution right now?
With pdftotext, there is absolutely no need to guess the encoding with chardet, because pdftotext always outputs UTF-8, unless specified otherwise with the -enc option:
$ man pdftotext|grep -C3 UTF-8
Generate an XHTML file containing bounding box information for each block, line, and word in the file.
-enc encoding-name
Sets the encoding to use for text output. This defaults to "UTF-8".
-listenc
Lists the available encodings
Please stop using chardet with pdftotext and just treat the output as valid UTF-8.
Your users would be very thankful. :)
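As a rough illustration of the suggestion above, pdftotext can be called directly and its output decoded as UTF-8 without any detection step. This is a sketch, not textract's implementation: the function names are hypothetical, and it assumes pdftotext (from poppler or xpdf) is on the PATH. Passing "-" as the output file sends the text to stdout.

```python
import subprocess


def pdftotext_command(pdf_path, encoding="UTF-8"):
    """Build the pdftotext command line (hypothetical helper).

    "-enc" sets the output encoding explicitly; "-" writes to stdout.
    """
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def extract_with_pdftotext(pdf_path):
    """Run pdftotext and decode its stdout as UTF-8, with no guessing."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(extract_with_pdftotext(sys.argv[1]))
```

Requesting the encoding explicitly with -enc means the decode step does not depend on the tool's default.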
What about other methods, e.g. do pdfminer or tesseract always return UTF-8? Should we attempt to use chardet from the textract package, or call it ourselves like this?
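from textract import process
from chardet import detect

text = process("file.pdf", method="tesseract", language="srp+srp_latn")
# detect() can report None as the encoding; fall back to UTF-8 in that case
encoding = detect(text)["encoding"] or "utf-8"
print(text.decode(encoding))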
pdftotext works well only for simple PDFs. pdfminer and tesseract work better for my file, but neither really returns correct results. I don't know how I should debug tesseract, as it doesn't directly support PDFs; textract uses pdftoppm, right? Complaining here makes no sense if I can't make it work with just the underlying tools.
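One way to debug this outside textract is to run the two steps by hand: rasterize the pages with pdftoppm, then OCR each image with tesseract. The sketch below assumes both tools are installed and on the PATH; the function names, the 300 dpi setting, and the "page" file prefix are illustrative choices, and whether textract runs exactly these commands is not confirmed here.

```python
import glob
import os
import subprocess
import tempfile


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Command that renders each PDF page to a PNG named <out_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def tesseract_command(image_path, language="srp+srp_latn"):
    """Command that OCRs one image and writes the text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", language]


def ocr_pdf(pdf_path, language="srp+srp_latn"):
    """Rasterize a PDF and OCR every page; returns (workdir, list of page texts).

    The images are left in workdir so they can be inspected by eye.
    """
    workdir = tempfile.mkdtemp(prefix="ocr_debug_")
    subprocess.run(
        pdftoppm_command(pdf_path, os.path.join(workdir, "page")), check=True
    )
    pages = []
    for image in sorted(glob.glob(os.path.join(workdir, "page*.png"))):
        result = subprocess.run(
            tesseract_command(image, language), stdout=subprocess.PIPE, check=True
        )
        pages.append(result.stdout.decode("utf-8"))
    return workdir, pages


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        directory, texts = ocr_pdf(sys.argv[1])
        print("page images kept in", directory)
        print("\n".join(texts))
```

Looking at the intermediate PNGs shows whether the problem is in the rendering step (blurry or blank pages) or in the OCR step (wrong language data or poor recognition).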