Pdfminer and Tesseract not found

Open ObitoSigma opened this issue 5 years ago • 3 comments

Using Python 3.7.6, Pip 20.0.2, Conda 4.8.2, Spyder 4.0.1, and Textract 1.6.3.

When I call textract.process('url', method='METHOD'), 'pdftotext' runs without a problem (but the PDF is not text-based, so it prints gibberish). When I try 'tesseract' or 'pdfminer', I get the following error(s), possibly two, which I'm hoping to resolve (the example below is for tesseract). I'm not well-versed in programming, so let me know if it's anything obvious.

Traceback (most recent call last):

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 82, in run
    pipe = subprocess.Popen(

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 104, in __init__
    super(SubprocessPopen, self).__init__(*args, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,

FileNotFoundError: [WinError 2] The system cannot find the file specified


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\untitled0.py", line 9, in <module>
    text = textract.process('C:/Users/hanto/Desktop/Peapod1.pdf', method='tesseract', language='eng')

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 33, in extract
    return self.extract_tesseract(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 61, in extract_tesseract
    page_content = TesseractParser().extract(page_path, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\image.py", line 20, in extract
    stdout, _ = self.run(args)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 90, in run
    raise exceptions.ShellError(

ShellError: The command `tesseract C:\Users\hanto\AppData\Local\Temp\tmpqsxyoes8\conv-1.ppm stdout -l eng` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

ObitoSigma, Feb 04 '20 00:02

Tesseract is an external dependency that is not automatically installed along with textract. Since you're using Conda, you should be able to install the package in this link.

Pdfminer, however, is a Python dependency and should have been installed with textract. Could you show the complete error log when trying the pdfminer method?
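
For context, the failure in the first traceback can be reproduced without textract. This is a minimal sketch using only the standard library, not textract's own code: `subprocess.Popen` raises `FileNotFoundError` when the operating system cannot find the named executable, and the traceback shows textract turning that into a `ShellError` with exit code 127. The helper `can_launch` and the command name are made up for illustration.

```python
import subprocess


def can_launch(command):
    """Return True if the OS can start `command`, False if the executable is missing."""
    try:
        proc = subprocess.Popen(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
    except FileNotFoundError:
        # Same exception type as in the first traceback: the executable was not found.
        return False
    proc.communicate()  # wait for the process to finish and drain its pipes
    return True


# Hypothetical command name, chosen so that it should not exist on any machine.
print(can_launch(["no-such-tool-for-this-demo"]))
```

If the same call returns False for `["tesseract", "--version"]`, the problem is most likely with the installation or PATH rather than with the PDF.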

jpweytjens, Feb 04 '20 08:02

I installed tesseract via the link but still got the same error message. Here is the error for pdfminer:

Traceback (most recent call last):

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 82, in run
    pipe = subprocess.Popen(

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 104, in __init__
    super(SubprocessPopen, self).__init__(*args, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,

FileNotFoundError: [WinError 2] The system cannot find the file specified


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Users\hanto\AppData\Local\Temp\untitled0.py", line 9, in <module>
    text = textract.process('C:/Users/hanto/Desktop/Peapod1.pdf', method='pdfminer', language='eng')

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])

  File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 90, in run
    raise exceptions.ShellError(

ShellError: The command `pdf2txt.py C:/Users/hanto/Desktop/Peapod1.pdf` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

ObitoSigma, Feb 04 '20 12:02

Do you have the pdf2txt binary installed on your computer?
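
One way to check this from inside the same Conda environment is sketched below. It assumes that looking the names up on PATH is a fair proxy for what textract does when it launches them; the helper name `find_tools` is hypothetical. On Windows, `shutil.which` also consults PATHEXT, so a `None` for `pdf2txt.py` is a hint to investigate rather than proof that the script is missing.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not found on PATH."""
    return {name: shutil.which(name) for name in names}


# The two commands textract tried to run in the tracebacks in this thread.
for name, path in find_tools(["tesseract", "pdf2txt.py"]).items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Run it from the same Spyder console that produced the error, since a different environment may have a different PATH.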

RaSan147, Mar 22 '21 02:03