Pdfminer and Tesseract not found
Using Python 3.7.6, Pip 20.0.2, Conda 4.8.2, Spyder 4.0.1, and Textract 1.6.3.
When I call textract.process('url', method='METHOD'), 'pdftotext' runs without error (but the PDF is not text-based, so it prints gibberish). When I try 'tesseract' or 'pdfminer', I get the error(s) below (two, I think), which I'm hoping to resolve; the example shown is for tesseract. I'm not well-versed in programming languages, so let me know if I'm missing anything obvious.
Traceback (most recent call last):
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 82, in run
pipe = subprocess.Popen(
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 104, in __init__
super(SubprocessPopen, self).__init__(*args, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\untitled0.py", line 9, in <module>
text = textract.process('C:/Users/hanto/Desktop/Peapod1.pdf', method='tesseract', language='eng')
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 33, in extract
return self.extract_tesseract(filename, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 61, in extract_tesseract
page_content = TesseractParser().extract(page_path, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\image.py", line 20, in extract
stdout, _ = self.run(args)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 90, in run
raise exceptions.ShellError(
ShellError: The command `tesseract C:\Users\hanto\AppData\Local\Temp\tmpqsxyoes8\conv-1.ppm stdout -l eng` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
Tesseract is an external dependency that is not automatically installed along with textract. Since you're using Conda, you should be able to install the package in this link.
Pdfminer, however, is a Python dependency and should have been installed along with textract. Could you post the complete error log you get when trying the pdfminer method?
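In case it helps narrow things down: the FileNotFoundError in the traceback is raised when the subprocess call cannot find the program it was asked to start. A rough way to check which of the command-line tools are visible from inside Spyder is the sketch below. It uses only the standard library; the helper name find_missing_tools is made up for illustration, and shutil.which only approximates the lookup Windows performs, so treat the result as a hint rather than proof.

```python
import shutil


def find_missing_tools(names):
    """Return the names that shutil.which cannot locate on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Commands that appear in this thread: pdftotext works for the reporter,
    # while tesseract and pdf2txt.py are the ones named in the ShellErrors.
    tools = ["pdftotext", "tesseract", "pdf2txt.py"]
    for name in tools:
        # Prints the resolved path, or None if nothing was found.
        print(name, "->", shutil.which(name))
    print("missing:", find_missing_tools(tools))
```

If a tool shows up as missing here but is installed, the likely explanation is that its folder is not on the PATH that the Spyder kernel sees.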
I installed tesseract via the link but still got the same error message. Here is the error for pdfminer:
Traceback (most recent call last):
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 82, in run
pipe = subprocess.Popen(
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 104, in __init__
super(SubprocessPopen, self).__init__(*args, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\hanto\AppData\Local\Temp\untitled0.py", line 9, in <module>
text = textract.process('C:/Users/hanto/Desktop/Peapod1.pdf', method='pdfminer', language='eng')
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
return self.extract_pdfminer(filename, **kwargs)
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
stdout, _ = self.run(['pdf2txt.py', filename])
File "C:\Users\hanto\Anaconda3\envs\myEnv\lib\site-packages\textract\parsers\utils.py", line 90, in run
raise exceptions.ShellError(
ShellError: The command `pdf2txt.py C:/Users/hanto/Desktop/Peapod1.pdf` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
Do you have the pdf2txt binary installed on your computer?
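One way to answer that from Python is to look for the script in the environment's own script folder and, if it is there, run it through the current interpreter instead of relying on PATH. This is only a sketch under assumptions: the folder names Scripts (Windows) and bin (elsewhere) are where pip and conda usually put such scripts, and locate_script and build_pdf2txt_command are hypothetical helpers, not part of textract or pdfminer.

```python
import os
import sys


def locate_script(name, search_dirs):
    """Return the first existing path to `name` inside search_dirs, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def build_pdf2txt_command(pdf_path, search_dirs=None):
    """Build an argv list that runs pdf2txt.py with the current interpreter."""
    if search_dirs is None:
        # Assumption: scripts live in <env>\Scripts on Windows, <env>/bin elsewhere.
        search_dirs = [
            os.path.join(sys.prefix, "Scripts"),
            os.path.join(sys.prefix, "bin"),
        ]
    script = locate_script("pdf2txt.py", search_dirs)
    if script is None:
        return None  # pdf2txt.py was not found in the searched folders
    return [sys.executable, script, pdf_path]


if __name__ == "__main__":
    # Path taken from the traceback in this thread.
    print(build_pdf2txt_command("C:/Users/hanto/Desktop/Peapod1.pdf"))
```

If this prints None, pdf2txt.py is probably not installed in the active environment. If it prints a command, passing that list to subprocess.run(cmd, capture_output=True) should show whether pdfminer itself works, which would point to PATH as the remaining problem.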