textract
On Windows, use of pdfminer results in "WindowsError: [Error 193] %1 is not a valid Win32 application"
textract.process(path_to_pdf, method='pdfminer')
The issue can be resolved by adding shell=True to the subprocess.Popen() call (circa line 82 in utils.py), but this is probably not an ideal workaround.
I'm having the same issue, but adding shell=True on line 82 of utils.py didn't fix it for me.
sys.version = 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]
textract.VERSION = 1.5.0
Full debug message:
OSError Traceback (most recent call last)
<ipython-input-80-aca88a5f0591> in <module>()
1 import textract
----> 2 textract.process('test.pdf', method='pdfminer')
C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\__init__.py in process(filename, encoding, **kwargs)
56 filetype_module = importlib.import_module(rel_module, 'textract.parsers')
57 parser = filetype_module.Parser()
---> 58 return parser.process(filename, encoding, **kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py in process(self, filename, encoding, **kwargs)
43 # output encoding
44 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 45 byte_string = self.extract(filename, **kwargs)
46 unicode_string = self.decode(byte_string)
47 return self.encode(unicode_string, encoding)
C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\pdf_parser.py in extract(self, filename, method, **kwargs)
29
30 elif method == 'pdfminer':
---> 31 return self.extract_pdfminer(filename, **kwargs)
32 elif method == 'tesseract':
33 return self.extract_tesseract(filename, **kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
46 def extract_pdfminer(self, filename, **kwargs):
47 """Extract text from pdfs using pdfminer."""
---> 48 stdout, _ = self.run(['pdf2txt.py', filename])
49 return stdout
50
C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py in run(self, args)
81 pipe = subprocess.Popen(
82 args, shell=True,
---> 83 stdout=subprocess.PIPE, stderr=subprocess.PIPE,
84 )
85
C:\ProgramData\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
705 c2pread, c2pwrite,
706 errread, errwrite,
--> 707 restore_signals, start_new_session)
708 except:
709 # Cleanup if the child failed starting.
C:\ProgramData\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
988 env,
989 cwd,
--> 990 startupinfo)
991 finally:
992 # Child is launched. Close the parent's copy of those pipe
OSError: [WinError 193] %1 is not a valid Win32 application
I usually work in Python 2.7 but I do have a Py3 environment set up.
Both environments have textract 1.5 installed via pip.
The issue occurred as reported in my Py3 environment; however, I could get it working using the shell=True workaround.
My only guess at the moment is that the module was not fully reloaded after you modified utils.py. Could you confirm that you restarted your IPython kernel after the utils.py file was changed?
shell=True creates a security vulnerability that was fixed in #114. I do not recommend using shell=True, particularly if you have textract connected to a web application that others can use.
I'm not sure how you're trying to use this within a jupyter notebook, but it's entirely possible that jupyter does not allow other executables to run on your system. The pdf parser runs shell commands outside of python. I don't know enough about jupyter notebooks, but my guess is that this is the source of the issue.
Can you confirm that this is still an issue with textract 1.6.1?
If so, can you also try to diagnose what the args variable is in the subprocess.Popen call here? It is odd to me that it is describing %1 as a command that doesn't exist, and I want to confirm that there isn't something bizarre happening before the subprocess call. Thanks!
Another good test would be to see if you can run the shell command directly on the filename from within your jupyter notebook. It will be the equivalent of something like pdf2txt.py FILENAME. If that works, then it seems like something strange must be happening with subprocess + jupyter. If it doesn't work, then perhaps your jupyter shell doesn't have access to the pdf2txt.py executable.
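One way to run that check from inside the notebook is to search the kernel process's own PATH for the script. This is only a minimal sketch: find_on_path is a hypothetical helper written for diagnosis, not part of textract, and it assumes the script is named pdf2txt.py as above.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of the first PATH entry containing `name`, or None.

    Hypothetical diagnostic helper; not part of textract.
    """
    if path is None:
        # Use the PATH seen by the current process (e.g. the notebook kernel)
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        # Windows PATH entries are sometimes wrapped in double quotes
        candidate = os.path.join(directory.strip('"'), name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling find_on_path("pdf2txt.py") in a notebook cell and again in a plain console session should show whether the two processes see the same PATH. A result of None suggests the Scripts directory is missing from that process's environment.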
Hi Dean,
Still an issue here using v1.6.1 and python 2.7 in a conda environment. The output below was identical between the command line and the jupyter notebook.
Using the provided process method I get an UnboundLocalError: local variable 'pipe' referenced before assignment.
Delving in:
import subprocess
args = ['pdf2txt.py', path_to_pdf]
pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)
This results in a WindowsError: [Error 193] %1 is not a valid Win32 application
Adding python to the list of args:
import subprocess
args = ['python', 'pdf2txt.py', path_to_pdf]
pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)
Result:
('', "python: can't open file 'pdf2txt.py': [Errno 2] No such file or directory\r\n")
Specifying the location of pdf2txt.py works:
import subprocess
args = ['python', path_to_pdf2txt, path_to_pdf]
pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)
Path to pdf2txt.py was not found in sys.path. Adding it to sys.path did not work.
This seems like it is connected to not having PDF Miner (the pdf2txt.py script) available from sys.path. pip install textract should install pdf2txt.py via the requirements in /usr/local/bin or equivalent on your windows system. From the command line, try typing which pdf2txt.py and make sure that the directory that contains pdf2txt.py is on your PYTHONPATH and/or sys.path.
How are you installing textract in your conda environment?
To the best of my knowledge, this does not appear to be a bug with textract, per se. Instead this appears to be an issue of the required dependencies not being available in Windows / conda / jupyter notebooks for some reason.
I installed it in the conda environment using pip.
I did install pdfminer.six myself after I realised it wasn't installed after pip install textract.
Just tested a straight pdf2txt.py through the command prompt and it worked.
where pdf2txt.py displays the correct path - in my case "d:\windows_utils\miniconda3\envs\textract\Scripts\pdf2txt.py"
Typing in echo %PATH% also lists the above path.
I then opened up the notebook and did the following:
import sys
print(sys.path)
and while the path to the environment appears, the Scripts directory does not...
Adding the path with sys.path.append does not make textract work, however.
So I surmise that you are correct, and it is an issue with jupyter notebook.
It also happens to me in the console; I didn't try jupyter. I am trying to make pdfminer work as a backup when the user does not have pdftotext installed on the system.
Indeed I can confirm the problem is that the path to "Scripts" folder is not in the Windows system path. The user must do that manually.
I can't ask my users to do that, so unfortunately I will have to package a copy of pdf2txt.py within my own package, as there is no reliable way to know where the "Scripts" folder is and add it dynamically to the path that is both cross-platform and also works when frozen using PyInstaller or Py2exe :-/
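For a regular (non-frozen) install, the standard library's sysconfig module can report the scripts directory of the running interpreter. The following is a rough sketch under that assumption only. It does not address the PyInstaller / Py2exe case described above, it may miss scripts installed with pip's --user option, and guess_script_path is a hypothetical helper name.

```python
import os
import sysconfig


def guess_script_path(name, scripts_dir=None):
    """Guess where an installed script lives for the running interpreter.

    Hypothetical helper: returns the full path if the file exists in the
    interpreter's scripts directory, otherwise None.
    """
    if scripts_dir is None:
        # "Scripts" on Windows, "bin" on most other platforms
        scripts_dir = sysconfig.get_path("scripts")
    candidate = os.path.join(scripts_dir, name)
    return candidate if os.path.isfile(candidate) else None
```

If guess_script_path("pdf2txt.py") returns None, bundling a local copy of the script, as in the wrapper that follows, remains the more predictable option.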
Here is my wrapper when you copy pdf2txt.py as a local submodule:
import os
from tempfile import mkstemp

from textract.parsers.utils import ShellParser

import pdf2txt  # local copy of pdfminer's pdf2txt.py, shipped as a submodule


class MyPdfMinerParser(ShellParser):
    """Extract text from pdf files using the native python PdfMiner library"""

    def extract(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer and pdf2txt.py wrapper."""
        # Create a temporary output file
        tempfilefh, tempfilepath = mkstemp(suffix='.txt')
        os.close(tempfilefh)  # close the handle so pdf2txt can write to the file
        # Extract text from pdf using the entry script pdf2txt (part of PdfMiner)
        pdf2txt.main(['', '-o', tempfilepath, filename])
        # Read the results of extraction
        with open(tempfilepath, 'rb') as f:
            res = f.read()
        # Remove temporary output file
        os.remove(tempfilepath)
        return res


pdfminerparser = MyPdfMinerParser()
result = pdfminerparser.process('path/to/file.pdf', 'utf8')
The following works on the Windows 10 system I can test on.
import subprocess
import sys
args = [sys.executable(), full_path_to_pdf2txt, path_to_pdf]
pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)
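The full path to pdf2txt.py can be found with something like
args = ["where", "pdf2txt.py"]
pipe = subprocess.Popen(args, stdout=subprocess.PIPE)
stdout, _ = pipe.communicate()  # communicate() returns a (stdout, stderr) tuple
full_path_to_pdf2txt = stdout.decode().splitlines()[0]  # where may list several matches; take the first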
I need to do some further testing, but I hope to include this in the next release of textract.
args = [sys.executable(), full_path_to_pdf2txt, path_to_pdf]
Thanks, this works on Windows, but I had to use sys.executable without the parentheses, since it is a string attribute rather than a function.
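Putting that correction together: sys.executable is a plain string holding the path of the running interpreter, so it goes into the argument list as-is. Below is a minimal sketch of the approach from the comments above. run_python_script is a hypothetical helper name, and the script path is assumed to have been located already (for example with where pdf2txt.py).

```python
import subprocess
import sys


def run_python_script(script_path, *script_args):
    """Run a .py script with the current interpreter.

    Returns the (stdout, stderr) tuple from communicate(), as bytes.
    Hypothetical helper; not part of textract.
    """
    # Pass the interpreter explicitly rather than asking the OS to execute
    # the .py file directly, which is what fails with Error 193 on Windows
    args = [sys.executable, script_path] + list(script_args)
    pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return pipe.communicate()
```

Usage would look like stdout, stderr = run_python_script(full_path_to_pdf2txt, path_to_pdf). Because the interpreter is named explicitly, this does not need shell=True.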