On Windows, use of pdfminer results in "WindowsError: [Error 193] %1 is not a valid Win32 application"

Open ConnectedSystems opened this issue 7 years ago • 12 comments

textract.process(path_to_pdf, method='pdfminer')

Issue can be resolved by adding shell=True to the subprocess.Popen() call (circa line 82 in utils.py), but this is probably not an ideal workaround.

ConnectedSystems avatar Apr 26 '17 07:04 ConnectedSystems

I'm having the same issue, but adding shell=True on line 82 of utils.py didn't fix it for me.

sys.version = 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]

textract.VERSION = 1.5.0

Full debug message:

OSError                                   Traceback (most recent call last)
<ipython-input-80-aca88a5f0591> in <module>()
      1 import textract
----> 2 textract.process('test.pdf', method='pdfminer')

C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\__init__.py in process(filename, encoding, **kwargs)
     56     filetype_module = importlib.import_module(rel_module, 'textract.parsers')
     57     parser = filetype_module.Parser()
---> 58     return parser.process(filename, encoding, **kwargs)

C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py in process(self, filename, encoding, **kwargs)
     43         # output encoding
     44         # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 45         byte_string = self.extract(filename, **kwargs)
     46         unicode_string = self.decode(byte_string)
     47         return self.encode(unicode_string, encoding)

C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\pdf_parser.py in extract(self, filename, method, **kwargs)
     29 
     30         elif method == 'pdfminer':
---> 31             return self.extract_pdfminer(filename, **kwargs)
     32         elif method == 'tesseract':
     33             return self.extract_tesseract(filename, **kwargs)

C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
     46     def extract_pdfminer(self, filename, **kwargs):
     47         """Extract text from pdfs using pdfminer."""
---> 48         stdout, _ = self.run(['pdf2txt.py', filename])
     49         return stdout
     50 

C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py in run(self, args)
     81         pipe = subprocess.Popen(
     82             args, shell=True,
---> 83             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
     84         )
     85 

C:\ProgramData\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    705                                 c2pread, c2pwrite,
    706                                 errread, errwrite,
--> 707                                 restore_signals, start_new_session)
    708         except:
    709             # Cleanup if the child failed starting.

C:\ProgramData\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
    988                                          env,
    989                                          cwd,
--> 990                                          startupinfo)
    991             finally:
    992                 # Child is launched. Close the parent's copy of those pipe

OSError: [WinError 193] %1 is not a valid Win32 application

liwenyip avatar Apr 29 '17 12:04 liwenyip

I usually work in Python 2.7 but I do have a Py3 environment set up. Both environments have textract 1.5 installed via pip.

The issue occurred as reported in my Py3 environment, however I could get it working using the shell=True workaround.

My only guess at the moment is that the module was not fully reloaded after you modified utils.py. Could you confirm that you restarted your IPython kernel after the utils.py file was changed?

ConnectedSystems avatar Apr 29 '17 14:04 ConnectedSystems

shell=True creates a security vulnerability that was fixed in #114. I do not recommend using shell=True, particularly if you have textract connected to a web application that others can use.

I'm not sure how you're trying to use this within a jupyter notebook, but it's entirely possible that jupyter does not allow other executables to run on your system. The pdf parser runs shell commands outside of python. I don't know enough about jupyter notebooks, but my guess is that this is the source of the issue.

deanmalmgren avatar Jun 16 '17 11:06 deanmalmgren

Can you confirm that this is still an issue with textract 1.6.1?

If so, can you also try to diagnose what the args variable is in the subprocess.Popen call here? It is odd to me that it is describing %1 as a command that doesn't exist and I want to confirm that there isn't something bizarre happening before the subprocess call. Thanks!

Another good test would be to see if you can run the shell command directly on the filename from within your jupyter notebook. It will be the equivalent of something like pdf2txt.py FILENAME. If that works, then it seems like something strange must be happening with subprocess + jupyter. If it doesn't work, then perhaps your jupyter shell doesn't have access to the pdf2txt.py executable.
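One way to act on the two checks suggested above is to look at how the first element of args resolves before subprocess.Popen is ever called. This is a minimal sketch, assuming Python 3.3 or later for shutil.which; describe_command is a hypothetical helper for illustration and is not part of textract.

```python
import shutil


def describe_command(args):
    """Report how the first element of an argument list resolves on PATH.

    Hypothetical diagnostic helper (not part of textract).
    """
    executable = args[0]
    # shutil.which returns the full path of the command, or None if the
    # command cannot be found on the PATH environment variable.
    resolved = shutil.which(executable)
    return {
        "executable": executable,
        "resolved": resolved,
        "on_path": resolved is not None,
    }


if __name__ == "__main__":
    # 'test.pdf' is a placeholder file name.
    print(describe_command(["pdf2txt.py", "test.pdf"]))
```

If "resolved" is None, the notebook or console process probably cannot see pdf2txt.py at all. If it is a path, the lookup succeeded and the failure is more likely in how the file is launched.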

deanmalmgren avatar Jul 21 '17 13:07 deanmalmgren

Hi Dean,

Still an issue here using v1.6.1 and Python 2.7 in a conda environment. The output below was identical between the command line and the jupyter notebook.

Using the provided process method I get an UnboundLocalError: local variable 'pipe' referenced before assignment.

Delving in:

import subprocess

args = ['pdf2txt.py', path_to_pdf]

pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)

Results in a WindowsError: [Error 193] %1 is not a valid Win32 application

Adding python to the list of args:

import subprocess

args = ['python', 'pdf2txt.py', path_to_pdf]

pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)

Result:

('', "python: can't open file 'pdf2txt.py': [Errno 2] No such file or directory\r\n")

Specifying the location of pdf2txt.py works:

import subprocess

args = ['python', path_to_pdf2txt, path_to_pdf]

pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)

Path to pdf2txt.py was not found in sys.path. Adding it to sys.path did not work.
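The three experiments above are consistent with Windows refusing to launch a .py file directly as a process, while the interpreter can run it when given the script's full path. That reading is an assumption based on the results shown, not something confirmed by textract itself. A minimal sketch of the working pattern, using sys.executable so that the same interpreter is reused; run_python_script is a hypothetical helper name:

```python
import subprocess
import sys


def run_python_script(script_path, *script_args):
    """Run a .py file with the current interpreter; return (stdout, stderr).

    Hypothetical helper: passing the interpreter explicitly avoids relying
    on the operating system to know how to execute a .py file.
    """
    args = [sys.executable, script_path] + list(script_args)
    pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return pipe.communicate()
```

Both returned values are byte strings, matching what the snippets above print.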

ConnectedSystems avatar Aug 07 '17 09:08 ConnectedSystems

This seems like it is connected to not having PDF Miner (the pdf2txt.py script) available from sys.path. pip install textract should install pdf2txt.py via the requirements in /usr/local/bin or equivalent on your windows system. From the command line, try typing which pdf2txt.py and make sure that the directory that contains pdf2txt.py is on your PYTHONPATH and/or sys.path.

How are you installing textract in your conda environment?

To the best of my knowledge, this does not appear to be a bug with textract, per se. Instead this appears to be an issue of the required dependencies not being available in Windows / conda / jupyter notebooks for some reason.

deanmalmgren avatar Aug 07 '17 14:08 deanmalmgren

I installed it in the conda environment using pip.

I did install pdfminer.six myself after I realised it wasn't installed after pip install textract.

Just tested a straight pdf2txt.py through the command prompt and it worked.

where pdf2txt.py displays the correct path - in my case "d:\windows_utils\miniconda3\envs\textract\Scripts\pdf2txt.py"

Typing in echo %PATH% also lists the above path.

I then opened up the notebook and did the following:

import sys
print(sys.path)

and while the path to the environment appears, the Scripts directory does not... Adding the path with sys.path.append does not make textract work however.
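For context, sys.path is the list Python searches when importing modules, whereas the PATH environment variable is what is searched for commands, so a directory can appear in one and not the other. A small sketch that reports both for a given directory; where_is_directory_listed is a hypothetical helper, and the directory passed in is whatever "where pdf2txt.py" reported on your system:

```python
import os
import sys


def where_is_directory_listed(directory, environ=None):
    """Report whether a directory is on PATH and/or on sys.path.

    Hypothetical helper. `environ` defaults to the current process
    environment and can be replaced by a plain dict for testing.
    """
    environ = os.environ if environ is None else environ

    def norm(path):
        # Normalise case and separators so comparisons behave on Windows.
        return os.path.normcase(os.path.normpath(path))

    target = norm(directory)
    path_dirs = [norm(p) for p in environ.get("PATH", "").split(os.pathsep) if p]
    import_dirs = [norm(p) for p in sys.path if p]
    return {"on_PATH": target in path_dirs, "on_sys_path": target in import_dirs}
```

A result of on_PATH True and on_sys_path False would match the observation above that echo %PATH% lists the Scripts directory while print(sys.path) does not.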

So I surmise that you are correct, and it is an issue with jupyter notebook

ConnectedSystems avatar Aug 08 '17 07:08 ConnectedSystems

It also happens to me in the console; I didn't try jupyter. I am trying to make pdfminer work as a backup when the user does not have pdftotext installed on the system.

lrq3000 avatar Nov 12 '17 17:11 lrq3000

Indeed I can confirm the problem is that the path to "Scripts" folder is not in the Windows system path. The user must do that manually.

I can't ask my users to do that, so unfortunately I will have to package a copy of pdf2txt.py within my own package, as there is no reliable way to find the "Scripts" folder and add it to the path dynamically that is both cross-platform and still works when frozen using PyInstaller or Py2exe :-/
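For non-frozen installs, the standard library's sysconfig module can report where the running interpreter expects its scripts directory to be. The sketch below is only a partial answer under stated assumptions: it has not been verified against conda environments, and it is unlikely to help inside a PyInstaller or Py2exe bundle, which is the case described above. guess_script_path is a hypothetical helper name.

```python
import os
import sysconfig


def guess_script_path(script_name="pdf2txt.py"):
    """Return the expected location of an installed script, or None.

    Hypothetical helper: asks sysconfig for the interpreter's scripts
    directory ("Scripts" on Windows, "bin" on most other systems) and
    checks whether the named file actually exists there.
    """
    scripts_dir = sysconfig.get_path("scripts")
    candidate = os.path.join(scripts_dir, script_name)
    return candidate if os.path.isfile(candidate) else None
```

When the function returns None, bundling a copy of the script, as described above, remains the fallback.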

lrq3000 avatar Nov 12 '17 18:11 lrq3000

Here is my wrapper when you copy pdf2txt.py as a local submodule:

import os
from tempfile import mkstemp

from textract.parsers.utils import ShellParser

# Assumption: the copied pdf2txt.py sits next to this module
import pdf2txt


class MyPdfMinerParser(ShellParser):
    """Extract text from pdf files using the native python PdfMiner library"""

    def extract(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer and pdf2txt.py wrapper."""
        # Create a temporary output file
        tempfilefh, tempfilepath = mkstemp(suffix='.txt')
        os.close(tempfilefh)  # close our handle so pdf2txt can write to the file
        # Extract text from pdf using the entry script pdf2txt (part of PdfMiner)
        pdf2txt.main(['', '-o', tempfilepath, filename])
        # Read the results of extraction
        with open(tempfilepath, 'rb') as f:
            res = f.read()
        # Remove temporary output file
        os.remove(tempfilepath)
        return res

pdfminerparser = MyPdfMinerParser()
result = pdfminerparser.process('path/to/file.pdf', 'utf8')

lrq3000 avatar Nov 12 '17 18:11 lrq3000

The following works on the Windows 10 system I can test on.

import subprocess
import sys

args = [sys.executable(), full_path_to_pdf2txt, path_to_pdf]

pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = pipe.communicate()
print(stdout, stderr)

The full path to pdf2txt.py can be found with something like

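args = ["where", "pdf2txt.py"]
pipe = subprocess.Popen(args, stdout=subprocess.PIPE)
stdout, _ = pipe.communicate()
# "where" may print several matches, one per line; take the first
full_path_to_pdf2txt = stdout.decode().splitlines()[0].strip()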

I need to do some further testing, but I hope to include this in the next release of textract.

jpweytjens avatar Oct 11 '19 09:10 jpweytjens

args = [sys.executable(), full_path_to_pdf2txt, path_to_pdf]

Thanks, this works on Windows, but I had to use sys.executable (without the parentheses).

Chrescht avatar Jun 24 '20 08:06 Chrescht