textract
Command line interface is broken on Windows
Describe the bug The command line interface of textract is broken on Windows. Even a simple command like 'textract -h' raises an exception.
$ textract -h
Traceback (most recent call last):
File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
main()
File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
parser = get_parser()
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
choices=_get_available_extensions(),
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 234, in compile
return _compile(pattern, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 924, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 420, in _parse_sub
not nested and not items))
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 501, in _parse
code = _escape(source, this, state)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 402, in _escape
raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \P at position 2
To Reproduce
- Install textract in a test environment and activate it.
$ cat /h/work/myrepos/rutils/python3/envs/env_test_textract.yml
name: test_textract
channels:
- defaults
dependencies:
- python=3.7
- pip
- pip:
- textract
$ conda env create -f env_test_textract.yml
$ source activate test_textract
- Run textract
$ textract -h
(Same traceback as shown above, ending in: re.error: bad escape \P at position 2)
Expected behavior The command should print the help page for textract and should not raise an exception.
Desktop (please complete the following information):
- OS: Windows 10 Enterprise, using 'git bash' version 2.18.0
- Textract version: 1.6.3
- Python version: 3.7.5
- Virtual environment: yes
Additional context As the exception indicates, the problem lies in _get_available_extensions() of https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L89
The relevant code is:
parsers_dir = os.path.join(os.path.dirname(__file__))
glob_filename = os.path.join(parsers_dir, "*" + _FILENAME_SUFFIX + ".py")
ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
The values of __file__ and glob_filename evaluate to:
C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py
C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\*_parser.py
I am able to reproduce the exception using these values as follows:
$ python
Python 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> glob_filename = 'C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\*_parser.py'
>>> ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 234, in compile
return _compile(pattern, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 924, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 420, in _parse_sub
not nested and not items))
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 501, in _parse
code = _escape(source, this, state)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 402, in _escape
raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \P at position 2
I think it is expecting forward slashes instead of backslashes in glob_filename, something like 'C:/ProgramData/Continuum/Anaconda/envs/test_textract/lib/site-packages/textract/parsers/*_parser.py'.
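An alternative to converting the slashes is to escape the path before turning it into a pattern, so the backslashes are treated as literal characters. The following is a minimal sketch of that idea, not the fix that was applied to textract. The helper name build_ext_regex is hypothetical, the value of _FILENAME_SUFFIX is inferred from the glob_filename shown above, and ntpath is used only so the Windows path handling can be tried on any platform.

```python
import ntpath  # Windows flavour of os.path, importable on every platform
import re

_FILENAME_SUFFIX = "_parser"  # assumed from the "*_parser.py" value shown above


def build_ext_regex(parsers_dir):
    """Hypothetical helper: compile a regex that captures the extension
    from a parser filename such as ...\\parsers\\pdf_parser.py."""
    glob_filename = ntpath.join(parsers_dir, "*" + _FILENAME_SUFFIX + ".py")
    # Escape the whole path first so backslashes and dots are literal,
    # then swap the (now escaped) wildcard for the named capture group.
    escaped = re.escape(glob_filename)
    return re.compile(escaped.replace(re.escape("*"), r"(?P<ext>\w+)"))


parsers_dir = (
    r"C:\ProgramData\Continuum\Anaconda\envs\test_textract"
    r"\lib\site-packages\textract\parsers"
)
ext_re = build_ext_regex(parsers_dir)
match = ext_re.match(parsers_dir + r"\pdf_parser.py")
print(match.group("ext"))  # pdf
```

Because the escaping happens before the wildcard is replaced, the same code works whether the path separator is a backslash or a forward slash.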
Thank you for the extensive bug report.
The paths returned by os.path.join contain literal backslashes. Printing a path shows each backslash once, while print(repr(__file__)) shows them in their escaped (doubled) form. The problem seems to come from the following changes in the re module. Quoting the official documentation:
Changed in Python version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter now are errors.
Changed in Python version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter now are errors.
This means the backslashes have to be escaped twice: once for the string literal, which doubles the number of backslashes, and once more so the re module is not confused, for a total of 4 backslashes for each literal backslash needed in a regex pattern. I'll post a fix for the Git version of textract. A larger upcoming update of textract will handle this more completely.
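The two levels of escaping can be seen in a short, self-contained example. This is only an illustration with a made-up path (C:\temp\pdf_parser.py), not code taken from textract.

```python
import re

# A Windows-style path as it exists in memory: two single backslashes.
path = "C:\\temp\\pdf_parser.py"

print(path)        # C:\temp\pdf_parser.py
print(repr(path))  # 'C:\\temp\\pdf_parser.py'

# To match one literal backslash, the regex engine needs to see two.
# In a plain string literal that takes four; in a raw string, two.
pattern_plain = "C:\\\\temp\\\\pdf_parser\\.py"
pattern_raw = r"C:\\temp\\pdf_parser\.py"

same = pattern_plain == pattern_raw
matched = re.fullmatch(pattern_raw, path) is not None
print(same, matched)  # True True
```

Raw strings remove one of the two escaping levels, which is why regex patterns are usually written with the r prefix.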
Thanks for the reply, Johannes Weytjens!
Are the 3.6 and 3.7 you mentioned above Python versions? I am getting a different exception with Python 3.5.6:
$ textract -h
Traceback (most recent call last):
File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
main()
File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
parser = get_parser()
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
choices=_get_available_extensions(),
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 224, in compile
return _compile(pattern, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 293, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 536, in compile
p = sre_parse.parse(p, flags)
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 834, in parse
raise source.error("unbalanced parenthesis")
sre_constants.error: unbalanced parenthesis at position 99
My conda environment file to reproduce the exception
$ cat env_test_textract.yml
name: test_textract
channels:
- defaults
dependencies:
- python=3.5
- pip
- pip:
- textract
$ python --version
Python 3.5.6 :: Anaconda, Inc.
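A likely explanation for the different error on Python 3.5: unknown escapes such as \P are not yet errors there, so parsing continues until the backslash that ends the directory path meets the opening parenthesis of the capture group. That forms the escape \(, a literal parenthesis, and leaves the closing parenthesis without a partner. In the full pattern the closing parenthesis sits at index 99, which matches the position in the message. The sketch below reproduces the effect on newer Python versions as well by using a made-up path (C:\data) whose only escape, \d, is a valid one.

```python
import re

# Made-up path whose only escape sequence (\d) is valid on every Python
# version, so the parser gets as far as the capture group.
glob_filename = r"C:\data\*_parser.py"
pattern = glob_filename.replace("*", r"(?P<ext>\w+)")
print(pattern)  # C:\data\(?P<ext>\w+)_parser.py

try:
    re.compile(pattern)
    message = None
except re.error as exc:
    # The path's trailing backslash escaped "(", so ")" has no partner.
    message = str(exc)

print(message)
```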
Yes, 3.6 and 3.7 are Python version numbers.
I can't immediately reproduce this issue with Python 3.5.4 on Windows 10. Could you try again with the Git version of textract? It includes a fix for the issues in Python 3.6 and above. You can install it with the following command:
pip install git+https://github.com/deanmalmgren/textract
The latest GitHub version gives an ImportError:
$ textract -h
Traceback (most recent call last):
File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 11, in <module>
from textract.cli import get_parser
File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 17, in <module>
from .parsers import DEFAULT_ENCODING, _get_available_extensions
ImportError: cannot import name 'DEFAULT_ENCODING' from 'textract.parsers' (C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py)
DEFAULT_ENCODING was defined in 1.6.1: https://github.com/deanmalmgren/textract/blob/v1.6.1/textract/parsers/__init__.py#L25 . I think it was renamed to DEFAULT_OUTPUT_ENCODING in the latest version, https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L25 , but not all the old references were cleaned up.
But even after changing all those DEFAULT_ENCODING occurrences in cli.py to DEFAULT_OUTPUT_ENCODING, I still get the same exception when I run 'textract -h'.
Thank you for pointing out that I missed changing DEFAULT_ENCODING everywhere. Even with that fixed, though, I can't reproduce the issue you are encountering. I will look into the problem further this weekend.