command line interface is broken on windows

Open KamarajuKusumanchi opened this issue 5 years ago • 5 comments

Describe the bug

The command line interface of textract is broken on Windows. Even a simple command like "textract -h" raises an exception.

$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
    main()
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
    parser = get_parser()
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
    choices=_get_available_extensions(),
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 234, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 402, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \P at position 2

To Reproduce

  1. Install textract in a test environment and activate it.
$ cat /h/work/myrepos/rutils/python3/envs/env_test_textract.yml
name: test_textract
channels:
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - textract

$ conda env create -f env_test_textract.yml

$ source activate test_textract
  2. Run textract
$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
    main()
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
    parser = get_parser()
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
    choices=_get_available_extensions(),
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 234, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 402, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \P at position 2

Expected behavior

The command should print the help page for textract and should not raise an exception.

Desktop (please complete the following information):

  • OS: Windows: 10 Enterprise, using 'git bash' version 2.18.0
  • Textract version: 1.6.3
  • Python version: 3.7.5
  • Virtual environment: yes

Additional context

As the exception indicates, the problem lies in _get_available_extensions() of https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L89

The relevant code is

    parsers_dir = os.path.join(os.path.dirname(__file__))
    glob_filename = os.path.join(parsers_dir, "*" + _FILENAME_SUFFIX + ".py")
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))

The __file__ and glob_filename are evaluated as

C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py
C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\*_parser.py

I am able to reproduce the exception using these values as follows:

$ python
Python 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> glob_filename = 'C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\*_parser.py'
>>> ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 234, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 402, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \P at position 2

I think it is expecting forward slashes instead of backward slashes in glob_filename. Something like 'C:/ProgramData/Continuum/Anaconda/envs/test_textract/lib/site-packages/textract/parsers/*_parser.py'
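A minimal sketch of one way to avoid the error, assuming the goal is only to pull the extension out of each parser filename: pass the literal parts of the path through re.escape so that only the named group is treated as regex syntax. The helper name build_ext_re is hypothetical, _FILENAME_SUFFIX = "_parser" is inferred from the glob shown above, and ntpath is used in place of os.path so the snippet behaves the same on any platform. This is not necessarily the fix textract adopted.

```python
import ntpath  # Windows flavour of os.path, importable on every platform
import re

# Assumed value, inferred from the "*_parser.py" glob shown above.
_FILENAME_SUFFIX = "_parser"


def build_ext_re(parsers_dir):
    """Hypothetical helper: build the extension regex from escaped literals."""
    # Joining with "" appends the trailing separator; re.escape then turns
    # every backslash (and other special character) into a literal match.
    prefix = re.escape(ntpath.join(parsers_dir, ""))
    suffix = re.escape(_FILENAME_SUFFIX + ".py")
    # Only the named group is left as real regex syntax.
    return re.compile(prefix + r"(?P<ext>\w+)" + suffix)


parsers_dir = (
    r"C:\ProgramData\Continuum\Anaconda\envs\test_textract"
    r"\lib\site-packages\textract\parsers"
)
ext_re = build_ext_re(parsers_dir)
match = ext_re.match(parsers_dir + r"\pdf_parser.py")
print(match.group("ext"))  # prints: pdf
```

Switching to forward slashes would also sidestep the backslash escapes, but escaping the literal parts keeps other regex metacharacters in the path (such as "." or "+") from being misread as well.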

KamarajuKusumanchi, Nov 12 '19 23:11

Thank you for the extensive bug report.

os.path.join returns paths with single backslashes; they only appear doubled in the escaped form, which you can see with print(repr(__file__)). The problem seems to be coming from the following changes in the re module. Copying from the official documentation:

Changed in Python version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter now are errors.
Changed in Python version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter now are errors.

Each backslash therefore has to be escaped twice: once for the Python string literal, which doubles the number of backslashes, and once more so that the re module does not read it as an escape. That gives a total of four backslashes in a non-raw literal for each literal backslash needed in a regex pattern. I'll post a fix for the Git version of textract. A larger upcoming update of textract will include this in a more complete way.
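To make the backslash counting concrete, here is a small sketch using a shortened, made-up path fragment rather than the real install path. It shows that the string holds a single backslash, that repr displays it doubled, and that a pattern matching it needs either four backslashes in a normal literal, two in a raw literal, or a call to re.escape.

```python
import re

# One real backslash in the string; the literal needs two to spell it.
path = "C:\\ProgramData"
print(len(path))   # prints: 14
print(repr(path))  # repr shows the backslash in its doubled, escaped form

# A regex needs "\\" to match one backslash. In a normal string literal
# that becomes four backslashes; in a raw literal it stays at two.
pattern_manual = "C:\\\\ProgramData"
pattern_raw = r"C:\\ProgramData"
same_pattern = pattern_manual == pattern_raw

# re.escape produces an equivalent pattern without manual counting.
pattern_escaped = re.escape(path)

matches_manual = re.fullmatch(pattern_manual, path) is not None
matches_escaped = re.fullmatch(pattern_escaped, path) is not None
print(same_pattern, matches_manual, matches_escaped)  # prints: True True True
```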

jpweytjens, Nov 13 '19 13:11

Thanks for the reply, Johannes Weytjens!

Are the 3.6 and 3.7 you mentioned above Python versions? I am getting a different exception with Python 3.5.6:

$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
    main()
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
    parser = get_parser()
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
    choices=_get_available_extensions(),
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 224, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 834, in parse
    raise source.error("unbalanced parenthesis")
sre_constants.error: unbalanced parenthesis at position 99

My conda environment file to reproduce the exception

$ cat env_test_textract.yml
name: test_textract
channels:
  - defaults
dependencies:
  - python=3.5
  - pip
  - pip:
    - textract
$ python --version
Python 3.5.6 :: Anaconda, Inc.
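One possible reading of this second traceback, not confirmed anywhere in this thread: it has the same root cause. On Python 3.5 unknown escapes such as \P are still tolerated, so compilation gets further, but the backslash that ends the directory part lands directly in front of the inserted group and turns its opening parenthesis into a literal "\(". The closing parenthesis is then unmatched. Counting characters in the full pattern puts that ")" at index 99, which agrees with the reported position. The sketch below reproduces the effect with a shortened, made-up tail of the path so that it also runs on newer Python versions.

```python
import re

# Shortened stand-in for the end of glob_filename (illustrative only).
glob_tail = "parsers\\*_parser.py"

# Same substitution as in _get_available_extensions().
pattern = glob_tail.replace("*", r"(?P<ext>\w+)")
print(pattern)  # the backslash now sits directly before "("

try:
    re.compile(pattern)
    error = None
except re.error as exc:
    error = exc

# "\(" is a literal parenthesis, so the later ")" closes nothing.
print(error.msg, "at position", error.pos)
```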

KamarajuKusumanchi, Nov 13 '19 23:11

Yes, 3.6 and 3.7 are Python version numbers.

I can't immediately reproduce this issue with Python 3.5.4 on Windows 10. Could you try again with the Git version of textract? It includes a fix for the issues in Python 3.6 and above. You can install it with the following command:

pip install git+https://github.com/deanmalmgren/textract

jpweytjens, Nov 14 '19 08:11

The latest GitHub version gives an ImportError.

$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 11, in <module>
    from textract.cli import get_parser
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 17, in <module>
    from .parsers import DEFAULT_ENCODING, _get_available_extensions
ImportError: cannot import name 'DEFAULT_ENCODING' from 'textract.parsers' (C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py)

DEFAULT_ENCODING was defined in 1.6.1: https://github.com/deanmalmgren/textract/blob/v1.6.1/textract/parsers/__init__.py#L25 . I think it was renamed to DEFAULT_OUTPUT_ENCODING in the latest version, https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L25 , but not all the old references were cleaned up.

But even after changing all those DEFAULT_ENCODING occurrences in cli.py to DEFAULT_OUTPUT_ENCODING, I still get the same exception when I run 'textract -h'.

KamarajuKusumanchi, Nov 14 '19 17:11

Thank you for pointing out that I missed changing DEFAULT_ENCODING everywhere. Nevertheless, even after fixing this I can't reproduce the issue you encountered. I will look into the problem further this weekend.

jpweytjens, Nov 27 '19 08:11