textract

Drop python2 support

Open tehabstract opened this issue 3 years ago • 2 comments

This PR drops Python 2 support and loosens the dependency pins. Please comment if you'd like the dependencies specified in a different format, or want any other changes, and I will adjust.

Introduced openpyxl for xlsx files. Updated 2 test files:

  • pdf - since recent versions of pdfminer.six parse the file a little differently.
  • xlsx - since openpyxl parses the file a little differently than xlrd, notably booleans ( 1 -> True )

Updated travis, vagrant, dockerfile in tests.

Upped the version to 1.7.0, added to changelog.

Thanks

tehabstract avatar Aug 18 '22 15:08 tehabstract

@deanmalmgren Any chance this could get looked into? Python 2 reached end of life on Jan 1, 2020, and the older packages textract requires in order to work with 2.7 cause conflicts. In particular, our team would appreciate bumping pdfminer.six to a newer version.

pdfminer.six >= 20200726 is required for using unstructured, which is required by langchain!
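
For anyone who wants to check an environment against that minimum before installing, here is a minimal sketch. It assumes pdfminer.six keeps its date-style version strings (such as 20200726); the helper names `release_number`, `meets_minimum`, and `installed_pdfminer_ok` are made up for illustration and are not part of textract or pdfminer.six.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum quoted in the comment above.
MIN_PDFMINER = "20200726"


def release_number(release: str) -> int:
    """Return the leading run of digits of a date-style version as an int."""
    digits = ""
    for ch in release:
        if not ch.isdigit():
            break
        digits += ch
    if not digits:
        raise ValueError(f"no leading digits in version {release!r}")
    return int(digits)


def meets_minimum(installed: str, minimum: str = MIN_PDFMINER) -> bool:
    """True if the installed release is at least the minimum release."""
    return release_number(installed) >= release_number(minimum)


def installed_pdfminer_ok() -> bool:
    """Check the pdfminer.six found in the current environment, if any."""
    try:
        return meets_minimum(version("pdfminer.six"))
    except PackageNotFoundError:
        # Not installed at all, so the requirement is not met.
        return False
```

Calling `installed_pdfminer_ok()` in the environment where textract is installed gives a quick yes/no on whether the pinned pdfminer.six is new enough.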

twolfvb avatar Jun 16 '23 16:06 twolfvb

Quick note that I've tested this patch lightly. The only problem I've found so far relates to a change in Python's subprocess module, where the mswindows attribute is now the private name _mswindows:

diff --git a/textract/parsers/utils.py b/textract/parsers/utils.py
index 11ec8a1..efb0d9c 100755
--- a/textract/parsers/utils.py
+++ b/textract/parsers/utils.py
@@ -83,7 +83,7 @@ class ShellParser(BaseParser):
         """

         # run a subprocess and put the stdout and stderr on the pipe object
-        if subprocess.mswindows:
+        if subprocess._mswindows:
             startupinfo = subprocess.STARTUPINFO()
             startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
         else:

Otherwise it's been working well for me.
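
Since `_mswindows` is a private name, one possible alternative is to test the platform through the public `sys.platform` value instead. The sketch below assumes that check gives the same answer as `subprocess._mswindows` on standard CPython builds; `is_windows` and `build_startupinfo` are hypothetical helpers, not existing textract functions.

```python
import subprocess
import sys


def is_windows() -> bool:
    """Detect Windows via the public sys.platform value."""
    return sys.platform == "win32"


def build_startupinfo():
    """Return a STARTUPINFO with the show-window flag set on Windows, else None."""
    if not is_windows():
        return None
    # STARTUPINFO and STARTF_USESHOWWINDOW only exist on Windows builds.
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    return startupinfo
```

The result can be passed as `startupinfo=build_startupinfo()` to `subprocess.Popen`, which accepts `None` on every platform.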

thehunmonkgroup avatar Jul 09 '23 22:07 thehunmonkgroup