textract extension recognition not working


Open ichfly opened this issue 7 years ago • 0 comments

I was trying to use textract on an HTML file, but the extension is detected as txt. I am using Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 20:20:57) [MSC v.1600 64 bit (AMD64)] on win32. The Windows version is Windows 8.1.

The code that is not working is

import textract
textract.process(".html")

The working code is

import textract
textract.process(".html", extension=".html")

It looks like the line _, ext = os.path.splitext(filename) in https://github.com/deanmalmgren/textract/blob/117ea191d93d80321e4bf01f23cc1ac54d69a075/textract/parsers/init.py#L54 always returns ext = "", so the extension is never picked up from the filename.
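For reference, here is a minimal sketch of how os.path.splitext behaves on the two kinds of names involved. The helper guess_extension is hypothetical and only mirrors the splitext call quoted above; it is not textract's actual code. It assumes the path passed in is literally ".html", as in the snippets in this report. In that case the leading dot is treated as part of the file name rather than as an extension separator, which may be why ext comes back empty.

```python
import os.path


def guess_extension(filename):
    """Hypothetical helper: return the extension the way the quoted line would."""
    # splitext ignores leading dots in the base name, so ".html" has no extension.
    _, ext = os.path.splitext(filename)
    return ext.lower()


# A name that consists only of a leading dot plus text yields an empty extension.
print(repr(guess_extension(".html")))      # ''
# A name with a base part before the dot yields the expected extension.
print(repr(guess_extension("page.html")))  # '.html'
```

If the real file has a base name before the dot (for example page.html, a made-up name), splitext does return ".html", so in that case the cause would have to be somewhere else.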

Feb 26 '18 14:02 ichfly
