textract
extension recognition not working
I was trying to use textract on an HTML file, but the extension is detected as txt. I am using Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 20:20:57) [MSC v.1600 64 bit (AMD64)] on win32. The Windows version is Windows 8.1.
The code that is not working is:
import textract
textract.process(".html")
The working code is:
import textract
textract.process(".html", extension=".html")
It looks like _, ext = os.path.splitext(filename) in https://github.com/deanmalmgren/textract/blob/117ea191d93d80321e4bf01f23cc1ac54d69a075/textract/parsers/init.py#L54 always returns ext = "" for this file.
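One possible explanation, assuming the path passed in really is the bare name ".html" as written in the snippets above: os.path.splitext treats a leading dot in the basename as part of the root, so a name that consists only of a dot and a suffix has no extension at all. A minimal sketch of that behaviour follows; split_extension is a hypothetical helper and "page.html" is a made-up name used only for comparison, neither comes from textract.

```python
import os.path

def split_extension(filename):
    """Return the extension that os.path.splitext reports for filename."""
    _, ext = os.path.splitext(filename)
    return ext

# Leading dot: the whole name is treated as the root, so no extension is found.
dotfile_ext = split_extension(".html")

# Hypothetical name with a non-empty stem: the extension is found as expected.
regular_ext = split_extension("page.html")

print(repr(dotfile_ext), repr(regular_ext))
```

If that is what is happening here, passing extension=".html" explicitly would sidestep the detection step, which matches the working call above.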