
add ability to specify custom extraction methods across different file types.

Open deanmalmgren opened this issue 8 years ago • 7 comments

@barrust brought this up in #122

When iterating over a large number of files, it is difficult to specify non-standard method kwargs for different filetypes. For example, currently method is used for PDFs and engine is used for audio files:

for filename in filenames:
    txt = textract.process(filename, method='tesseract', engine='sphinx')
    print(txt)

I personally like the simplicity of always having a method kwarg for the textract.process function, but what if we gave users the ability to test what extension a file has before it is processed, so they can easily handle PDFs vs. audio files, for example? I'm thinking of something like this:

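for filename in filenames:
    # get_extension and is_audio are the proposed API; they do not exist yet
    ext = textract.get_extension(filename)
    kwargs = {}  # other file types keep the default method
    if ext == 'pdf':
        kwargs = {'method': 'tesseract'}
    elif ext.is_audio:
        kwargs = {'method': 'sphinx'}
    txt = textract.process(filename, **kwargs)
    print(txt)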

Another approach would be to turn the method kwarg to also accept a dictionary:

methods = {
    'pdf': 'tesseract',
    'audio': 'sphinx',
}
for filename in filenames:
    txt = textract.process(filename, method=methods)
    print(txt)
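
For illustration only, here is a minimal sketch of how a dictionary like this could be resolved to a single method per file. The helper name resolve_method and the AUDIO_EXTENSIONS grouping are assumptions made for the example; neither is part of textract.

```python
import os

# Hypothetical grouping of extensions covered by the 'audio' key.
AUDIO_EXTENSIONS = {"mp3", "ogg", "wav"}


def resolve_method(filename, methods, default=""):
    """Return the method configured for this file's type, or `default`."""
    # Extension without the leading dot, lowercased ("Scan.PDF" -> "pdf").
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in methods:
        return methods[ext]
    if ext in AUDIO_EXTENSIONS and "audio" in methods:
        return methods["audio"]
    return default


methods = {"pdf": "tesseract", "audio": "sphinx"}
for name in ["scan.pdf", "talk.wav", "notes.txt"]:
    print(name, repr(resolve_method(name, methods)))
```

An exact extension match is checked before the 'audio' group, so a user could still override a single audio format with its own key.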

deanmalmgren avatar Mar 24 '17 09:03 deanmalmgren

Another issue here is thinking about how we will deal with this on the command line...

deanmalmgren avatar Mar 24 '17 10:03 deanmalmgren

What about making it a comma-delimited list? Each function with multiple extraction methods would have to handle splitting it, but it could solve the CLI issue.

for filename in filenames:
    txt = textract.process(filename, method='tesseract,sphinx')
    print(txt)

I would assume that the CLI would require that there be no spaces between different methods.
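
As a rough sketch of the splitting each parser would have to do, the hypothetical helper below picks the first method in the list that a given parser supports. The name pick_method and the two sets of method names are illustrative assumptions, not textract's actual internals.

```python
def pick_method(method, supported, default=""):
    """Return the first comma-separated candidate found in `supported`."""
    # Tolerate stray whitespace and empty items such as "a,,b".
    candidates = [m.strip() for m in method.split(",") if m.strip()]
    for candidate in candidates:
        if candidate in supported:
            return candidate
    return default


# Illustrative method names for two parsers (assumed for this example).
PDF_METHODS = {"pdftotext", "pdfminer", "tesseract"}
AUDIO_METHODS = {"google", "sphinx"}

print(pick_method("tesseract,sphinx", PDF_METHODS))
print(pick_method("tesseract,sphinx", AUDIO_METHODS))
```

One open question with this design is what to do when two file types share a method name, since the list carries no information about which type each entry is meant for.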

I also like the other proposed ideas, but I'm not sure how they would work with the CLI. The idea of getting back extensions would be great!

barrust avatar Mar 24 '17 15:03 barrust

Below are a couple of ideas for the CLI. There are probably others (please share if you have ideas!), but I think I have a slight preference for the configuration file approach. It would let users specify many kwargs for textract.process at once. Then, instead of overloading the method kwarg to textract.process, we could use a configuration object to override defaults.

Thoughts and feedback welcome! This is definitely a major change to the UI and would warrant a major version bump to 2.0.0, so I want to make sure we get this right.

hyphenated command line arg

for f in directory/*; do
    textract --method-pdf tesseract --method-audio sphinx $f
done

colon-ized command line value

for f in directory/*; do
    textract --method pdf:tesseract --method audio:sphinx $f
done
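
A minimal sketch of how the colon-separated form could be parsed with argparse, assuming a repeatable --method option. The functions method_pair and parse_methods are hypothetical and do not reflect textract's current command-line code.

```python
import argparse


def method_pair(value):
    """Parse 'TYPE:METHOD' into a (type, method) tuple."""
    file_type, sep, method = value.partition(":")
    if not sep or not file_type or not method:
        raise argparse.ArgumentTypeError(
            "expected TYPE:METHOD, got %r" % value
        )
    return file_type, method


def parse_methods(argv):
    """Return (filename, {type: method}) from a list of CLI arguments."""
    parser = argparse.ArgumentParser(prog="textract")
    # action="append" collects every --method occurrence into one list.
    parser.add_argument("--method", action="append", default=[],
                        type=method_pair)
    parser.add_argument("filename")
    args = parser.parse_args(argv)
    return args.filename, dict(args.method)


print(parse_methods(
    ["--method", "pdf:tesseract", "--method", "audio:sphinx", "doc.pdf"]
))
```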

json command line value

for f in directory/*; do
    textract --method '{"pdf":"tesseract","audio":"sphinx"}' $f
done

conf file

for f in directory/*; do
    textract --conf textract.conf $f
done

deanmalmgren avatar Mar 24 '17 16:03 deanmalmgren

I like the configuration file option better than having to form valid JSON! The other two options of colons or hyphens are both good, but I think the configuration file will likely be more future-proof.

barrust avatar Mar 28 '17 14:03 barrust

Thanks for the input @barrust

INI format? YAML?

I think I have a small preference for YAML but I welcome arguments from others.

If I have time, I may try to mock this up on my flight back to Chicago on Friday. Sounds fun :)

deanmalmgren avatar Mar 28 '17 18:03 deanmalmgren

I'm leaning toward INI format with this so people can set it in their project's setup.cfg. I also think we could probably address #96 at the same time which would be very nice.
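
To make the INI idea concrete, here is a small sketch using Python's standard configparser. The [textract] section name, its keys, and the load_methods helper are all assumptions for illustration; no such configuration format has been settled on in this thread.

```python
import configparser


def load_methods(text, section="textract"):
    """Return {file type: method} from INI text, or {} if no section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(section):
        return {}
    return dict(parser.items(section))


# Hypothetical contents of a project's setup.cfg.
EXAMPLE = """
[textract]
pdf = tesseract
audio = sphinx
"""

print(load_methods(EXAMPLE))
```

To read an actual file rather than a string, parser.read("setup.cfg") could be used in place of read_string.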

deanmalmgren avatar Jul 21 '17 14:07 deanmalmgren

It's been a long time but this works for now if anyone's interested and checking this out:

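from os.path import splitext
from textract import process

# Map file extensions to the extraction method to use for that type.
switcher = {
    "pdf": "pdfminer",
    "mp3": "SpeechRecognition",
}

filenames = ["hoho.txt", "asdf.pdf", "example.mp3"]
for filename in filenames:
    # Extension without the leading dot, e.g. "pdf".
    ext = splitext(filename)[1][1:]
    # Unlisted extensions get an empty method string.
    method = switcher.get(ext, "")
    text = process(filename, method=method)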

filipopo avatar Mar 17 '20 00:03 filipopo