Add ability to specify custom extraction methods across different file types.
@barrust brought this up in #122
When iterating over a large number of files, it is difficult to specify non-standard method kwargs for different filetypes. For example, currently method is used for PDFs and engine is used for audio files:
for filename in filenames:
    txt = textract.process(filename, method='tesseract', engine='sphinx')
    print(txt)
I personally like the simplicity of always having a method kwarg for the textract.process function. But what if we gave users the ability to check what extension a file has before it is processed, so they can easily handle PDFs vs. audio files? I'm thinking of something like this:
for filename in filenames:
    ext = textract.get_extension(filename)
    if ext == 'pdf':
        kwargs = {'method': 'tesseract'}
    elif ext.is_audio:
        kwargs = {'method': 'sphinx'}
    else:
        kwargs = {}  # no override for other file types
    txt = textract.process(filename, **kwargs)
    print(txt)
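To make the idea above concrete, here is a minimal sketch of what such a helper might look like. Neither `get_extension` nor `is_audio` exists in textract today; the `Extension` class and the `AUDIO_EXTENSIONS` set are hypothetical names invented for illustration, and the list of audio extensions is only an example.

```python
from os.path import splitext

# Hypothetical set of audio extensions; not textract's own list.
AUDIO_EXTENSIONS = {"mp3", "ogg", "wav"}


class Extension(str):
    """A string subclass, so `ext == 'pdf'` still works, plus a type check."""

    @property
    def is_audio(self):
        return self in AUDIO_EXTENSIONS


def get_extension(filename):
    """Return the lowercased extension of filename without the leading dot."""
    return Extension(splitext(filename)[1][1:].lower())


for name in ["report.PDF", "talk.mp3", "notes.txt"]:
    ext = get_extension(name)
    print(name, ext, ext.is_audio)
```

Subclassing `str` keeps plain comparisons such as `ext == 'pdf'` working while still allowing category checks like `ext.is_audio`.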
Another approach would be to let the method kwarg also accept a dictionary:
methods = {
    'pdf': 'tesseract',
    'audio': 'sphinx',
}
for filename in filenames:
    txt = textract.process(filename, method=methods)
    print(txt)
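One way the dictionary form could be resolved internally is sketched below. The `resolve_method` helper and the `AUDIO_EXTENSIONS` set are hypothetical and not part of textract; the only point is that a plain string keeps behaving as it does now, while a dictionary is looked up first by exact extension and then by a broader category such as 'audio'.

```python
from os.path import splitext

# Hypothetical grouping of extensions into the 'audio' category.
AUDIO_EXTENSIONS = {"mp3", "ogg", "wav"}


def resolve_method(filename, method=None):
    """Return the method name to use for filename, or None if unspecified."""
    if method is None or isinstance(method, str):
        return method  # a plain string is passed through unchanged
    ext = splitext(filename)[1][1:].lower()
    if ext in method:
        return method[ext]  # an exact extension match wins
    if ext in AUDIO_EXTENSIONS:
        return method.get("audio")  # fall back to the category key
    return None


methods = {"pdf": "tesseract", "audio": "sphinx"}
for name in ["scan.pdf", "talk.mp3", "notes.txt"]:
    print(name, resolve_method(name, methods))
```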
Another issue here is thinking about how we will deal with this on the command line...
What about making it a comma-delimited list? Each function with multiple extraction methods would have to handle splitting it, but it could solve the CLI issue.
for filename in filenames:
    txt = textract.process(filename, method='tesseract,sphinx')
    print(txt)
I would assume that the CLI would require that there be no spaces between different methods.
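A rough sketch of the splitting each parser would need for the comma-delimited form follows. The `pick_method` helper is hypothetical, and the method sets and defaults are illustrative assumptions rather than textract's actual internals: each parser takes the first candidate it recognizes and otherwise falls back to its own default.

```python
def pick_method(method, supported, default):
    """Return the first comma-separated candidate found in supported."""
    for candidate in method.split(","):
        if candidate in supported:
            return candidate
    return default


# Illustrative method sets; assumptions for this sketch only.
PDF_METHODS = {"pdftotext", "pdfminer", "tesseract"}
AUDIO_METHODS = {"google", "sphinx"}

print(pick_method("tesseract,sphinx", PDF_METHODS, "pdftotext"))
print(pick_method("tesseract,sphinx", AUDIO_METHODS, "google"))
```

One drawback of this form is that a typo in a method name is silently ignored, because no single parser can tell whether an unknown name is meant for another file type.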
I also like the other proposed ideas, but I'm not sure how they would work with the CLI. The idea of getting back extensions would be great!
Below are a couple of ideas for the CLI. There are probably others (please share if you have ideas!), but I think I have a slight preference for the configuration file approach. It gives users the option to pass lots of kwargs to textract.process at the same time. Then, instead of overloading the method kwarg to textract.process, we can use a configuration object to override defaults.
Thoughts and feedback welcome! This is definitely a major change to the UI and would warrant a major version bump to 2.0.0 so I want to make sure we get this right.
hyphenated command line arg
for f in directory/*; do
    textract --method-pdf tesseract --method-audio sphinx "$f"
done
colon-ized command line value
for f in directory/*; do
    textract --method pdf:tesseract --method audio:sphinx "$f"
done
json command line value
for f in directory/*; do
    textract --method '{"pdf":"tesseract","audio":"sphinx"}' "$f"
done
conf file
for f in directory/*; do
    textract --conf textract.conf "$f"
done
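For comparison, the colon-ized variant is fairly easy to parse with argparse. The sketch below is only an illustration of that option, not textract's actual CLI code; `parse_method_pair` and `build_parser` are hypothetical names, and the repeated `--method FILETYPE:METHOD` flag is the proposal from this thread rather than an existing option.

```python
import argparse


def parse_method_pair(value):
    """Split 'pdf:tesseract' into ('pdf', 'tesseract'), rejecting bad input."""
    filetype, sep, method = value.partition(":")
    if not sep or not filetype or not method:
        raise argparse.ArgumentTypeError(
            "expected FILETYPE:METHOD, got %r" % value
        )
    return filetype, method


def build_parser():
    parser = argparse.ArgumentParser(prog="textract")
    # action="append" collects every occurrence of --method into a list.
    parser.add_argument("--method", action="append", type=parse_method_pair,
                        default=[], metavar="FILETYPE:METHOD")
    parser.add_argument("filename")
    return parser


args = build_parser().parse_args(
    ["--method", "pdf:tesseract", "--method", "audio:sphinx", "scan.pdf"]
)
methods = dict(args.method)
print(methods)
```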
I like the configuration file option better than having to form valid json! The other two options of colons or hyphens are both good but I think the configuration file will likely be more future proof.
Thanks for the input @barrust
INI format? YAML?
I think I have a small preference for YAML but I welcome arguments from others.
If I have time, I may try to mock this up on my flight back to Chicago on Friday. Sounds fun :)
I'm leaning toward INI format with this so people can set it in their project's setup.cfg. I also think we could probably address #96 at the same time which would be very nice.
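As a rough idea of how the INI route might work, the standard library's configparser can read a section out of a setup.cfg-style file. The `[textract]` section name, the `method.<filetype>` key convention, and the `load_methods` helper are all assumptions made up for this sketch; nothing here is settled.

```python
import configparser

# Example contents of a setup.cfg or textract.conf (hypothetical layout).
EXAMPLE_CFG = """
[textract]
method.pdf = tesseract
method.audio = sphinx
"""


def load_methods(text, section="textract"):
    """Return {filetype: method} from the 'method.*' keys of the section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(section):
        return {}
    prefix = "method."
    return {
        key[len(prefix):]: value
        for key, value in parser.items(section)
        if key.startswith(prefix)
    }


print(load_methods(EXAMPLE_CFG))
```

Other kwargs could live in the same section under their own keys, which is what makes the configuration file more future-proof than overloading a single flag.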
It's been a long time, but this works for now if anyone's interested and checking this out:
from os.path import splitext
from textract import process

# Map file extensions to the extraction method to use for them.
switcher = {
    "pdf": "pdfminer",
    "mp3": "SpeechRecognition"
}
filenames = ["hoho.txt", "asdf.pdf", "example.mp3"]
for filename in filenames:
    ext = splitext(filename)[1][1:]  # extension without the leading dot
    method = switcher.get(ext, "")  # empty string when no method is configured
    text = process(filename, method=method)