textract
Support for streams
I'd like to propose a feature that removes the reliance on file extensions and gives users much greater flexibility by accepting streams as input to textract.
Sounds like a really interesting idea. Would you like to propose the command line interface for what that could look like?
My main concern is deciding how we route the inbound content to the appropriate parser. FWIW, I recently took a stab at using mimetypes to detect which parser we should use (see #89), which had pretty :poop:-y results.
My approach to the poor results is to run through all possible extensions for the given mimetype until one sticks. It's a bit crude, but it seems to work with the couple of files that I've tested.
I've created pull request #99 and would love some feedback. Thanks!
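For readers following along, here is a minimal sketch of the "try every extension for the mimetype until one sticks" idea. It is not the code from the pull request. `mimetypes.guess_all_extensions` is from the Python standard library; `first_extension_that_parses` and `parse_with_extension` are hypothetical names, and the parser is passed in as a callable so the sketch assumes nothing about textract's internals.

```python
import mimetypes


def first_extension_that_parses(data, mimetype, parse_with_extension):
    """Try each extension registered for `mimetype` until one parses `data`.

    `parse_with_extension(data, ext)` is a hypothetical callable that returns
    the extracted text, or raises an exception if parsing fails.
    """
    candidates = mimetypes.guess_all_extensions(mimetype)
    errors = {}
    for ext in candidates:
        try:
            return ext, parse_with_extension(data, ext)
        except Exception as exc:
            # Crude on purpose: any failure just means "try the next one".
            errors[ext] = exc
    raise ValueError(
        f"no extension worked for {mimetype!r}; tried {candidates}"
    )
```

Catching every exception keeps the loop simple, but it also hides real parser bugs, so the collected `errors` would probably be worth logging in anything beyond a prototype.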
Hello! Is this idea still being pursued? I have a use case where this would be very useful :}~
@frbapolkosnik @deanmalmgren
I would also like to see this happen 👍
The error printed to the console says the input needs to be a String, Bytes, ... but that is a generic message, so it looks like the underlying tools support Bytes/Streams.
I was hoping that, e.g.,
process(file.Read(),extension="txt")
or something similar would work, but I see there are also requests to auto-detect the extension.
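Until something like that is supported directly, one possible workaround is to copy the stream into a temporary file that carries the desired extension and hand the path to a path-based parser. This is only a sketch: `process_stream` is a hypothetical helper, not part of textract, and `process` stands for any callable that takes a file path (presumably `textract.process`).

```python
import os
import tempfile


def process_stream(stream, extension, process):
    """Copy `stream` to a temp file named with `extension`, then parse it.

    `process` is any callable that accepts a file path and returns the
    extracted text (assumption: something like `textract.process`).
    """
    suffix = "." + extension.lstrip(".")
    # delete=False so the file can be reopened by name on every platform.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        path = tmp.name
    try:
        return process(path)
    finally:
        # Clean up the temporary file whether or not parsing succeeded.
        os.remove(path)
```

With that helper, the call hoped for above would look roughly like `process_stream(file, "txt", textract.process)`, where `file` is opened in binary mode.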