Support for streams

Open apolkosnik-old opened this issue 9 years ago • 6 comments

I'd like to propose a feature that removes the reliance on file extensions and gives users much greater flexibility by accepting streams as input to textract.

apolkosnik-old, Nov 05 '15 15:11

Sounds like a really interesting idea. Would you like to propose the command line interface for what that could look like?

My main concern is deciding how we route the inbound content to the appropriate parser. FWIW, I recently took a stab at using mimetypes to detect which parser we should use (see #89), which had pretty :poop:-y results.

deanmalmgren, Nov 05 '15 16:11

My approach to the poor results is to run through all possible extensions for the given mimetype until one sticks. It's a bit crude, but it seems to work with the couple of files that I've tested.
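For readers following along, here is a minimal sketch of the "try every extension for the mimetype until one sticks" idea. It is not the code from the pull request. The function name `parse_by_mimetype` and the `parse_with_extension` callback are hypothetical; only `mimetypes.guess_all_extensions` is a real standard-library call. The callback stands in for whatever parser dispatch textract would do for a given extension.

```python
import mimetypes


def parse_by_mimetype(mime_type, parse_with_extension):
    """Try each extension registered for mime_type until one parser succeeds.

    parse_with_extension is a hypothetical callback: it receives an
    extension such as ".txt" and either returns the extracted text or
    raises an exception if that parser cannot handle the content.
    """
    errors = {}
    # guess_all_extensions returns every extension known for the type,
    # e.g. several entries for "text/plain".
    for ext in mimetypes.guess_all_extensions(mime_type):
        try:
            return ext, parse_with_extension(ext)
        except Exception as exc:  # crude on purpose: any failure means "try the next one"
            errors[ext] = exc
    raise ValueError(
        f"no parser accepted {mime_type!r}; tried {sorted(errors)}"
    )
```

The broad `except` mirrors the "a bit crude" nature of the approach: a parser that fails for unrelated reasons is treated the same as a wrong guess, so the collected errors are worth logging in real use.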

apolkosnik-old, Nov 06 '15 07:11

I've created pull request #99 and I'd love some feedback. Thanks!

apolkosnik-old, Nov 09 '15 20:11

Hello! Is this idea still being pursued? I have a use case where this would be very useful :}~

@frbapolkosnik @deanmalmgren

josepablog, Oct 24 '16 23:10

I would also like to see this happen 👍

kennell, Jul 31 '17 14:07

The console output says the input needs to be a String, Bytes, ... but this is a generic message, so the underlying tools support Bytes/Streams. I was hoping that e.g. process(file.Read(),extension="txt") or whatever would work, but I see there are also requests to auto-detect the extension.
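Until something like that exists, one possible workaround is to spill the stream into a temporary file that carries the desired extension and hand that path to textract. The sketch below assumes textract is installed and relies only on `textract.process(path)` taking a filename; the helper name `process_stream` and its `process` parameter are hypothetical and not part of textract.

```python
import os
import tempfile


def process_stream(stream, extension, process=None):
    """Extract text from a binary stream by going through a temp file.

    `process` defaults to textract.process (assumed to be installed);
    it can be swapped for any callable that takes a file path.
    """
    if process is None:
        import textract  # imported lazily so the helper loads without it
        process = textract.process

    suffix = "." + extension.lstrip(".")
    # delete=False so the file can be reopened by name on every platform.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        path = tmp.name
    try:
        return process(path)
    finally:
        os.remove(path)  # always clean up the temporary copy
```

Usage would look like `process_stream(open("report.bin", "rb"), "pdf")`. It still needs the caller to know the extension, so it does not address the auto-detection requests.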

filipopo, Mar 14 '20 17:03