textract Extract information from bytes

I have a PDF that I have downloaded, so it is only in memory and not saved as a file yet. How can I use textract to extract the text without actually saving the file?

Aug 25 '19 21:08 asciidiego

What do you mean by "downloaded, but not saved as a file yet"?

Textract requires that you specify the path to the PDF file. So far I have only parsed files that have been saved locally. You might try some of the ideas here, but I don't completely understand what you're trying to do.

Aug 27 '19 08:08 jpweytjens

I get the PDFs from an HTTP response. With the body available as bytes, I should be able to extract the text from the bytes alone. I don't think it should be necessary to save the PDF as a file, parse it to extract the text, and then delete the created file, when the data was already in memory as a Python variable.

Aug 27 '19 09:08 asciidiego

Currently, textract does not support streams. See also #85, #97 and #99. Perhaps this might be able to help you while we work on support for streams.

Aug 27 '19 10:08 jpweytjens

Any progress on byte stream support ( file.read() ), or can you suggest another way around this?

Aug 29 '20 08:08 multinucliated

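import tempfile
import textract

# f is assumed to be an open binary file-like object holding the PDF bytes
with tempfile.NamedTemporaryFile(delete=True) as temp:
    temp.write(f.read())
    temp.flush()  # make sure the bytes are written out before textract opens the path
    context = textract.process(temp.name, encoding='utf-8', extension=".pdf")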
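For anyone who has raw bytes rather than a file object, here is a minimal sketch of the same temporary-file idea wrapped in a helper. The name process_bytes and its handler parameter are hypothetical, not part of textract; the handler is any callable that takes a file path, so textract.process could be passed in. It uses tempfile.mkstemp with manual cleanup instead of NamedTemporaryFile, on the assumption that the file may need to be reopened by name on platforms where an open temporary file cannot be opened a second time.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def process_bytes(data: bytes, handler: Callable[[str], T], suffix: str = ".pdf") -> T:
    """Write data to a temporary file, call handler(path), then delete the file.

    process_bytes and handler are illustrative names, not textract APIs.
    """
    # mkstemp returns an open OS-level file descriptor and the file's path.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Close the file before handing the path over, so the handler
        # can open it by name without sharing conflicts.
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return handler(path)
    finally:
        # Always remove the temporary file, even if the handler raises.
        os.remove(path)


# Possible usage, assuming textract is installed and body holds the PDF bytes:
#   import textract
#   text = process_bytes(body, textract.process)
```

Note that this still touches the filesystem briefly (the system temp directory); it only avoids keeping the file around afterwards.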

Jul 06 '21 03:07 shzy2012

That's the solution. Works like a charm, and works in the cloud in a stateless function without any filesystem access! Thanks @shzy2012! @jpweytjens: Maybe put this workaround in the docs while streams are not yet supported, as it's really useful for cloud-based usage. Thanks

Apr 08 '23 17:04 uxtt2000
