textract
Extract information from bytes
I have a PDF that I have downloaded, so it is not saved as a file yet. How can I use textract to extract the text without actually saving the file?
What do you mean by "downloaded, but not saved as a file yet"?
Textract requires that you specify the path to the PDF file. So far I have only parsed files that have been saved locally. You might try some of the ideas here, but I don't completely understand what you're trying to do.
I get the PDFs from an HTTP response, so I have the body as bytes. I should be able to extract the text from those bytes alone. It seems unnecessary to save the PDF to a file, parse it, and then delete the file when the content is already in memory as a Python variable.
Currently, textract does not support streams. See also #85, #97 and #99. Perhaps this might be able to help you while we work on support for streams.
Any progress on byte-stream support (file.read())? Or can you suggest another way around this?
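    import tempfile
    import textract

    # f is an open file-like object (anything with read()) holding the PDF bytes
    with tempfile.NamedTemporaryFile(delete=True) as temp:
        temp.write(f.read())
        temp.flush()  # make sure the bytes are on disk before textract opens the file
        context = textract.process(temp.name, encoding='utf-8', extension=".pdf")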
That's the solution. Works like a charm, and it works in the cloud in a stateless function without any filesystem access! Thanks @shzy2012! @jpweytjens: Maybe put this workaround in the docs while streams are not yet supported, as it's really useful for cloud-based usage. Thanks
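For anyone landing here later, the temporary-file workaround can be wrapped in a small helper. This is only a sketch, not part of textract: `process_bytes` and its `parser` argument are hypothetical names, and the commented usage assumes the bytes come from a `requests`-style `response.content`. It still needs a writable temp directory, since the bytes are written to disk briefly. The file is closed before parsing (hence `delete=False` plus a manual cleanup), which should also avoid the problem of reopening an open temporary file on Windows.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def process_bytes(data: bytes, extension: str, parser: Callable[[str], T]) -> T:
    """Write `data` to a temporary file, call `parser` on its path, then delete it."""
    # The suffix gives the temp file a real extension (e.g. ".pdf"),
    # so a parser that picks its backend from the file name can do so.
    with tempfile.NamedTemporaryFile(suffix=extension, delete=False) as temp:
        temp.write(data)
        path = temp.name
    try:
        # The file is closed here, so all bytes have been flushed to disk.
        return parser(path)
    finally:
        os.remove(path)  # always clean up, even if the parser raises


# Hypothetical usage with textract (assumes `response` is an HTTP response
# object whose `content` attribute holds the PDF bytes):
#
#   import textract
#   text = process_bytes(response.content, ".pdf", textract.process)
```

Passing the parser in as a callable keeps the helper independent of textract, so the same function can be tried with any library that only accepts file paths.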