tika-python
How to deal with large pdfs that are all images?
I'm trying to extract text from a large PDF with the code below (the file comes from an Azure blob; it is 7.3 MB with 140 pages, all of them images), and the request always hits the timeout.
import os
os.environ['TIKA_SERVER_ENDPOINT'] = 'http://0.0.0.0:9998/'

from tika import parser  # set the endpoint before importing the parser

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFextractInlineImages": "true",  # run OCR against inline images
}
data = parser.from_buffer(
    buffer.readall(),  # buffer is the Azure blob download stream
    xmlContent=True,
    requestOptions={
        "headers": headers,
        "timeout": 3600,
    },
)
Is there a header I'm missing for handling large files?
I'm running tika-server in a Docker container started with this command:
docker run -d -p 9998:9998 apache/tika:1.28.2-full
Thanks for your time!
Can you please share and confirm the hardware allotted to the Docker daemon? Docker applies default resource limits; you can raise them by reconfiguring Docker and restarting the Docker service.
Alternatively, I would suggest running tika-server natively on your machine, so that it has the full hardware to work with.
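For example, a sketch of the same `docker run` command with explicit resource limits (the `--memory` and `--cpus` values below are illustrative assumptions, not recommendations; size them to your host):

```shell
# Start tika-server with a 4 GB memory cap and 2 CPUs instead of the defaults
docker run -d -p 9998:9998 --memory=4g --cpus=2 apache/tika:1.28.2-full

# Watch the container's actual memory/CPU usage while it runs OCR
docker stats
```

If `docker stats` shows the container pinned at its memory limit during OCR, raising the limit is the first thing to try.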
Yeah, I would agree with @divyaksh-shukla. I think this is an issue with the memory available to the Docker container running the Tika server. Going to close here; if you find more detail, feel free to add it in the comments.
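Besides raising resources, one workaround is to avoid a single hour-long OCR request altogether: split the PDF into small page batches and OCR each batch separately, so each Tika request stays short. The sketch below assumes `pypdf` is installed alongside tika-python and a Tika server is reachable on the default endpoint; the 10-page batch size is an arbitrary starting point, not a tuned value.

```python
import io


def page_batches(num_pages, batch_size=10):
    """Yield (start, end) page ranges covering num_pages pages."""
    for start in range(0, num_pages, batch_size):
        yield start, min(start + batch_size, num_pages)


def ocr_pdf_in_batches(pdf_bytes, batch_size=10):
    """OCR a large, image-only PDF a few pages at a time via Tika."""
    # Third-party imports kept local so the helpers above work without them.
    from pypdf import PdfReader, PdfWriter
    from tika import parser

    reader = PdfReader(io.BytesIO(pdf_bytes))
    headers = {
        "X-Tika-OCRLanguage": "eng+nor",
        "X-Tika-PDFextractInlineImages": "true",  # OCR inline images
    }
    texts = []
    for start, end in page_batches(len(reader.pages), batch_size):
        # Write just this batch of pages into an in-memory PDF.
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        chunk = io.BytesIO()
        writer.write(chunk)
        # Each request now covers only batch_size pages, so a much
        # shorter per-request timeout is enough.
        result = parser.from_buffer(
            chunk.getvalue(),
            requestOptions={"headers": headers, "timeout": 300},
        )
        texts.append(result.get("content") or "")
    return "\n".join(texts)
```

This also fails more gracefully: if one batch times out, you lose a few pages rather than the whole 140-page run, and the batches could be retried or parallelized independently.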