
How to deal with large pdfs that are all images?

Open · mfernaal opened this issue 2 years ago · 1 comment

I'm trying to extract text from a large PDF using the code below (the file comes from an Azure blob; it's 7.3 MB and has 140 pages, all of which are images), and the request always hits the timeout.

import os

from tika import parser

# Point tika-python at the already-running Tika server
os.environ['TIKA_SERVER_ENDPOINT'] = 'http://0.0.0.0:9998/'

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFextractInlineImages": "true",  # run OCR against inline images
}

data = parser.from_buffer(
    buffer.readall(),  # buffer is the Azure blob download stream
    xmlContent=True,
    requestOptions={
        "headers": headers,
        "timeout": 3600,
    },
)
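To rule out the Python client, the same request can be sent straight to the server with curl (a sketch; `large.pdf` is a placeholder for your file, and the endpoint assumes the server from the docker command below):

```shell
# PUT the PDF directly to the Tika server's /tika endpoint,
# with the same OCR headers and a long client-side timeout.
curl -X PUT --max-time 3600 \
  -H "X-Tika-OCRLanguage: eng+nor" \
  -H "X-Tika-PDFextractInlineImages: true" \
  -T large.pdf \
  http://localhost:9998/tika
```

If this also times out, the bottleneck is on the server side rather than in tika-python.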

Is there a header I'm missing for handling large files?

I'm running tika-server in a Docker container started with this command:

docker run -d -p 9998:9998 apache/tika:1.28.2-full

Thanks for your time!

mfernaal avatar May 25 '22 11:05 mfernaal

Can you please share and confirm the hardware allotted to the Docker container? Docker applies default resource limits to containers; you can raise them in Docker's configuration and restart the Docker service.
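As a sketch of what checking and raising the limits looks like (the 4 GB / 2 CPU values are assumptions to be tuned for your workload):

```shell
# Check what the container is currently using
docker stats --no-stream

# Recreate the container with explicit memory and CPU limits
docker run -d -p 9998:9998 --memory=4g --cpus=2 apache/tika:1.28.2-full
```

OCR of 140 image pages is memory- and CPU-heavy, so a constrained container can easily stall past the client timeout.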

Alternatively, I would suggest running tika-server natively on your machine, so that tika-server has the full hardware to work with.
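Assuming a local Java runtime is installed, running the matching server jar natively looks roughly like this (the jar filename is a placeholder for wherever you downloaded the 1.28.2 release):

```shell
# Start tika-server 1.28.2 directly on the host, listening on port 9998
java -jar tika-server-1.28.2.jar -h 0.0.0.0 -p 9998
```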

divyaksh-shukla avatar Jul 14 '22 14:07 divyaksh-shukla

Yeah, I agree with @divyaksh-shukla. I think this is an issue with the memory available to the Docker container running the Tika server. I'm going to close this; if you find more details, feel free to add them in the comments.

chrismattmann avatar Dec 31 '22 21:12 chrismattmann