files_fulltextsearch_tesseract avoid OCR on non-image PDF files

Open jampy opened this issue 4 years ago • 1 comments

It appears that this app runs OCR even when the PDF is not a scanned document.

For example, on a fresh Nextcloud installation I see php occ fulltextsearch:index taking a long time to process Nextcloud Manual.pdf (a 99-page PDF that comes with Nextcloud), with tesseract working hard to scan it. That work is simply wasted.

I would suggest checking whether the PDF contains any text nodes and skipping Tesseract in that case.

Aug 19 '20 13:08 jampy

This still seems to be an issue. I'm on NC23.0.3 with fulltextsearch tesseract 22.0.0. Most of my PDF files contain a text layer, yet every PDF seems to be processed by tesseract, which is a waste of resources. Is there an easy way to detect whether a PDF contains a text layer and skip OCR for those files?

Apr 02 '22 11:04 XueSheng-GIT
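
One way the suggested check could look, as a rough sketch and not what the app actually does: pull the embedded text out of the first few pages with pdftotext (from poppler-utils, assumed to be installed) and only fall back to OCR when almost nothing comes back. The function names, the 50-character threshold and the 5-page limit are all made up for illustration and would need tuning; the sketch is in Python to keep it short, even though the app itself is PHP.

```python
import shutil
import subprocess


def looks_like_text_pdf(extracted_text: str, min_chars: int = 50) -> bool:
    """Return True if the extracted text has at least min_chars
    non-whitespace characters (the threshold is an arbitrary assumption)."""
    visible = "".join(extracted_text.split())  # drop spaces, newlines, form feeds
    return len(visible) >= min_chars


def extract_embedded_text(pdf_path: str, max_pages: int = 5) -> str:
    """Return the embedded text of the first max_pages pages, without OCR.

    Assumes the pdftotext CLI from poppler-utils is available on PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        # -q: suppress messages, -l: last page to convert, "-": write to stdout
        ["pdftotext", "-q", "-l", str(max_pages), pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def needs_ocr(pdf_path: str) -> bool:
    """Hypothetical helper: True when the PDF seems to have no usable text layer."""
    return not looks_like_text_pdf(extract_embedded_text(pdf_path))
```

With something like this, an indexer could call needs_ocr(path) first and hand the file to Tesseract only when it returns True. Mixed PDFs (some scanned pages, some text pages) would probably need a per-page check instead of sampling the first few pages.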
