files_fulltextsearch_tesseract
avoid OCR on non-image PDF files
It appears that this app runs OCR on a PDF even when the file is not a scanned document.
For example, on a fresh Nextcloud installation, php occ fulltextsearch:index
takes a long time to process Nextcloud Manual.pdf
(a 99-page PDF that ships with Nextcloud), and tesseract
works hard scanning it. That work is simply unnecessary.
I would suggest checking whether the PDF contains any text nodes and skipping Tesseract in that case.
This still seems to be an issue. I'm on NC23.0.3 with fulltextsearch tesseract 22.0.0. Most of my PDF files do contain a text layer, but all PDFs seem to be processed by tesseract, which seems to be a waste of resources. Is there an easy way to detect whether a PDF contains a text layer and just skip those files?
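One possible way to detect a text layer, sketched in Python for illustration only (this is not how the app currently works, and the app itself is written in PHP): extract the text of the first few pages with the pdftotext tool from poppler-utils and treat the PDF as text-based if enough alphanumeric characters come back. The function names, the 3-page sample, and the 50-character threshold are all assumptions made for this sketch, and it assumes pdftotext is installed on the server.

```python
import shutil
import subprocess


def looks_like_text_layer(extracted: str, min_chars: int = 50) -> bool:
    """Return True if the extracted text holds at least min_chars alphanumerics.

    The threshold is an arbitrary assumption; it only guards against PDFs
    whose "text" is nothing but whitespace or form feeds.
    """
    meaningful = sum(1 for ch in extracted if ch.isalnum())
    return meaningful >= min_chars


def pdf_has_text_layer(path: str, max_pages: int = 3, min_chars: int = 50) -> bool:
    """Hypothetical helper: sample the first pages with pdftotext.

    Returns False (meaning "fall back to OCR") when pdftotext is missing
    or fails, so a broken setup never silently skips indexing.
    """
    if shutil.which("pdftotext") is None:
        return False
    result = subprocess.run(
        # -f / -l select the first and last page; "-" writes text to stdout.
        ["pdftotext", "-f", "1", "-l", str(max_pages), path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    if result.returncode != 0:
        return False
    return looks_like_text_layer(result.stdout, min_chars)
```

With a check like this, the indexer could skip Tesseract whenever pdf_has_text_layer(path) returns True and index the extracted text directly. Sampling only the first pages keeps the check cheap, but it would misclassify mixed documents (for example a typed cover page followed by scanned pages), so a per-page check may be the safer design.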