files_fulltextsearch_tesseract Tesseract OCR on scanned PDFs

Hello,

In Nextcloud it is not possible to index the content of scanned PDF documents. The reason is the PDF file format itself: when you scan a document and save it as a PDF, there is no "real" text layer. So Nextcloud's "Full Text Search" plugin cannot index the content.

Tesseract can only recognize text in images, hence my question:

Can you add some functions and libraries to convert PDFs to images? The process should look like this:

1. The user uploads a PDF.
2. This plugin starts working.
3. The PDF is converted to an image.
4. Tesseract analyses the content.
5. Tesseract saves the image to a new PDF (with the same name).

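// Write the image plus a searchable text layer to img.pdf
(new TesseractOCR('img.png'))
    ->quiet()
    ->configFile('pdf')
    ->setOutputFile('img.pdf')
    ->run();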

This code snippet saves the image to a PDF and adds a searchable text layer to it.

After this process, users can search the content of all PDFs.
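To make the proposed process more concrete, here is a rough sketch of how the conversion and OCR steps might look in PHP. It assumes the php-imagick extension (with Ghostscript available for PDF rendering) and the thiagoalessio/tesseract_ocr package loaded through Composer's autoloader. The function name makeSearchablePdf and all paths are hypothetical, only the first page is handled, and this is not how the app itself is implemented.

```php
<?php
// Sketch only: makeSearchablePdf() and every path used here are hypothetical.
// Assumes ext-imagick (with Ghostscript) and thiagoalessio/tesseract_ocr,
// with Composer's autoloader already loaded by the surrounding application.
use thiagoalessio\TesseractOCR\TesseractOCR;

function makeSearchablePdf(string $scannedPdf, string $outputPdf): void
{
    // Convert the PDF to an image: render the first page as a PNG.
    $image = new Imagick();
    $image->setResolution(300, 300);          // set before reading the PDF
    $image->readImage($scannedPdf . '[0]');   // "[0]" selects the first page only
    $image->setImageFormat('png');
    $pngPath = sys_get_temp_dir() . '/' . uniqid('ocr_', true) . '.png';
    $image->writeImage($pngPath);
    $image->clear();

    // Let Tesseract analyse the image and write a new PDF that
    // contains the image plus a searchable text layer.
    (new TesseractOCR($pngPath))
        ->quiet()
        ->configFile('pdf')
        ->setOutputFile($outputPdf)
        ->run();

    // Remove the temporary image again.
    unlink($pngPath);
}
```

A call such as makeSearchablePdf('/tmp/scan.pdf', '/tmp/scan-searchable.pdf') would then produce the searchable copy. A multi-page document would need a loop over all pages, which is left out here.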

Jun 02 '21 11:06 techducks

Scanned PDFs should work if php-imagick and ghostscript are available. At least it seems to work on my instance (related pull request is most probably https://github.com/nextcloud/files_fulltextsearch_tesseract/pull/8). Could this issue be closed?
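For anyone who wants to check these prerequisites on their own server, a minimal sketch could look like the following. The helper name ocrPrerequisites is hypothetical, and it assumes a Unix-like system where the Ghostscript binary is called gs and PHP is allowed to use exec(). It only checks that the pieces are present. It does not prove that ImageMagick is actually permitted to read PDFs on that system.

```php
<?php
// Sketch only: ocrPrerequisites() is a hypothetical helper, not part of the app.
// Assumes a Unix-like system with the Ghostscript binary named "gs".
function ocrPrerequisites(): array
{
    // Is the php-imagick extension loaded?
    $hasImagick = extension_loaded('imagick');

    // Does ImageMagick list the PDF format at all?
    $knowsPdf = $hasImagick
        && in_array('PDF', Imagick::queryFormats('PDF'), true);

    // Can the Ghostscript binary be started? Exit code 0 means yes.
    $output = [];
    $exitCode = 1;
    exec('gs --version 2>&1', $output, $exitCode);

    return [
        'imagick'     => $hasImagick,
        'imagick_pdf' => $knowsPdf,
        'ghostscript' => $exitCode === 0,
    ];
}

// Print one line per prerequisite.
foreach (ocrPrerequisites() as $name => $ok) {
    echo str_pad($name, 12), $ok ? 'ok' : 'missing', PHP_EOL;
}
```

Run it with the same PHP binary and user that Nextcloud uses, since the CLI and the web server can load different extensions.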

Apr 02 '22 11:04 XueSheng-GIT

So on your instance it is possible to upload a regular PDF (without any text layer), which is then searchable? I currently cannot reproduce this because I have no Nextcloud server.

Apr 02 '22 12:04 techducks

So on your instance it is possible to upload a regular PDF (without any text layer), which is then searchable?

Yes, exactly.

Apr 02 '22 13:04 XueSheng-GIT

So on your instance it is possible to upload a regular PDF (without any text layer), which is then searchable?

Yes, exactly.

Maybe this should be confirmed by others before closing the issue?

Apr 02 '22 15:04 techducks

Yes, I can confirm it works, but as a requirement (at least on my Ubuntu distro) you also need to install php-imagick-all-dev (and not only php-imagick); otherwise the PDF search feature doesn't work.

Sep 20 '22 07:09 ansani

A regular PDF (without any text layer) is not indexed; this does not work.

Oct 09 '22 02:10 luyuanerp

Probably this: https://github.com/nextcloud/files_fulltextsearch_tesseract/issues/16#issue-459995150 ?

Dec 09 '22 00:12 FadeFx
