files_fulltextsearch_tesseract

Tesseract OCR on scanned PDFs

Open techducks opened this issue 3 years ago • 7 comments

Hello,

In Nextcloud it is not possible to index the content of scanned PDF documents. The reason is the PDF file format itself: when you scan a document and save it as a PDF, there is no "real" text layer. So Nextcloud's "Full Text Search" plugin cannot index the content.

Tesseract can only find text in images. Hence my question:

Can you add functions and libraries to convert a PDF to an image? The process should look like this:

  1. The user uploads a PDF.
  2. This plugin starts working.
  3. The PDF is converted to an image.
  4. Tesseract analyses the content.
  5. Tesseract saves the image to a new PDF (with the same name).

echo (new TesseractOCR('img.png'))
    ->quiet()
    ->run();

This code snippet will save the image to a PDF and add a searchable text layer to it.

After this process, users can search all PDFs.
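The process described above could be sketched at the command-line level roughly as follows. This is only an illustration under assumptions, not how the plugin works: it assumes the Ghostscript (`gs`) and `tesseract` binaries are on the PATH, the function names and the `page-%03d.png` naming are made up for the example, and Python is used only to keep the sketch short. It also produces one searchable PDF per page rather than a single file with the original name; merging the pages is left out.

```python
import subprocess
from pathlib import Path
from typing import List


def rasterize_command(pdf: Path, workdir: Path, dpi: int = 300) -> List[str]:
    """Build a Ghostscript call that renders every PDF page to a PNG file."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=png16m",                             # 24-bit colour PNG output
        f"-r{dpi}",                                    # render resolution in DPI
        f"-sOutputFile={workdir / 'page-%03d.png'}",   # one file per page
        str(pdf),
    ]


def ocr_command(image: Path, lang: str = "eng") -> List[str]:
    """Build a Tesseract call that writes a PDF with a text layer."""
    output_base = image.with_suffix("")  # tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_scanned_pdf(pdf: Path, workdir: Path, lang: str = "eng") -> List[Path]:
    """Rasterize the PDF, OCR each page, and return the per-page PDFs."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf, workdir), check=True)
    results = []
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_command(image, lang), check=True)
        results.append(image.with_suffix(".pdf"))
    return results
```

Calling `ocr_scanned_pdf(Path("scan.pdf"), Path("ocr-work"))` would leave `page-001.pdf`, `page-002.pdf`, and so on in `ocr-work`, each containing the page image plus the recognized text.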

techducks avatar Jun 02 '21 11:06 techducks

Scanned PDFs should work if php-imagick and ghostscript are available. At least it seems to work on my instance (related pull request is most probably https://github.com/nextcloud/files_fulltextsearch_tesseract/pull/8). Could this issue be closed?
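A quick way to check whether these prerequisites are present might look like the sketch below. It is an assumption-laden example, not part of the app: the helper names are hypothetical, it looks for the `gs` and `tesseract` binaries on the PATH, and it reads the module list from the command-line `php -m`, which may use a different configuration than the PHP that actually serves Nextcloud.

```python
import shutil
import subprocess
from typing import Dict


def has_php_module(php_m_output: str, module: str) -> bool:
    """Return True if `module` appears as its own line in `php -m` output."""
    wanted = module.lower()
    return any(line.strip().lower() == wanted
               for line in php_m_output.splitlines())


def check_ocr_prerequisites() -> Dict[str, bool]:
    """Report which of the tools mentioned above can be found."""
    status = {
        "ghostscript": shutil.which("gs") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        "php-imagick": False,
    }
    if shutil.which("php") is not None:
        # `php -m` prints one loaded module name per line.
        result = subprocess.run(["php", "-m"], capture_output=True,
                                text=True, check=False)
        status["php-imagick"] = has_php_module(result.stdout, "imagick")
    return status
```

Printing `check_ocr_prerequisites()` on the server gives a dictionary of `True`/`False` flags, which is enough to see at a glance what is missing.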

XueSheng-GIT avatar Apr 02 '22 11:04 XueSheng-GIT

So on your instance it is possible to upload a regular PDF (without any text layer) which is then searchable? Currently I cannot reproduce this because I have no Nextcloud server.

techducks avatar Apr 02 '22 12:04 techducks

So on your instance it is possible to upload a regular PDF (without any text layer) which is then searchable?

Yes, exactly.

XueSheng-GIT avatar Apr 02 '22 13:04 XueSheng-GIT

So on your instance it is possible to upload a regular PDF (without any text layer) which is then searchable?

Yes, exactly.

Maybe this should be confirmed by others before closing the issue?

techducks avatar Apr 02 '22 15:04 techducks

Yes, I can confirm it works. However, as a requirement (at least on my Ubuntu distro), you also need to install php-imagick-all-dev (not only php-imagick); otherwise, the PDF search feature doesn't work.

ansani avatar Sep 20 '22 07:09 ansani

A regular PDF (without any text layer) is not indexed; it does not work.

luyuanerp avatar Oct 09 '22 02:10 luyuanerp

Probably this: https://github.com/nextcloud/files_fulltextsearch_tesseract/issues/16#issue-459995150 ?

FadeFx avatar Dec 09 '22 00:12 FadeFx