files_fulltextsearch_tesseract
Tesseract OCR on scanned PDFs
Hello,
In Nextcloud it is not possible to index the content of scanned PDF documents. The reason is the PDF format itself: when you scan a document and save it as a PDF, the pages are stored as images and there is no "real text layer". So Nextcloud's "Full Text Search" plugin cannot index the content.
Tesseract can only find text in images. Hence my question:
Can you add the functions and libraries needed to convert a PDF to an image? The process should look like this:
- The user uploads a PDF
- This plugin starts working
- The PDF is converted to an image
- Tesseract analyzes the content
- Tesseract saves the image to a new PDF (with the same name)
echo (new TesseractOCR('img.png'))
->quiet()
->run();
This code snippet should save the image to a PDF and add a searchable text layer to it.
After this process, users can search the content of all PDFs.
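The steps listed above can also be sketched outside the plugin. The following is only a rough illustration, written in Python rather than the plugin's PHP, and it assumes the `gs` (Ghostscript) and `tesseract` command-line tools are installed. The function names `build_ocr_commands` and `ocr_first_page` are hypothetical, only the first page is handled, and the result is written to a separate `_ocr.pdf` file instead of overwriting the upload.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(pdf_path, work_dir, dpi=300):
    """Return the two command lines (argv lists) for OCR of page 1.

    Step 1 renders the first PDF page to a PNG with Ghostscript.
    Step 2 asks Tesseract to write a PDF that carries a text layer.
    """
    pdf = Path(pdf_path)
    work = Path(work_dir)
    png = work / (pdf.stem + ".png")
    out_base = work / (pdf.stem + "_ocr")  # tesseract appends ".pdf" itself

    render = [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        "-dFirstPage=1", "-dLastPage=1",
        f"-sOutputFile={png}", str(pdf),
    ]
    ocr = ["tesseract", str(png), str(out_base), "pdf"]
    return render, ocr


def ocr_first_page(pdf_path, work_dir):
    """Run both steps; needs the gs and tesseract binaries on PATH."""
    render, ocr = build_ocr_commands(pdf_path, work_dir)
    subprocess.run(render, check=True)
    subprocess.run(ocr, check=True)
    return Path(work_dir) / (Path(pdf_path).stem + "_ocr.pdf")
```

Keeping the command construction separate from the execution makes it easy to inspect what would be run before any file is touched. A multi-page document would need a loop over pages or a different rendering strategy, which this sketch leaves out.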
Scanned PDFs should work if php-imagick and ghostscript are available. At least it seems to work on my instance (related pull request is most probably https://github.com/nextcloud/files_fulltextsearch_tesseract/pull/8). Could this issue be closed?
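One rough way to check both prerequisites on a server is sketched below. It is an illustration in Python, not part of the plugin; the helper names `has_module` and `check_requirements` are hypothetical. It assumes the Ghostscript binary is called `gs` and that `php -m` lists loaded extensions one per line. Note that the command-line PHP may load a different set of extensions than the PHP used by the web server, so a positive result here is only a hint.

```python
import shutil
import subprocess


def has_module(php_m_output, name="imagick"):
    """True if `name` appears as its own line in `php -m` output."""
    lines = [line.strip().lower() for line in php_m_output.splitlines()]
    return name.lower() in lines


def check_requirements():
    """Report which prerequisites look present on this machine."""
    report = {"gs": shutil.which("gs") is not None, "imagick": False}
    if shutil.which("php"):
        # `php -m` prints the extensions loaded by the CLI build of PHP
        result = subprocess.run(["php", "-m"], capture_output=True, text=True)
        report["imagick"] = has_module(result.stdout)
    return report
```

Calling `check_requirements()` returns a small dictionary with one boolean per prerequisite, which can be printed or logged before debugging the indexing itself.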
So on your instance it is possible to upload a regular PDF (without any text layer) which is then searchable? Currently I cannot reproduce this because I have no Nextcloud server.
So on your instance it is possible to upload a regular PDF (without any text layer) which is then searchable?
Yes, exactly.
Maybe this should be confirmed by others before closing the problem?
Yes, I can confirm it works, but as a requirement (at least on my Ubuntu distro) you also need to install php-imagick-all-dev (and not only php-imagick); otherwise the PDF search feature doesn't work.
A regular PDF (without any text layer) is not indexed; it does not work.
Probably this: https://github.com/nextcloud/files_fulltextsearch_tesseract/issues/16#issue-459995150 ?