[Question] How to do searchable PDF via tesserocr

Open PenthagonHacker opened this issue 4 years ago • 4 comments

Hello guys! I am completely new to tesseract and tesserocr. I need to make a PDF file with a text layer, a.k.a. a searchable PDF. I found in the tesseract documentation that there is such a thing as TessPDFRenderer. Is there any way I can use it via tesserocr and PyCharm? I looked through tesserocr.py and haven't found anything even remotely close to that.

Thank you in advance!

PenthagonHacker, Jul 19 '21 19:07

Probably better to use OCRmyPDF for this since it’s literally made for that use case.

Tesserocr can help you perform OCR on images, but it doesn’t come with extensive PDF modification utilities built in because that’s outside the scope of the library.
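For reference, here is a minimal sketch of the OCRmyPDF route through its Python API, assuming the input is already a PDF and that OCRmyPDF and Tesseract are installed. The helper names `searchable_pdf_path` and `make_searchable` and the `_ocr` suffix are made up for illustration; only `ocrmypdf.ocr` comes from the library.

```python
from pathlib import Path


def searchable_pdf_path(source: str) -> Path:
    """Return a sibling output path, e.g. scan.pdf -> scan_ocr.pdf.

    The "_ocr" suffix is an illustrative convention, not something
    OCRmyPDF requires.
    """
    src = Path(source)
    return src.with_name(src.stem + "_ocr.pdf")


def make_searchable(source: str, lang: str = "eng") -> Path:
    """Add a text layer to `source` and return the path of the new PDF."""
    # Imported lazily so the path helper above still works on machines
    # where OCRmyPDF is not installed.
    import ocrmypdf

    target = searchable_pdf_path(source)
    # ocrmypdf.ocr writes a new file; the input is left unchanged.
    ocrmypdf.ocr(source, target, language=[lang])
    return target
```

Calling `make_searchable("scan.pdf")` should leave a `scan_ocr.pdf` next to the input, provided the `eng` language data is available to Tesseract.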

ES-Alexander, Aug 09 '21 13:08

You can use the ProcessPage method, which (if I understand correctly) should output a searchable PDF if you set the tessedit_create_pdf variable to true. See ProcessPages as well.

sirfz, Aug 09 '21 17:08

Have you tried @sirfz's method or found a solution? I am interested in this too.

tritium01, Aug 17 '21 16:08

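import tesserocr

tessdata_path = "tessdata"  # directory containing the traineddata files
outbase = "my_first_pdf"    # output base name; the renderer writes my_first_pdf.pdf
image_filename = "5.png"    # input image
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    # Enable the PDF renderer so the output carries a text layer.
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPages(outbase, image_filename)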

After applying PR #277, this should work too:

import tesserocr
from PIL import Image

# tessdata_path, outbase and image_filename are the same as in the snippet above.
img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPage(outputbase=outbase,
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")
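If it helps, the ProcessPages variant can be wrapped so that a failed run is reported instead of passing silently. This is only a sketch under a few assumptions: the function names `expected_pdf` and `image_to_searchable_pdf` are hypothetical, the renderer is assumed to write `<outbase>.pdf`, and ProcessPages is assumed to return a truthy value on success, as the tesserocr docstring describes.

```python
from pathlib import Path


def expected_pdf(outbase: str) -> Path:
    """Path the PDF renderer is assumed to write: the base name plus ".pdf"."""
    return Path(outbase + ".pdf")


def image_to_searchable_pdf(image_filename: str,
                            outbase: str,
                            tessdata_path: str = "tessdata") -> Path:
    """OCR one image file into a searchable PDF and return the PDF path."""
    # Imported lazily so expected_pdf() can be used without tesserocr installed.
    import tesserocr

    with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
        api.SetVariable("tessedit_create_pdf", "true")
        ok = api.ProcessPages(outbase, image_filename)

    pdf = expected_pdf(outbase)
    # Fail loudly if Tesseract reported an error or no file was produced.
    if not ok or not pdf.exists():
        raise RuntimeError(f"Tesseract did not produce {pdf}")
    return pdf
```

With the values from the snippets above, `image_to_searchable_pdf("5.png", "my_first_pdf")` should return the path `my_first_pdf.pdf` on success.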

zdenop, Sep 15 '21 13:09