tesserocr [Question] How to do searchable PDF via tesserocr

Hello guys! I am completely new to tesseract and tesserocr. I need to make a PDF file with a text layer, a.k.a. a searchable PDF. I found in the tesseract documentation that there is such a thing as TessPDFRenderer. So my question is: is there any way I can use it via tesserocr and PyCharm? I looked through tesserocr.py and haven't found anything even remotely close to that.

Thank you in advance!

Jul 19 '21 19:07 PenthagonHacker

Probably better to use OCRmyPDF for this since it’s literally made for that use case.

Tesserocr can help you perform OCR on images, but it doesn’t come with extensive PDF modification utilities built in because that’s outside the scope of the library.
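For anyone who wants to try the OCRmyPDF route from Python, here is a minimal sketch. It assumes the `ocrmypdf` command-line tool is installed and on your PATH; the helper names `build_ocrmypdf_command` and `make_searchable_pdf` and the file names are made up for illustration and are not part of OCRmyPDF or tesserocr.

```python
import subprocess


def build_ocrmypdf_command(input_path, output_pdf, language="eng", image_dpi=None):
    """Build the argument list for an OCRmyPDF run (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", language]
    if image_dpi is not None:
        # Only relevant when the input is an image without DPI metadata.
        cmd += ["--image-dpi", str(image_dpi)]
    cmd += [input_path, output_pdf]
    return cmd


def make_searchable_pdf(input_path, output_pdf, **options):
    """Run OCRmyPDF and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocrmypdf_command(input_path, output_pdf, **options), check=True)
```

Calling `make_searchable_pdf("scan.pdf", "scan_searchable.pdf")` should then write a copy of the input with a text layer added. For a bare image input you may need to pass `image_dpi=300` or whatever resolution matches your scan.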

Aug 09 '21 13:08 ES-Alexander

You can use the ProcessPage method, which should be able (if I understand correctly) to output a searchable PDF if you set the tessedit_create_pdf variable to true. See ProcessPages as well.

Aug 09 '21 17:08 sirfz

Have you tried @sirfz's method or found a solution? I am interested in this too.

Aug 17 '21 16:08 tritium01

import tesserocr

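tessdata_path = "tessdata"   # directory containing the traineddata files
outbase = "my_first_pdf"     # output base name; the result is my_first_pdf.pdf
image_filename = "5.png"     # input image to OCR
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    # Ask tesseract to render a PDF with a text layer
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPages(outbase, image_filename)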

After applying PR #277, this should work too:

from PIL import Image

img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPage(outputbase=outbase,
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")

Sep 15 '21 13:09 zdenop