[Question] How to do searchable PDF via tesserocr
Hello guys! I am completely new to Tesseract and tesserocr. I need to create a PDF file with a text layer, a.k.a. a searchable PDF. I found in the Tesseract documentation that there is such a thing as TessPDFRenderer. So my question is: is there any way I can use it via tesserocr and PyCharm? I looked through tesserocr.py and haven't found anything even remotely close to that.
Thanks in advance!
Probably better to use OCRmyPDF for this since it’s literally made for that use case.
Tesserocr can help you perform OCR on images, but it doesn’t come with extensive PDF modification utilities built in because that’s outside the scope of the library.
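For reference, a minimal sketch of what the OCRmyPDF route could look like when driven from Python through its command-line tool. It assumes the ocrmypdf executable is installed and on PATH; the file names and the helper names build_ocrmypdf_command and make_searchable_pdf are hypothetical, made up for this example. Note that when the input is an image rather than a PDF, OCRmyPDF may also need to be told the image resolution.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_path, output_path, language="eng"):
    """Return the argv list for an OCRmyPDF run (no OCR is performed here)."""
    # "-l" selects the Tesseract language pack; "eng" is only an assumed default.
    return ["ocrmypdf", "-l", language, input_path, output_path]


def make_searchable_pdf(input_path, output_path, language="eng"):
    """Run OCRmyPDF on input_path and write the searchable PDF to output_path."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if OCRmyPDF exits with an error.
    subprocess.run(
        build_ocrmypdf_command(input_path, output_path, language),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    make_searchable_pdf("scanned.pdf", "scanned_searchable.pdf")
```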
You can use the ProcessPage method, which (if I understand correctly) should output a searchable PDF if you set the tessedit_create_pdf variable to true. See ProcessPages as well.
Have you tried @sirfz's method or found a solution? I am interested in this too.
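import tesserocr

tessdata_path = "tessdata"    # directory containing the traineddata files
outbase = "my_first_pdf"      # output base name; the renderer appends ".pdf"
image_filename = "5.png"      # input image to OCR

with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    # Ask Tesseract to write a PDF with a text layer
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPages(outbase, image_filename)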
After applying PR #277, this should work too:
from PIL import Image

# tessdata_path, outbase and image_filename are defined as in the snippet above
img = Image.open(image_filename)
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.ProcessPage(outputbase=outbase,
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title")
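To check that either snippet actually produced a file, something like the following could help. It is only a sketch using the standard library: it relies on Tesseract's PDF renderer appending ".pdf" to the output base name, and the helper names expected_pdf_path and looks_like_pdf are made up for this example. It only checks the PDF header bytes; it does not verify that the text layer is present.

```python
from pathlib import Path


def expected_pdf_path(outbase):
    """Return the path Tesseract writes to for a given output base name."""
    # Plain concatenation on purpose: with_suffix() would replace an
    # existing suffix such as "scan.v2" instead of appending ".pdf".
    return Path(str(outbase) + ".pdf")


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF magic bytes."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


if __name__ == "__main__":
    pdf_path = expected_pdf_path("my_first_pdf")
    print(pdf_path, "looks like a PDF:", looks_like_pdf(pdf_path))
```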