
Create searchable PDF with Tesseract

Open minamotorin opened this issue 3 years ago • 1 comments

This pull request makes GoBooDo produce a searchable PDF using OCR. See also: #58.

How does this work

Tesseract makes a searchable PDF from each page image, and the resulting PDFs are merged with PyPDF2.
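The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the function names `tesseract_command`, `ocr_pages`, and `merge_pdfs` are hypothetical, it assumes the `tesseract` command-line tool is on the PATH, and it assumes a PyPDF2 release that provides `PdfMerger` (older releases name the class `PdfFileMerger`).

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang):
    """Build the Tesseract CLI call that writes <output_base>.pdf with a text layer."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_pages(image_paths, work_dir, lang):
    """Run Tesseract once per page image and return the per-page PDF paths."""
    pdf_paths = []
    for image_path in image_paths:
        output_base = Path(work_dir) / Path(image_path).stem
        # check=True raises CalledProcessError if Tesseract exits with an error.
        subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
        # Tesseract appends ".pdf" to the output base itself.
        pdf_paths.append(Path(str(output_base) + ".pdf"))
    return pdf_paths


def merge_pdfs(pdf_paths, output_path):
    """Concatenate the per-page PDFs, in order, into one file."""
    # Imported here so the helpers above work even without PyPDF2 installed.
    # Assumption: a PyPDF2 release that has PdfMerger (older ones: PdfFileMerger).
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    for pdf_path in pdf_paths:
        merger.append(str(pdf_path))
    with open(output_path, "wb") as handle:
        merger.write(handle)
    merger.close()
```

With these helpers, `merge_pdfs(ocr_pages(images, tmp_dir, "eng"), "book.pdf")` would produce one searchable file from a list of page images.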

Usage

If lang is missing from settings.json or is empty, GoBooDo creates an unsearchable PDF (same as now).

If lang is not empty, GoBooDo creates a searchable PDF. OCR is run on the assumption that the book is written in the language given in lang.
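The rule described under Usage could be expressed like this. It is only a sketch: the helper name `ocr_language` is hypothetical, and the `lang` key is the only part taken from this pull request.

```python
import json


def ocr_language(settings):
    """Return the OCR language from the settings, or None when OCR should be skipped."""
    lang = settings.get("lang")
    # A missing key, an empty string, or a non-string value all mean "no OCR".
    if isinstance(lang, str) and lang.strip():
        return lang.strip()
    return None


def load_settings(path="settings.json"):
    """Read the settings file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A caller would then build the plain PDF when `ocr_language(...)` returns `None` and the searchable one otherwise.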

Note

This pull request adds a dependency (PyPDF2). If users update GoBooDo without installing PyPDF2, makePDF.py will fail with a missing-module error.
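One possible way to soften the missing-dependency failure is to check for the module before starting OCR and print a clear hint. This is a sketch of that idea only, not something this pull request implements; `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["PyPDF2"])
    if missing:
        # Tell the user what to install instead of failing with a traceback later.
        print("Missing modules: " + ", ".join(missing) + " (try: pip install " + " ".join(missing) + ")")
```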

OCR takes time, so running it before GoBooDo has finished downloading all images wastes time and electricity (#59).

If users want to OCR languages other than English, they need to install the additional Tesseract language data. Alternative language data sets are also available, either more accurate but slower, or faster.

The default page_resolution will not be high enough for OCR. I use 1200.

Some of the English wording may need review; feedback is welcome.

minamotorin avatar Sep 28 '21 19:09 minamotorin

Thanks for your contribution! The use case is compelling. Can you please add some tests too, so that we can ensure these changes do not break the current functionality? Thanks!

vaibhavk97 avatar Nov 19 '22 18:11 vaibhavk97