GoBooDo
Create searchable PDF with Tesseract
This pull request lets GoBooDo produce a searchable PDF using OCR. See also: #58.
How does this work
Tesseract creates searchable PDFs from the page images, and PyPDF2 merges those PDFs into one.
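As a rough illustration of the flow described above (not the code from this pull request), a sketch might OCR each page image into a one-page PDF and then merge the results. It assumes the pytesseract wrapper and PyPDF2 2.x or later, where the merger class is named PdfMerger (older releases call it PdfFileMerger). The function names, the injectable ocr parameter, and the file names are hypothetical.

```python
import io


def ocr_pages(image_paths, lang, ocr=None):
    """Return one searchable single-page PDF (as bytes) per image."""
    if ocr is None:
        # Imported lazily so this module loads even without pytesseract.
        import pytesseract

        def ocr(path, lang):
            # Tesseract renders the image plus an invisible text layer.
            return pytesseract.image_to_pdf_or_hocr(
                path, lang=lang, extension="pdf"
            )

    return [ocr(path, lang) for path in image_paths]


def merge_pdfs(pdf_pages, output_path):
    """Concatenate in-memory PDFs into one file, in the given order."""
    from PyPDF2 import PdfMerger  # assumption: PyPDF2 2.x or later

    merger = PdfMerger()
    for page in pdf_pages:
        merger.append(io.BytesIO(page))
    with open(output_path, "wb") as out:
        merger.write(out)
    merger.close()
```

With both libraries installed, something like `merge_pdfs(ocr_pages(["page1.png", "page2.png"], "eng"), "book.pdf")` would write the merged file. Passing the OCR step in as a callable is only a convenience for trying the sketch without Tesseract.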
Usage
If lang is missing from settings.json or is empty, GoBooDo creates an unsearchable PDF (same as now).
If lang is not empty, GoBooDo creates a searchable PDF, running OCR on the assumption that the book is written in the language given by lang.
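The rule above could be expressed roughly as follows. This is a minimal sketch, not the code in makePDF.py: the helper names are hypothetical, the example value "eng" assumes lang holds a Tesseract language code, and treating a whitespace-only string as empty is an added assumption.

```python
import json


def load_settings(path="settings.json"):
    """Read the settings file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def ocr_language(settings):
    """Return the OCR language, or None when OCR should be skipped."""
    lang = settings.get("lang")
    # A missing key and an empty string both mean "no OCR".
    # Assumption: whitespace-only values are treated as empty too.
    if not isinstance(lang, str) or not lang.strip():
        return None
    return lang.strip()
```

For example, `ocr_language({"lang": "eng"})` returns `"eng"`, while `ocr_language({})` and `ocr_language({"lang": ""})` both return `None`, which would select the existing unsearchable output.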
Note
This pull request adds a dependency (PyPDF2). If a user updates GoBooDo without installing PyPDF2, a "No module named" error will occur in makePDF.py.
OCR takes time, so running it before GoBooDo has finished downloading all images wastes time and electricity (#59).
If a user wants to run OCR in a language other than English, they need to install additional language data. There are also alternative language data sets that are either more accurate (but slower) or faster.
For OCR, the default page_resolution will not be enough. I use 1200.
Some of the English sentences may need feedback.
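The notes above about the new PyPDF2 dependency and the extra language data suggest a check before OCR starts. The following is only a sketch of one possible approach, not part of this pull request: all function names are hypothetical, and it assumes the tesseract executable is on PATH and supports the --list-langs option.

```python
import importlib.util
import shutil
import subprocess


def missing_modules(names=("PyPDF2",)):
    """Return the top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the header, e.g. 'List of available languages (2):'.
    return [line for line in lines if line and not line.startswith("List of")]


def installed_languages():
    """Return installed Tesseract languages, or [] if tesseract is absent."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Read both streams, since the list may not always arrive on stdout.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


def preflight(lang, names=("PyPDF2",)):
    """Return a list of problems; an empty list means OCR can be attempted."""
    problems = ["missing Python module: " + n for n in missing_modules(names)]
    available = installed_languages()
    # Tesseract accepts combined languages such as 'eng+jpn'.
    for code in lang.split("+"):
        if code not in available:
            problems.append(
                "no Tesseract language data for '%s' "
                "(or tesseract is not on PATH)" % code
            )
    return problems
```

If `preflight("eng")` returns a non-empty list, a caller could print the problems and fall back to the unsearchable PDF instead of failing partway through.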
Thanks for your contribution! The use case is compelling. Could you please add some tests too, so that we can ensure these changes do not break the current functionality? Thanks!