
Is it not possible to get a searchable text-only PDF?

Open DoctorSubtilis opened this issue 4 years ago • 4 comments

the problem

After converting a scanned PDF to a searchable PDF, the file is still very large, even bigger than the original.
I guess that is because the text is added to the images rather than replacing them.

a solution?

But I wonder whether it would be possible to get a text-only (and so considerably lighter) PDF, ideally keeping the formatting (using the right fonts), or, if necessary, without formatting (keeping only the page numbers, which are necessary for academic work).
It would be a file similar to an SVG (or to a PDF exported from an ODT text file). Is that technically impossible?
Thanks!

DoctorSubtilis, Apr 18 '20 15:04

You can try the highest optimization setting (-O3), and make sure you have a JBIG2 encoder and pngquant installed; both are optional utilities.
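As a rough illustration of that suggestion, the Python sketch below checks whether the two optional tools are on the PATH and builds the command line with the highest optimization level. The function names are hypothetical, and the executable names `jbig2` and `pngquant` are assumptions about how the tools are installed on a typical system; the file names are placeholders.

```python
import shutil
import subprocess

# Assumed executable names: jbig2enc usually installs a binary called "jbig2".
OPTIONAL_TOOLS = ("jbig2", "pngquant")


def missing_optional_tools(which=shutil.which):
    """Return the optional optimizers that cannot be found on the PATH."""
    return [tool for tool in OPTIONAL_TOOLS if which(tool) is None]


def build_optimize_command(input_pdf, output_pdf, level=3):
    """Build an ocrmypdf command line using the given optimization level (0-3)."""
    if level not in (0, 1, 2, 3):
        raise ValueError("optimization level must be 0, 1, 2 or 3")
    return ["ocrmypdf", "--optimize", str(level), input_pdf, output_pdf]


def run_optimized(input_pdf, output_pdf):
    """Warn about missing tools, then run ocrmypdf (requires ocrmypdf installed)."""
    for tool in missing_optional_tools():
        print(f"warning: optional tool '{tool}' not found; output may be larger")
    subprocess.run(build_optimize_command(input_pdf, output_pdf), check=True)
```

Calling `run_optimized("scan.pdf", "scan_ocr.pdf")` would run the conversion; the other two functions can be used on their own to inspect the command first.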

Without looking at your file, I cannot tell why it could not be optimized further. If you wish, there are instructions available for encrypting the file so that only I can use it.

It is not within the capabilities of Tesseract OCR to determine what fonts were used in the original document and reconstruct a digital file along those lines. ABBYY OCR (commercial) can do that to a limited extent (it only works well with a good scan of a document that uses only common fonts), but usually the output will require extensive manual repair. The advantage of keeping the image and adding OCR is that a human can figure out any mistakes that happen to be in the OCR text.

jbarlow83, Apr 18 '20 21:04

One more thing - you can use the --sidecar option to get a pure text version of your file, but it's only weakly formatted.
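For illustration, a minimal Python sketch of how such an invocation might be assembled. The function names and the file names are placeholders chosen for this example; only the `--sidecar` option itself comes from the comment above.

```python
import shlex


def build_sidecar_command(input_pdf, output_pdf, sidecar_txt):
    """Build an ocrmypdf command that also writes the OCR text to sidecar_txt."""
    return ["ocrmypdf", "--sidecar", sidecar_txt, input_pdf, output_pdf]


def as_shell_line(command):
    """Render the argument list as a single, safely quoted shell line."""
    return shlex.join(command)
```

The rendered line can be pasted into a terminal, or the list can be passed to `subprocess.run` directly.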

jbarlow83, Apr 18 '20 21:04

Thank you! I can't follow your first suggestion because, on KDE Neon based on Kubuntu 18.04, I don't yet have the latest release of ocrmypdf.
But I did follow your second tip, and it worked: the txt file, despite its poor formatting, can be opened with LibreOffice and keeps the correct page breaks, which is very important for my work.
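In case it helps someone else, here is a small Python sketch for working with the page breaks in the txt file. It assumes the pages are separated by a form feed character (`\f`), which is what LibreOffice appears to treat as a page break; I have not verified this for every version. The function names are hypothetical.

```python
def split_sidecar_pages(text):
    """Split sidecar text into pages, assuming form feed (\\f) separates them."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty last entry; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def page_of(pages, phrase):
    """Return the 1-based number of the first page containing phrase, or None."""
    for number, page in enumerate(pages, start=1):
        if phrase in page:
            return number
    return None
```

With the file read into a string, `page_of(split_sidecar_pages(text), "some quotation")` gives the page number to cite, counted from the first scanned page.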

DoctorSubtilis, Apr 20 '20 07:04

Install ocrmypdf via pip :) The newest version is always on there.

JulianWgs, May 23 '20 17:05