OCRmyPDF
Is it not possible to get a text-only searchable PDF?
the problem
After converting a scanned PDF to a searchable PDF, the file is still very large, even bigger than the original.
That is, I guess, because the text is added on top of the images rather than replacing them.
a solution?
But I wonder whether it would be possible to get a text-only (and so considerably lighter) PDF, possibly keeping the formatting (using the right fonts), or even without formatting if necessary (keeping only the page numbers, which are necessary for academic work).
It would be a file similar to an SVG (or to a PDF exported from an ODT text file). Is that technically impossible?
Thanks!
You can try the higher optimization settings (-O3) and make sure you have JBIG2 and pngquant installed; both are optional utilities.
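For anyone scripting this, here is a minimal Python sketch of the suggestion above. It assumes the optional tools are on PATH under the executable names `jbig2` and `pngquant` (the `jbig2` name is an assumption about how jbig2enc is packaged on your system), and the helper names `missing_optional_tools` and `build_command` are made up for illustration; they are not part of ocrmypdf.

```python
import shutil

# Assumed executable names for the optional optimizers; adjust if your
# distribution packages them differently.
OPTIONAL_TOOLS = ("jbig2", "pngquant")


def missing_optional_tools(which=shutil.which):
    """Return the optional optimizer tools that are not found on PATH."""
    return [tool for tool in OPTIONAL_TOOLS if which(tool) is None]


def build_command(input_pdf, output_pdf, level=3):
    """Build an ocrmypdf command line using the given optimization level."""
    return ["ocrmypdf", "--optimize", str(level), input_pdf, output_pdf]
```

To actually run it, pass the list to `subprocess.run(build_command("scan.pdf", "out.pdf"), check=True)` after checking that `missing_optional_tools()` returns an empty list.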
Without looking at your file I cannot tell why it is not being optimized further. If you wish to share it, there are instructions available for encrypting the file so that only I can read it.
It is not within the capabilities of Tesseract OCR to determine what fonts were used in the original document and reconstruct a digital file along those lines. Abbyy OCR (commercial) can do that to a limited extent (only works well with a good scan of a document that uses only common fonts), but usually the output will require extensive manual repair. The advantage of keeping the image and adding OCR is that a human can figure out any mistakes that happen to be in the OCR text.
One more thing: you can use the --sidecar option to get a pure text version of your file, but it's only weakly formatted.
Thank you! I can't follow your first suggestion because KDE Neon, which is based on Kubuntu 18.04, does not yet have the latest release of ocrmypdf.
But I did follow your second tip, and it worked: the txt file, despite its poor formatting, can be opened with LibreOffice and keeps the right page breaks, which is very important for my work.
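In case it helps others who need page numbers: the page breaks survive because, as far as I understand, the sidecar file separates pages with a form feed character (`\f`). Treat that as an assumption and check your own file. Under that assumption, a small Python sketch can split the text into pages and put an explicit page marker in front of each one. The function names and the `[page N]` marker format are invented for this example.

```python
def split_pages(sidecar_text):
    """Split sidecar text into per-page strings on the form feed character."""
    pages = sidecar_text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def label_pages(sidecar_text, first_page=1):
    """Return the text with a '[page N]' marker line before each page."""
    pages = split_pages(sidecar_text)
    return "\n".join(
        f"[page {number}]\n{text.strip()}"
        for number, text in enumerate(pages, start=first_page)
    )
```

Reading the sidecar file with `open(path, encoding="utf-8").read()` and passing the result to `label_pages` gives a plain-text version where each page can still be cited by number. Set `first_page` if the scan does not start at page 1.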
Install ocrmypdf via pip :) The newest version is always on there.