gImageReader
gImageReader copied to clipboard
file size and overlay PDF encoding
I've recently come to use (and like) gImageReader. It does exactly what it's supposed to do, and does a pretty terrific job at it. The only "shortcoming" that I find with respect to (expensive!) commercial solutions is that the resulting PDFs are sometimes very large. I have gone to some lengths to investigate why this might be, and think to have found the reason:
gImareReader seems to encode the images with FlateDecode, whereas the commercial solution I'm comparing to uses JPXDecode with the "mask" feature to separate the background image from the layer with the text and pictures, which seems to allow for some highly efficient and close-to-lossless compression.
Is there a chance of adding this different encoding method also to gImageReader, and if so, where would this need to be added/changed?
I can understand that this might involve some significant overhead and might not be worth the effort at this point, but if this turns out to be relatively straight-forward, I would love to give it a go myself (with some appropriate pointers, of course).
Firstly, it will depend on the PDF export backend you use. If you use QPrinter, images will always get compressed as JPEG. If you choose PoDoFo, depending on the image format, you have more options available. JPXDecode is currently not implemented in PoDoFo, it'll need to be implemented separately, and the result injected into the PDF document, as is currently done with the CCITT encoding, see [1]. If you want to give it a go, I can give you some general pointers about integrating the code in gImageReader.
[1] https://github.com/manisandro/gImageReader/blob/84ad82d9ed77094d5b223daa62d7887eb594f33e/qt/src/hocr/HOCRPdfExporter.cc#L536
Hi! Thanks a lot for the reply, I might actually try it in the next days, especially since I do have some experience with PoDoFo already, so this might be a smooth start. Any pointers would be highly appreciated, I do have some (entry-level) question first though: How and where do I choose the backend? Is this is something that can be chosen at runtime, or is there a CMake option for it? (Sorry if this is in the documentation somehwere, but I didn't find it immediately.)
Cool!
The backend selection is exposed in the PDF Export dialog:

As a user of this program, i agree about the size issue, i wish my output file size were similar to the input one, without compression, and this piece of software would be perfect. 😄
Bump. If a new vers. is going to be dropping due to a new vers. of Tesseract, the ability to mirror the raw input images to the output would be helpful. I have a scan of a 264 page book from an old physical book with no OCR. While the program reports it has text already, it seems not to; but this wasn't a problem. It did an excellent job of recognizing the text in the PDF, and while it took some searching to figure out it could do an invisible text overlay, now I know how, and it works beautifully.
While it is no stopper, my file size of my resulting OCR'ed PDF has increased considerably. Using the lossless png output, it was way too huge. So, I wound up using 50% JPEG compression. The text has degraded subtly upon comparison, and some letters now have gaps in them; but it's tolerable. The file size has increased from ~13 MB to ~71MB. I'd have better graphics quality, and less than a quarter of the file size if the original raw graphics were used. Hopefully, this is not too difficult to implement as a separate image choice.
Again, while this is no stopper, and I love the program, especially after finding out it can output invisible text over the graphics in a PDF, this seems like it would be a major improvement.
Unfortunately it's not trivial, it requires someone with PDF knowhow which I don't really have...
Something is strange for me, no matter how I try to tune the DPI settings, the image in the exported pdf is always a bit different than the one in the original document. Is there a way to leave the original image intact when exporting?