tesseract PDF Renderer: allow specifying an alternate image or a custom resolution.

Motivation

Input images passed to OCR are often pre-processed (upscaled, converted to grayscale, etc.). It can be useful to specify an alternate image or a lower resolution in the renderer, especially when exporting a searchable PDF.

Proposed changes

Added TessResultRenderer::SetRenderingImage and TessResultRenderer::SetRenderingResolution methods, which allow programmatically changing the image or resolution to render before adding the image to the renderer
New rendering_dpi param overrides the output resolution by scaling the source image
Added a few pdfrenderer tests
Fixed missing pdf.ttf font in the cmake install target
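
To illustrate what "overriding the output resolution by scaling the source image" could mean in practice, here is a minimal sketch. It is not the code from this pull request: the `ImageSize` struct and the `ScaleForRenderingDpi` helper are hypothetical names, and the assumption is that the scale factor is simply the ratio of the requested DPI to the source DPI, so the physical page size stays the same while the pixel count changes.

```cpp
#include <cmath>

// Hypothetical type for illustration only; not part of the Tesseract API.
struct ImageSize {
  int width;   // pixels
  int height;  // pixels
};

// Hypothetical helper: computes the pixel dimensions an image would have
// after being resampled from source_dpi to rendering_dpi.
// Assumption: scale = rendering_dpi / source_dpi, applied to both axes.
// A non-positive DPI is treated as "no override" and returns the input as-is.
ImageSize ScaleForRenderingDpi(ImageSize src, int source_dpi, int rendering_dpi) {
  if (source_dpi <= 0 || rendering_dpi <= 0) {
    return src;
  }
  const double scale = static_cast<double>(rendering_dpi) / source_dpi;
  ImageSize out;
  out.width = static_cast<int>(std::lround(src.width * scale));
  out.height = static_cast<int>(std::lround(src.height * scale));
  return out;
}
```

Under these assumptions, an A4 page scanned at 300 DPI (2480 x 3508 pixels) and rendered at 150 DPI would be embedded at 1240 x 1754 pixels, which is a quarter of the pixel data for the same page size.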

These changes might resolve feature requests #210 and #3798.

Checks

[X] make check passed locally on Ubuntu 23.10
[ ] GitHub workflows passed

Dec 19 '23 00:12 phymbert

cmake does not install a PDF font file. That was the old way of handling the font in PDFs. Now it is automatically included in the library.

Dec 20 '23 07:12 zdenop

@jbreiden @jbreiden2: Jeff, can you have a look at this?

Dec 20 '23 07:12 zdenop

cmake does not install a PDF font file. That was the old way of handling the font in PDFs. Now it is automatically included in the library.

Thanks, @zdenop, for the explanation. I was confused by tessdata/Makefile.am, and I will remove it. Let me submit fixes for the pdfrenderer tests, which failed on some platforms.

@jbreiden @jbreiden2, a better way to check the generated PDF files than a maximum file size would be welcome.

Dec 20 '23 10:12 phymbert

It looks like there is little interest; that happens :) Thanks all

Apr 14 '24 18:04 phymbert

Hi, it's not unusual that pull requests take some time before they are merged. That does not necessarily mean that there is little interest, but there is only a small number of people who contribute to pull requests by adding comments or testing them.

Apr 14 '24 21:04 stweil

No worries at all. I just saw it open on my to-do list for a while, so I preferred to close it. Thanks for your feedback; I understand. Reopened, no hurry.

Apr 14 '24 21:04 phymbert

Since it extends the API functionality, it should be included in the 5.4.0 release.

Apr 18 '24 09:04 zdenop

I rebased this pull request and fixed a merge conflict.

Apr 19 '24 19:04 stweil

What about also implementing this feature in the tesseract executable as a command-line option?

Apr 19 '24 20:04 zdenop

Isn't that already possible with -c?

Apr 20 '24 05:04 stweil

Isn't that already possible with -c?

With -c I can set rendering_dpi. How can I set an image for SetRenderingImage?

Apr 20 '24 17:04 zdenop

Tesseract can create multi-page PDF files when it is called with a list of images. Ideally that should also work with alternate images.

May 19 '24 16:05 stweil

Isn't that already possible with -c?

With -c I can set rendering_dpi. How can I set an image for SetRenderingImage?

Would it be possible to implement the desired features by only adding new Tesseract parameters – without any change of the C / C++ API?

May 19 '24 16:05 stweil
