
[Tesseract 4] Parallel Processing of Pages / Better Performance

Open MichaelPeter opened this issue 4 years ago • 6 comments

Hello everybody,

Currently, Fast processing takes 4 seconds per page and Best processing takes 9 seconds per page, but I still don't see much load on the CPU of the test machine (so far my laptop).

Since we have 70-page PDF documents (which are converted to TIFF and then to Pix), I thought about processing the pages in parallel and then outputting them as a (text-only) PDF, since 70 x 4 seconds is about 4.7 minutes.

However, I get an exception if I try to recognize multiple pages at the same time, since ResultRenderer.AddPage calls page.Recognize() and this does not allow multiple pages to be recognized at the same time. I can't find the error message.

So, to improve performance and CPU usage: should I start 5 TesseractEngines (one per page), or are there better methods to increase utilization that I should look into first?

Thank you all for your help

MichaelPeter avatar Sep 27 '20 18:09 MichaelPeter

One engine per page. Make sure AVX2 is available on the OS and CPU, and measure performance with and without OpenMP.
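
The "one engine per page" advice can be sketched roughly as a small pool of engines, where each page borrows one engine exclusively and hands it back when done. This is only an illustration in Python under stated assumptions: `FakeEngine`, `ocr_pages` and `estimated_minutes` are hypothetical names, the engine is a stand-in that refuses concurrent use (modelling the exception described above), and the time estimate assumes perfect scaling, which real hardware will not reach.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeEngine:
    """Hypothetical stand-in for an OCR engine that handles one page at a time."""

    def __init__(self):
        self._busy = threading.Lock()

    def recognize(self, page):
        # Model the constraint from the thread: concurrent use of a single
        # engine is an error, so fail loudly if that ever happens.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError("engine is already processing a page")
        try:
            return f"text of page {page}"
        finally:
            self._busy.release()


def ocr_pages(pages, workers=4):
    """Recognize pages in parallel with one engine per worker, keeping page order."""
    # Engine pool: each task takes one engine exclusively and returns it afterwards.
    engines = queue.Queue()
    for _ in range(workers):
        engines.put(FakeEngine())

    def run(page):
        engine = engines.get()
        try:
            return engine.recognize(page)
        finally:
            engines.put(engine)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the output document can
        # be assembled page by page afterwards.
        return list(pool.map(run, pages))


def estimated_minutes(page_count, seconds_per_page, workers):
    """Rough wall-clock estimate, assuming perfect scaling across workers."""
    batches = -(-page_count // workers)  # ceiling division
    return batches * seconds_per_page / 60
```

With the numbers from the question, `estimated_minutes(70, 4, 1)` gives about 4.7 minutes and `estimated_minutes(70, 4, 5)` gives just under 1 minute as a best case. Recognition runs in parallel, while writing the pages into the output document would still happen sequentially in page order.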

tdhintz avatar Sep 27 '20 18:09 tdhintz

Thank you very much for the quick answer :)

MichaelPeter avatar Sep 27 '20 18:09 MichaelPeter

Performance is better with clean images. Noise and complexity cause considerable performance loss.

tdhintz avatar Sep 28 '20 08:09 tdhintz

One more thought... Let Leptonica load the image directly from the TIFF rather than doing your own conversion to Pix.

tdhintz avatar Sep 28 '20 08:09 tdhintz

Currently I use Ghostscript to convert the PDF pages to a 300x300 pixel TIFF and save it as tiffg4 (so a black/white TIFF), and then load it using PixArray.LoadMultiPageTiffFromFile. Is that what you mean by Leptonica?

When you say noise causes performance loss, should I do more image preprocessing? For example, converting to a different black/white format, removing the noise using preprocessing, or applying a GaussianBlur: https://www.freecodecamp.org/news/getting-started-with-tesseract-part-ii-f7f9a0899b3f/

Thanks and greetings, Michael

MichaelPeter avatar Sep 28 '20 09:09 MichaelPeter

Yes, I mean loading from the multi-page TIFF.

There are many PDF architectures. If these are scanned or converted from microfilm, they'll be dirty images wrapped in PDF wrappers, so cleanup can help. If they are machine generated, they might already be text documents and cleanup won't help.
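
As a rough idea of what "cleanup" can mean for a black/white scan, here is a minimal despeckle sketch in plain Python. It is an assumption-laden illustration, not what Tesseract or Leptonica does internally: the image is a list of rows with 1 for ink and 0 for background, `despeckle` is a hypothetical helper name, and the filter is a simple 3x3 majority vote (a median filter for bilevel data). A real pipeline would use an image library instead.

```python
def despeckle(image):
    """Remove isolated pixels from a bilevel image (1 = ink, 0 = background).

    Each pixel is replaced by the majority value of its 3x3 neighbourhood.
    Border pixels use whatever neighbours exist.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # Majority vote: keep ink only if more than half the window is ink.
            row.append(1 if sum(window) * 2 > len(window) else 0)
        cleaned.append(row)
    return cleaned
```

A single stray pixel on a blank page is removed, while large solid areas are left unchanged. The same filter also rounds off corners and can erase very thin strokes, so it is worth checking recognition quality as well as speed after any cleanup step.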

tdhintz avatar Sep 28 '20 10:09 tdhintz