tesseract Run LSTM recognition in multiple threads

Use the init-time option lstm_num_threads to set the number of LSTM threads. This lets word recognition run independently in multiple threads, making effective use of multi-core processors.
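To illustrate the general idea of recognizing words independently on several threads, here is a minimal sketch. It is not the code from this pull request: `RecognizeWord` and `RecognizeAll` are hypothetical names, and the "recognition" is a placeholder string operation. Each worker takes a strided subset of the words and writes only to its own result slots, so this simplified version needs no locking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for recognizing one word image; not a Tesseract function.
std::string RecognizeWord(const std::string& word_image) {
  return "text(" + word_image + ")";
}

// Recognizes all words using num_threads workers. Worker t handles indices
// t, t + num_threads, t + 2 * num_threads, ... and writes only to those
// slots of the result vector, so the workers never touch the same element.
std::vector<std::string> RecognizeAll(const std::vector<std::string>& words,
                                      unsigned num_threads) {
  assert(num_threads > 0);
  std::vector<std::string> results(words.size());
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&words, &results, t, num_threads] {
      for (std::size_t i = t; i < words.size(); i += num_threads) {
        results[i] = RecognizeWord(words[i]);
      }
    });
  }
  // Wait for every worker before returning the results.
  for (auto& worker : workers) {
    worker.join();
  }
  return results;
}
```

Calling `RecognizeAll(words, 4)` and `RecognizeAll(words, 1)` should give the same results in the same order; only the wall time differs.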

Following are my test results for a sample screenshot.

CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
OS: Windows
Compiler: MSVC 19.38.33130.0 (installed from Visual Studio 2022)
Model: eng.traineddata from tessfast
PSM: 6

Total time taken for the Recognize API call, built without OpenMP:
With lstm_num_threads=1, total time taken = 3.95 seconds
With lstm_num_threads=4, total time taken = 1.4 seconds

On the other hand, here are the numbers with OpenMP:
OMP_THREAD_LIMIT not set, total time taken = 3.59 seconds
OMP_THREAD_LIMIT=4, total time taken = 3.57 seconds
OMP_THREAD_LIMIT=1, total time taken = 4.19 seconds

As we can observe, this branch with lstm_num_threads set to 4 performs considerably better than the OpenMP multithreading currently supported (1.4 seconds versus 3.57 seconds with OMP_THREAD_LIMIT=4). Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Jun 27 '24 13:06 jkarthic

Many thanks for this nice contribution.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways? If a command line argument is desired (like in the case of --dpi), I think that there might be more user friendly variants. Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

Maybe we could also extend the command line syntax to have --PARAMETER VALUE as an alternative for -c PARAMETER=VALUE for any Tesseract parameter.
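As a rough sketch of what such a generic `--PARAMETER VALUE` syntax could look like, the function below turns `--some-name VALUE` pairs into `some_name` / `VALUE` entries. Everything here is an assumption for illustration: `ParseLongParams` is a hypothetical helper, the hyphen-to-underscore mapping is my guess at a convenient convention, and the sketch ignores how existing options such as --dpi would need to be kept separate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: converts "--some-name VALUE" pairs into
// {"some_name", "VALUE"} entries. Arguments that do not start with "--"
// are skipped in this sketch.
std::vector<std::pair<std::string, std::string>> ParseLongParams(
    const std::vector<std::string>& args) {
  std::vector<std::pair<std::string, std::string>> params;
  for (std::size_t i = 0; i + 1 < args.size(); ++i) {
    const std::string& arg = args[i];
    if (arg.size() > 2 && arg.compare(0, 2, "--") == 0) {
      std::string name = arg.substr(2);
      // Assumed convention: hyphens on the command line map to underscores.
      std::replace(name.begin(), name.end(), '-', '_');
      params.emplace_back(name, args[i + 1]);
      ++i;  // Skip the value that was just consumed.
    }
  }
  return params;
}
```

With this, `--lstm-num-threads 4` would be treated the same as a `lstm_num_threads=4` assignment, provided the resulting pairs are applied at the right stage of initialization.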

Jun 27 '24 21:06 stweil

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

Jun 27 '24 21:06 stweil

Many thanks for this nice contribution.

And many thanks to you for reviewing this patiently.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways?

lstm_num_threads is an init-time parameter. The LSTMRecognizer instances are created during init, so setting this new parameter with -c lstm-num-threads=N will not work: that sets the variable after init is done.
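The ordering problem can be shown with a small self-contained sketch. `Engine` and `Recognizer` are hypothetical stand-ins, not Tesseract classes; the point is only that a value read once during init has no effect if it is changed afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-thread recognizer; not a Tesseract class.
struct Recognizer {};

// Hypothetical engine whose recognizer pool is sized once, during Init().
class Engine {
 public:
  // The thread count is read here, and only here, to create the pool.
  void Init(int num_threads) {
    assert(num_threads > 0);
    num_threads_ = num_threads;
    recognizers_.assign(static_cast<std::size_t>(num_threads), Recognizer{});
  }

  // Mimics a variable update after init: the stored value changes,
  // but the pool created in Init() is not rebuilt.
  void SetVariable(int num_threads) { num_threads_ = num_threads; }

  int num_threads() const { return num_threads_; }
  std::size_t pool_size() const { return recognizers_.size(); }

 private:
  int num_threads_ = 1;
  std::vector<Recognizer> recognizers_;
};
```

After `Init(1)` followed by `SetVariable(4)`, the variable reads 4 but the pool still holds a single recognizer, which is the situation described above.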

Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

When I tested tesseract with a psm of 3 (which is the default for tesseract.exe), page segmentation took significantly more time than the actual LSTM recognition. For example, in one of my tests, page segmentation took ~7 seconds and LSTM took ~3 seconds, for a total of ~10 seconds. Users running with the default psm parameter should not expect the entire 10 seconds to run in multiple threads. In this case, the major part (~7 seconds) will run single-threaded and only the minor part (~3 seconds) will be multithreaded. Hence I thought a longer name sets the user's expectation right: only a portion of tesseract will run multithreaded. Also, there are other thread-count variables related to OpenMP inside the code with generic names, such as kNumThreads, __num_threads and num_threads. Naming this lstm_num_threads also marks it as a separate variable, not to be confused with the OpenMP num threads.
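To put a number on that expectation, here is a small estimate in the style of Amdahl's law. It assumes the LSTM part scales perfectly with the thread count, which real runs will not quite reach; `EstimatedSeconds` is a hypothetical helper, not part of this change.

```cpp
#include <cassert>
#include <cmath>

// Estimated wall time when only parallel_seconds is divided across threads
// and serial_seconds stays single-threaded. Assumes ideal scaling.
double EstimatedSeconds(double serial_seconds, double parallel_seconds,
                        int threads) {
  assert(threads > 0);
  return serial_seconds + parallel_seconds / threads;
}
```

For the example above, `EstimatedSeconds(7.0, 3.0, 4)` gives 7.75 seconds, so the overall speedup is about 10 / 7.75 ≈ 1.3x even though the LSTM part alone would run 4x faster.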

Jun 28 '24 05:06 jkarthic

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

Totally agreed. This is meant for latency-sensitive real-time applications, with OCR probably running on the consumer's device itself.

Jun 28 '24 05:06 jkarthic

@stweil I observed a crash in the earlier code: WERD_RES objects freed by one thread were still being used by another thread that was iterating through the WERD_RES singly linked list. To fix this, I have modified the WERD_RES linked list to use shared pointers instead of raw pointers, so that the lifetime of the objects is managed automatically. I have also added mutex protection around the PAGE_RES_IT functions that modify this list, in order to avoid race conditions. Please take a look at the modifications whenever you get some time.
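The pattern can be sketched as follows. This is a simplified illustration, not the actual change: `WordNode` and `WordList` are hypothetical types standing in for WERD_RES and its list, and only head insertion and removal are shown.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical node standing in for a per-word result; not WERD_RES.
struct WordNode {
  std::string text;
  std::shared_ptr<WordNode> next;
};

// Hypothetical singly linked list whose modifications are mutex-protected.
class WordList {
 public:
  // Links a new node at the head while holding the lock.
  void PushFront(std::string text) {
    auto node = std::make_shared<WordNode>();
    node->text = std::move(text);
    std::lock_guard<std::mutex> lock(mutex_);
    node->next = head_;
    head_ = node;
  }

  // Unlinks the head while holding the lock. The node is destroyed only
  // when the last shared_ptr referring to it goes away.
  void PopFront() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (head_) {
      head_ = head_->next;
    }
  }

  // Returns a shared_ptr copy, which keeps the node alive for the caller.
  std::shared_ptr<WordNode> Head() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return head_;
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<WordNode> head_;
};
```

A reader that obtained a node through `Head()` can keep using it after another thread calls `PopFront()`, because its own shared_ptr keeps the node alive. The sketch never changes a node's `next` pointer after the node is published; a list that does so would need further synchronization around those writes.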

Jul 05 '24 14:07 jkarthic

I suggest using the previous version as the base.

Jul 05 '24 15:07 egorpugin

Now it is much much worse.

@egorpugin I am not sure I understand your comment here. Could you please elaborate on what is "much much worse"?

Jul 05 '24 16:07 jkarthic

@egorpugin I am not sure I understand your comment here. Could you please elaborate on what is "much much worse"?

More complex code.
A lot of sync.
Much harder to review.
Most likely a 'no go' in current state.

You need to provide a very detailed description of:

- Algorithm: how does it work? Is it possible to sync less?
- Changes in files: I see new types and mutex locks in some existing functions. For an example of how this can be described, see gcc commit messages, e.g. https://github.com/gcc-mirror/gcc/commit/5185274c76cc3b68a38713273779ec29ae4fe5d2 (bottom part of the commit message)

Jul 05 '24 16:07 egorpugin

I tested this PR on my Mac (M1 chip). I have a few observations to share:

The latest commit (d1eed6a) does not compile successfully on my system. I encountered multiple errors related to ELIST_ITERATOR_T, BLOBNBOX_IT, and other classes. It seems some changes may be missing or incomplete in this commit.
The initial commit (6a2e239) does compile and produces a working binary. Using the new --lstm-num-threads option, I was able to achieve a ~2x speedup (compared to 5.4.1 + OpenMP) by finding the optimal N value for my system.
However, I noticed an issue with language/script recognition when using multiple languages. For example:
- Using -l "eng+ukr" tries to recognize Cyrillic chars as Latin chars
- Using -l "ukr+eng" vice-versa: incorrectly recognizes Cyrillic in places with Latin chars

This behavior differs from Tesseract 5.4.1, which produces correct output for the same inputs in both cases.

I hope this feedback is helpful.

Sep 25 '24 12:09 burningfireplace

@burningfireplace

Thanks for trying it out and providing a detailed analysis. Here is my reply.

I was using Windows while developing this. I never tested it on Mac, as the Mac has Apple's built-in Vision OCR API, which in my experience performs better than tesseract. But that's no reason for breaking the build on Mac. I will try to fix the compile issues on Mac when I get some time.
Nice to know this is improving speed on Mac as well.
The initial commit had a few race condition bugs that might be causing this. Once the compile issues are resolved, I would encourage you to try the latest commit once again. Ideally, the latest commit should produce output identical to the single-threaded version.

Sep 25 '24 15:09 jkarthic
