RFC: Best Practices re OPENMP - for training, evaluation and recognition
For Tesseract 5, what are the best practices regarding OPENMP?
Is it still true that:
- OPENMP is needed for training, so build tesseract and the training tools with `--enable-openmp`.
- For `lstmeval` (built with `--enable-openmp`), use `OMP_THREAD_LIMIT=1`.
- For recognition with `tesseract` (built with `--enable-openmp`), use `OMP_THREAD_LIMIT=1`.
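In command form, those practices would look roughly like this (a sketch; the model, list, and image names are placeholders):

```sh
# Build tesseract and the training tools with OpenMP support.
./configure --enable-openmp && make && make training

# Evaluation with an OpenMP build, capped at one thread.
OMP_THREAD_LIMIT=1 lstmeval --model eng.traineddata --eval_listfile list.eval

# Recognition with an OpenMP build, capped at one thread.
OMP_THREAD_LIMIT=1 tesseract image.png output
```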
OPENMP is not needed for training. It even makes things worse for me. Timing results for `lstm_squashed_test` on AMD EPYC 7502 show that no OPENMP (`--disable-openmp`) is best, followed by disabled OPENMP (`OMP_THREAD_LIMIT=1`). Enabled OPENMP comes last and burns a lot of CPU performance for nothing:
```
# --disable-openmp
real 28.41
user 28.33
sys 0.08

# --enable-openmp
real 33.16
user 129.41
sys 1.46

# --enable-openmp, OMP_THREAD_LIMIT=1
real 32.89
user 32.61
sys 0.28
```
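For reference, a rough sketch of how such a timing could be reproduced from a source checkout (the test binary path is an assumption about the autotools build layout):

```sh
# Build the unit tests, then time one run of the squashed LSTM test.
./configure --disable-openmp && make && make check TESTS=lstm_squashed_test
time -p ./unittest/lstm_squashed_test   # path may differ per build setup
```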
The plan is to disable it by default in 5.1.0.
... in autoconf builds. cmake already disables it by default.
Note that even without OPENMP, training uses up to two CPU threads: one for training, which runs until training is finished, and one for evaluation, which runs from time to time during the training process.
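On Linux, those two threads can be watched on a live run, e.g. with `ps` (the process name `lstmtraining` is an assumption about the trainer binary being used):

```sh
# Show per-thread CPU usage of the newest lstmtraining process.
ps -L -o pid,tid,pcpu,comm -p "$(pgrep -n lstmtraining)"
```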
The reason for disabling OpenMP is that Tesseract currently uses it inefficiently.
For text recognition, the speed benefit of using OpenMP with fast tessdata (best->int) traineddata is too small, while it consumes too many CPU resources.
For training, the OpenMP code is even more problematic than the code used for text recognition. I'm not sure how much speed will be lost here.
Thank you!
> no OPENMP is best, followed by disabled OPENMP

Does "no OPENMP" mean building with `--disable-openmp` as part of the autotools `configure`?
Yes, currently it is necessary to use `configure --disable-openmp`. As Amit has written above, that should be the default, but I still have no simple code to achieve that.
I updated my comment to be clearer.
`--disable-openmp` disables OpenMP at compile time, while `OMP_THREAD_LIMIT=1` disables it at runtime. The first method is more efficient, while the second method is more flexible.
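A minimal sketch of the two methods (the image name is a placeholder; the `--version` check is an assumption based on how recent tesseract builds report detected features):

```sh
# Compile time: the OpenMP code is not built at all.
./configure --disable-openmp && make

# Run time: OpenMP is built in but limited to one thread for this run.
OMP_THREAD_LIMIT=1 tesseract image.png output

# An OpenMP build normally reports a "Found OpenMP" line here; the version
# output quoted later in this thread has none, matching --disable-openmp.
tesseract --version
```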
Stefan, for 5.1.0, do you want to keep a way to enable OpenMP with `--enable-openmp`?
> OPENMP is not needed for training. It even makes things worse for me. Timing results for `lstm_squashed_test` on AMD EPYC 7502 show that no OPENMP (`--disable-openmp`) is best, followed by disabled OPENMP (`OMP_THREAD_LIMIT=1`). Enabled OPENMP comes last and burns a lot of CPU performance for nothing:
Ok. I will try to test training-from-fonts scenarios in my tess5train-fonts repo to see if they get similar results.
lstmeval - engFineTuned
Which `time` figures (real, user, sys) are important? Which scenario is preferable?
No OPENMP (`--disable-openmp`)
```
tesseract 5.0.1-19-g44ddde
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found NEON
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
 Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
```
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.631000_121_600.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.631000_121_600.eval.log
real 805.37
user 805.34
sys 0.03

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.028000_156_2000.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.028000_156_2000.eval.log
real 806.56
user 806.49
sys 0.07

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.558000_125_700.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.558000_125_700.eval.log
real 806.10
user 806.04
sys 0.07
```
Enabled OPENMP (older version of tesseract 5.0.1 built with `--enable-openmp`)
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.645000_119_600.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.645000_119_600.eval.log
real 331.53
user 1041.90
sys 9.02

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.119000_156_1500.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.119000_156_1500.eval.log
real 331.30
user 1042.38
sys 8.55

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.014000_165_2500.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.014000_165_2500.eval.log
real 331.70
user 1042.77
sys 8.97
```
lstmeval - engImpact
No OPENMP
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engImpact/tessdata_fast/engImpact_0.489000_152_900.traineddata \
  --eval_listfile data/engImpact/list.eval 2>&1 | grep "^BCER eval" > data/engImpact/tessdata_fast/engImpact_0.489000_152_900.eval.log
real 19.85
user 19.82
sys 0.04
```
Enabled OPENMP
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engImpact/tessdata_fast/engImpact_0.489000_152_900.traineddata \
  --eval_listfile data/engImpact/list.eval 2>&1 | grep "^BCER eval" > data/engImpact/tessdata_fast/engImpact_0.489000_152_900.eval.log
real 8.25
user 25.87
sys 0.27
```
> Which `time` figures (real, user, sys) are important? Which scenario is preferable?
"real" is the time spent from program start to termination. "user" and "sys" is the accumulated time used by all CPUs in user space / system space. For single threaded applications like Tesseract without OPENMP "real" is normally equal to the sum of "user" and "sys". "real" can also be much larger if the execution is delayed, for example by other applications running simultaneously.
In your test scenario, `lstmeval` was much faster with OPENMP enabled ("real" is 331 s instead of 805 s), so you'd prefer that to get a result fast. The CPU resources used were somewhat higher with OPENMP ("user" 1042 s and "sys" 9 s instead of about 805 s / 0.05 s), so the faster execution costs some (acceptable) overhead in this case.
> for 5.1.0, do you want to keep a way to enable OpenMP with `--enable-openmp`?
Yes, I think that's necessary for compatibility and also because it can be useful, as in @Shreeshrii's test case on ARM.
Running Tesseract with several threads seems to work better on ARM than on Intel architectures. I noticed that with Apple M1 (AARCH64), too.
> Running Tesseract with several threads seems to work better on ARM than on Intel architectures. I noticed that with Apple M1 (AARCH64), too.
I am running this on AARCH64.
Also, my tests show that enabled OPENMP could make sense in some cases (e.g. for the best data model on Windows with MSVC 2019 and an Intel processor). It would be great if we found somebody familiar with OpenMP, at least to review how Tesseract uses it...
My timings for OpenMP on Windows MSVC are at the end of issue #3044.
Thanks, @tdhintz
It would be good to know if the results still hold. If possible, please rerun the tests with the released tesseract 5 version or the latest GitHub version, since there have been many changes since 2020.
@Shreeshrii I'll add that task to our plan for late March. We build with very specific settings to get best results and I'm sure the build process has changed again, so this will be a heavy lift.
Looks like someone did this already: OpenMP benchmark
> Looks like someone did this already: OpenMP benchmark
That test by @zdenop uses one image 15 times. Your tests use many more combinations.
We ran a comparison between a pre-release of 4.0 and the current 5.0 on AVX2 and SSE hardware on Windows that I'll share just for grins. The 4.0 was built with floating point set to fast, COMDAT folding, and OpenMP, and was PGO-optimized. The 5.0 build also used floating point 'fast' and COMDAT folding, but without OpenMP and without PGO optimization. 2,880 combinations of settings and images were tested for each of the AVX2 and SSE platforms. The tests are by no means comprehensive of all possible combinations. For example, only eng traineddata was used, although the Fast, Best, and Blended data were all used.
> this will be a heavy lift.
I understand.
If possible, the results could be added to tessdoc for easy reference. Thanks.