RFC: Best Practices re OPENMP - for training, evaluation and recognition
For Tesseract 5, what are the best practices regarding OPENMP?
Is it still true that:
- OPENMP is needed for training, so build tesseract and the training tools with `--enable-openmp`.
- For `lstmeval` (built with `--enable-openmp`), use `OMP_THREAD_LIMIT=1`.
- For recognition with `tesseract` (built with `--enable-openmp`), use `OMP_THREAD_LIMIT=1`.
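In command form, those practices would look roughly like this (a sketch; the model, list, and image names are placeholders):

```sh
# Build tesseract and the training tools with OpenMP support.
./configure --enable-openmp && make && make training

# Evaluation with an OpenMP build, capped at one thread.
OMP_THREAD_LIMIT=1 lstmeval --model eng.traineddata --eval_listfile list.eval

# Recognition with an OpenMP build, capped at one thread.
OMP_THREAD_LIMIT=1 tesseract image.png output
```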
OPENMP is not needed for training. It even makes things worse for me. Timing results for `lstm_squashed_test` on AMD EPYC 7502 show that no OPENMP (`--disable-openmp`) is best, followed by disabled OPENMP (`OMP_THREAD_LIMIT=1`). Enabled OPENMP comes last and burns a lot of CPU performance for nothing:
```
# --disable-openmp
real 28.41
user 28.33
sys 0.08

# --enable-openmp
real 33.16
user 129.41
sys 1.46

# --enable-openmp, OMP_THREAD_LIMIT=1
real 32.89
user 32.61
sys 0.28
```
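For reference, a rough sketch of how such a timing could be reproduced from a source checkout (the test binary path is an assumption about the autotools build layout):

```sh
# Build the unit tests, then time one run of the squashed LSTM test.
./configure --disable-openmp && make && make check TESTS=lstm_squashed_test
time -p ./unittest/lstm_squashed_test   # path may differ per build setup
```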
The plan is to disable it by default in 5.1.0.
... in autoconf builds. cmake already disables it by default.
Note that even without OPENMP, training uses up to two CPU threads: one for training, which runs until training is finished, and one for evaluation, which runs from time to time during the training process.
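On Linux, those two threads can be watched on a live run, e.g. with `ps` (the process name `lstmtraining` is an assumption about the trainer binary being used):

```sh
# Show per-thread CPU usage of the newest lstmtraining process.
ps -L -o pid,tid,pcpu,comm -p "$(pgrep -n lstmtraining)"
```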
The reason for disabling OpenMP is that Tesseract currently uses it inefficiently.
For text recognition, the speed benefit of using OpenMP with fast tessdata (best->int) traineddata is too small, while it consumes too many CPU resources.
For training, the OpenMP code is even more problematic than the code used for text recognition. I'm not sure how much speed will be lost here.
Thank you!
> no OPENMP is best, followed by disabled OPENMP

Does "no OPENMP" mean building with `--disable-openmp` as part of the autotools `configure`?
Yes, currently it is necessary to use `configure --disable-openmp`. As Amit has written above, that should be the default, but I still have no simple code to achieve that.
I updated my comment to be clearer.
`--disable-openmp` disables OpenMP at compile time, while `OMP_THREAD_LIMIT=1` disables it at runtime. The first method is more efficient, while the second method is more flexible.
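A minimal sketch of the two methods (the image name is a placeholder; the `--version` check is an assumption based on how recent tesseract builds report detected features):

```sh
# Compile time: the OpenMP code is not built at all.
./configure --disable-openmp && make

# Run time: OpenMP is built in but limited to one thread for this run.
OMP_THREAD_LIMIT=1 tesseract image.png output

# An OpenMP build normally reports a "Found OpenMP" line here; the version
# output quoted later in this thread has none, matching --disable-openmp.
tesseract --version
```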
Stefan, for 5.1.0, do you want to keep a way to enable OpenMP with `--enable-openmp`?
> OPENMP is not needed for training. It even makes things worse for me. Timing results for `lstm_squashed_test` on AMD EPYC 7502 show that no OPENMP (`--disable-openmp`) is best, followed by disabled OPENMP (`OMP_THREAD_LIMIT=1`). Enabled OPENMP comes last and burns a lot of CPU performance for nothing:
Ok. I will try to test training-from-fonts scenarios in my tess5train-fonts repo to see if they get similar results.
lstmeval - engFineTuned
Which `time` figures (real, user, sys) are important? Which scenario is preferable?
No OPENMP (`--disable-openmp`)
```
tesseract 5.0.1-19-g44ddde
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found NEON
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
 Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
```
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.631000_121_600.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.631000_121_600.eval.log
real 805.37
user 805.34
sys 0.03

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.028000_156_2000.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.028000_156_2000.eval.log
real 806.56
user 806.49
sys 0.07

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.558000_125_700.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.558000_125_700.eval.log
real 806.10
user 806.04
sys 0.07
```
Enabled OPENMP (older version of tesseract 5.0.1 built with `--enable-openmp`)
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.645000_119_600.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.645000_119_600.eval.log
real 331.53
user 1041.90
sys 9.02

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.119000_156_1500.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.119000_156_1500.eval.log
real 331.30
user 1042.38
sys 8.55

time -p lstmeval \
  --verbosity=0 \
  --model data/engFineTuned/tessdata_fast/engFineTuned_0.014000_165_2500.traineddata \
  --eval_listfile data/engFineTuned/list.eval 2>&1 | grep "^BCER eval" > data/engFineTuned/tessdata_fast/engFineTuned_0.014000_165_2500.eval.log
real 331.70
user 1042.77
sys 8.97
```
lstmeval - engImpact
No OPENMP
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engImpact/tessdata_fast/engImpact_0.489000_152_900.traineddata \
  --eval_listfile data/engImpact/list.eval 2>&1 | grep "^BCER eval" > data/engImpact/tessdata_fast/engImpact_0.489000_152_900.eval.log
real 19.85
user 19.82
sys 0.04
```
Enabled OPENMP
```
time -p lstmeval \
  --verbosity=0 \
  --model data/engImpact/tessdata_fast/engImpact_0.489000_152_900.traineddata \
  --eval_listfile data/engImpact/list.eval 2>&1 | grep "^BCER eval" > data/engImpact/tessdata_fast/engImpact_0.489000_152_900.eval.log
real 8.25
user 25.87
sys 0.27
```
> Which `time` figures (real, user, sys) are important? Which scenario is preferable?
"real" is the time spent from program start to termination. "user" and "sys" is the accumulated time used by all CPUs in user space / system space. For single threaded applications like Tesseract without OPENMP "real" is normally equal to the sum of "user" and "sys". "real" can also be much larger if the execution is delayed, for example by other applications running simultaneously.
In your test scenario, `lstmeval` was much faster with OPENMP enabled ("real" is 331 s instead of 805 s), so you'd prefer that to get a result fast. The CPU resources used were somewhat higher with OPENMP ("user" 1042 s and "sys" 9 s instead of about 805 s / 0.05 s), so the faster execution costs some (acceptable) overhead in this case.
> for 5.1.0, do you want to keep a way to enable OpenMP with `--enable-openmp`?
Yes, I think that's necessary for compatibility and also because it can be useful, as in @Shreeshrii's test case on ARM.
Running Tesseract with several threads seems to work better on ARM than on Intel architectures. I noticed that with Apple M1 (AARCH64), too.
> Running Tesseract with several threads seems to work better on ARM than on Intel architectures. I noticed that with Apple M1 (AARCH64), too.
I am running this on AARCH64.
Also, my tests show that enabled OPENMP could make sense in some cases (e.g. for the best data model on Windows with MSVC 2019 and an Intel processor). It would be great if we found somebody familiar with OpenMP, at least to review how Tesseract uses it...
My timings for OpenMP on Windows MSVC are at the end of issue #3044.
Thanks, @tdhintz
It would be good to know if the results still hold. If possible, please rerun the tests with the released tesseract 5 version or the latest GitHub version, since there have been many changes since 2020.
@Shreeshrii I'll add that task to our plan for late March. We build with very specific settings to get best results and I'm sure the build process has changed again, so this will be a heavy lift.
Looks like someone did this already: OpenMP benchmark
> Looks like someone did this already: OpenMP benchmark
That test by @zdenop uses one image 15 times. Your tests use many more combinations.
We ran a comparison between a pre-release of 4.0 and the current 5.0 on AVX2 and SSE hardware on Windows that I'll share just for grins. The 4.0 was built with floating point set to fast, COMDAT folding, and OpenMP, and was PGO-optimized. The 5.0 build also used floating point 'fast' and COMDAT folding, but without OpenMP and without PGO optimization. 2,880 combinations of settings and images were tested for each of the AVX2 and SSE platforms. The tests are by no means comprehensive of all possible combinations. For example, only eng traineddata was used, although the Fast, Best, and Blended data were all used.
> this will be a heavy lift.
I understand.
If possible, the results could be added to tessdoc for easy reference. Thanks.