Different results on Debian machines compared to Windows & Mac
Environment
- Tesseract Version: 5.0.1
- Commit Number: 424b17f997363670d187f42c43408c472fe55053
- Platform: Linux girid 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 x86_64 GNU/Linux
Current Behavior:
I am using the following version of Tesseract on a Debian machine.
```
$ tesseract --version
tesseract 5.0.1-42-g424b
 leptonica-1.83.0
  libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11
Found AVX2
Found AVX
Found FMA
Found SSE4.1
```
I am trying to match its accuracy across the three platforms (Windows, Mac, and Debian) for my application. However, I noticed that Debian produces different results compared to the other two platforms.
I have attached a sample image for which the results differ. I tried disabling the optimizations related to AVX2, AVX, FMA, and SSE4.1, but that did not help.
Expected Behavior:
Ideally, the models should produce the same result on the same image across platforms.
I just have run OCR for that image on MacOS arm64 and Debian x86_64. Here is the result:
```diff
--- debian/167471110-60ca8f1d-9db7-4eb1-ae3a-f82b770429f3.txt 2022-05-09 20:23:07.660615639 +0200
+++ mac/167471110-60ca8f1d-9db7-4eb1-ae3a-f82b770429f3.txt 2022-05-09 20:19:07.000000000 +0200
@@ -9,7 +9,7 @@
 The Tesseract OCR engine, as was the HP Research
 Prototype in the UNLV Fourth Annual Test of OCR
 Accuracy{I}, is described in a comprehensive
-overview, Emphasis is placed on aspects that are novel
+overview. Emphasis is placed on aspects that are novel
 or at least unusual in an OCR engine, including in
 particular the line finding, features/classification
 methods, and the adaptive classifier.
```
So indeed the results are different: on Debian a comma is detected instead of the correct period. A single difference among nearly 4000 identical characters is acceptable in my opinion.
When I compare the hOCR results, I see more differences: 106 word results out of 636 differ, most of them having slightly different `x_wconf` values.
Such differences can be explained by different implementations of floating-point calculations in the hardware and in software libraries. If you are using Tesseract with OpenMP multithreading, the order of calculations includes more randomness, so more differing results can be expected. In my test on Debian, OpenMP had no effect on the results. It also did not matter whether Tesseract was built with g++ or clang.
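As a generic illustration (not Tesseract's actual code) of why the order of floating-point operations matters, addition is not associative in IEEE 754 arithmetic, so anything that changes the grouping of a sum, such as an OpenMP reduction scheduling work across threads, can change the last bits of a result:

```python
# Floating-point addition is not associative: the same three values
# summed in a different grouping produce different binary64 results.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

In a neural-net dot product these tiny discrepancies can tip two near-equal character scores in opposite directions, which is how a single comma/period flip can emerge from bit-level noise.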
@stweil This particular example and some other images containing English text show only minor differences, but when I run on other languages like Russian or Japanese, I see major differences in the detection.
I forgot to mention that I had tried disabling OpenMP as well and it didn't make any difference.
For example, the following image with Russian text has a 13-character difference out of 1693 characters on Debian compared to Windows or Mac.
Do you think these differences are also because of different implementations of floating point calculations?
If this issue is technically not considered a bug, is it possible to document these differences in terms of some metrics like expected character error?
> So indeed the results are different: on Debian a comma is detected instead of the correct period. A single difference among nearly 4000 identical characters is acceptable in my opinion.
There is also image processing with interpolation and rounding involved.
> When I compare the hOCR results, I see more differences: 106 word results out of 636 differ, most of them having slightly different `x_wconf` values.
`x_wconf` is an approximate measure (OCR as a whole is approximate reasoning). For testing, an acceptable tolerance should be defined. Of course, a similarity measure should fulfill the triangle inequality and do so consistently across platforms.
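A minimal sketch of such a tolerance-based comparison, using made-up hOCR snippets and a simple regex rather than a full hOCR parser:

```python
import re

# Hypothetical minimal hOCR word spans, not real Tesseract output.
HOCR_A = '<span class="ocrx_word" title="bbox 0 0 10 10; x_wconf 96">word</span>'
HOCR_B = '<span class="ocrx_word" title="bbox 0 0 10 10; x_wconf 95">word</span>'

def wconfs(hocr: str) -> list:
    """Extract all x_wconf confidence values from an hOCR string."""
    return [int(m) for m in re.findall(r"x_wconf (\d+)", hocr)]

def within_tolerance(a: str, b: str, tol: int = 2) -> bool:
    """True if both outputs have the same word count and every paired
    word confidence differs by at most `tol`."""
    ca, cb = wconfs(a), wconfs(b)
    return len(ca) == len(cb) and all(abs(x - y) <= tol for x, y in zip(ca, cb))

print(within_tolerance(HOCR_A, HOCR_B))         # True  (96 vs 95, tol=2)
print(within_tolerance(HOCR_A, HOCR_B, tol=0))  # False (exact match required)
```

The tolerance value itself would have to be calibrated against the variation actually observed across the platforms being compared.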
> @stweil This particular example and some other images containing English text show only minor differences, but when I run on other languages like Russian or Japanese, I see major differences in the detection.
This can also be a difference in the quality of the trained models. If the Russian model is less precise than the English one, there are more disambiguation alternatives with similar scores and rounding/precision differences happen more often.
To isolate the reason, all other influences like image quality should be kept the same.
> For example, the following image with Russian text has a 13-character difference out of 1693 characters on Debian compared to Windows or Mac.
13/1693 = 0.7% difference. In which direction? CER higher, lower, or same?
0.7% is not much. In some cases with historical pages I get CER 0.1 % and with a small variation of one parameter the results break apart with CER > 15%. Same model, same image, same version, same architecture.
> If this issue is technically not considered a bug, is it possible to document these differences in terms of some metrics like expected character error?
Theoretically it should be possible if the precision error of floats is known. In practice it can happen in many places and the errors combine, so it is only measurable statistically.
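To document such differences statistically, one could compute the character error rate (CER) between two platforms' outputs over a test corpus. A minimal sketch using edit distance (the function names here are illustrative, not part of Tesseract):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via
    dynamic programming with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# One substitution (',' vs '.') in an 18-character string -> 1/18
print(round(cer("overview. Emphasis", "overview, Emphasis"), 3))  # 0.056
```

Running this over many page pairs would give a distribution of cross-platform CER values (mean, spread) that could be documented as the expected variation.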
I'm facing the same issue, the output for certain documents looks great on Windows, but not so great on Linux. I'm using 5.2.0 on both, and I made sure to use the latest 'best' traineddata files for the languages I needed while compiling on Linux. Unsure if the Windows installer also uses the same files.
The Windows installer installs the fast variant.
That is good to know. I will attempt to rebuild my Linux environment with the fast variants and compare. Thanks for the quick response!
Side note / question: I know most of these models are black boxes, but why is the fast version better than the best version? Is that expected?
Fast is much faster than best and typically gives similar results: sometimes best is better, but sometimes fast is better, so there is no clear winner for recognition. Training requires best, so that is the main reason why best is needed.
It's possible to install both variants simultaneously. Just use different directories (for example `fast/eng.traineddata`, `best/eng.traineddata`) and add the directory name in the language parameter (`-l fast/eng`, `-l best/eng`).
Thanks! That's very helpful.
I switched to using the fast variants in my Linux environment but I still see the Windows installation performing better. I have the same version of Tesseract: 5.2.0.20220712 on Windows and 5.2.0 on Linux.