
Different results on debian machines compared to windows & mac

Open giri-kum opened this issue 2 years ago • 11 comments

Environment

  • Tesseract Version: 5.0.1
  • Commit Number: 424b17f997363670d187f42c43408c472fe55053
  • Platform: Linux girid 4.19.0-20-amd64 # 1 SMP Debian 4.19.235-1 x86_64 GNU/Linux

Current Behavior:

I am using the following version of tesseract on a debian machine.

$ tesseract --version
tesseract 5.0.1-42-g424b
 leptonica-1.83.0
  libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1

And I am trying to match its accuracy across the three platforms (Windows, macOS, and Debian) for my application. However, I noticed that Debian produces different results compared to the other two platforms.

I have attached a sample image for which the results differ. I tried disabling the optimizations related to AVX2, AVX, FMA and SSE, but that did not help.
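(For anyone else trying this: if I remember correctly, Tesseract 5 exposes a runtime `dotproduct` config variable that selects the dot-product implementation, so the SIMD paths can be ruled out without rebuilding. The image and output names below are placeholders.)

```shell
# Force the plain C dot-product implementation instead of AVX2/AVX/FMA/SSE.
# 'dotproduct' should appear in `tesseract --print-parameters`; if it does
# not, this build is too old for the runtime switch.
tesseract sample.png sample_out -c dotproduct=generic
```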

Expected Behavior:

Ideally, the models should produce the same result on the same image across platforms.

[attached image: longScannedDoc]

giri-kum avatar May 09 '22 18:05 giri-kum

I have just run OCR for that image on macOS arm64 and Debian x86_64. Here is the result:

--- debian/167471110-60ca8f1d-9db7-4eb1-ae3a-f82b770429f3.txt	2022-05-09 20:23:07.660615639 +0200
+++ mac/167471110-60ca8f1d-9db7-4eb1-ae3a-f82b770429f3.txt	2022-05-09 20:19:07.000000000 +0200
@@ -9,7 +9,7 @@
 The Tesseract OCR engine, as was the HP Research
 Prototype in the UNLV Fourth Annual Test of OCR
 Accuracy{I}, is described in a comprehensive
-overview, Emphasis is placed on aspects that are novel
+overview. Emphasis is placed on aspects that are novel
 or at least unusual in an OCR engine, including in
 particular the line finding, features/classification
 methods, and the adaptive classifier.

So indeed the results are different: on Debian a comma is detected instead of the correct period. A single difference among nearly 4000 otherwise identical characters is acceptable in my opinion.

When I compare the hOCR results, I see more differences. 106 word results from a total of 636 differ, most of them having slightly different x_wconf values.

Such differences can be explained by different implementations of floating point calculations in the hardware and in software libraries. If you are using Tesseract with OpenMP multithreading, the order of calculations includes more randomness, so more different results can be expected. In my test on Debian OpenMP had no effect on the results. It also did not matter whether tesseract was built with g++ or clang.
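As a toy illustration of the floating-point point (not Tesseract code, just plain double arithmetic in awk): floating-point addition is not associative, so summing the same values in a different order, as different SIMD paths or thread schedules effectively do, can produce a different result.

```shell
# Same four numbers, two summation orders, two different answers.
# Left to right, 1e16 + 1.0 rounds back to 1e16, so both 1.0s vanish;
# cancelling the big values first lets the small ones survive.
awk 'BEGIN {
  a = ((1e16 + 1.0) + -1e16) + 1.0
  b = (1e16 + -1e16) + (1.0 + 1.0)
  printf "%g %g\n", a, b
}'
# prints: 1 2
```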

stweil avatar May 09 '22 18:05 stweil

@stweil This particular example and some other images containing English text show only minor differences, but when I run on other languages such as Russian or Japanese, I see major differences in the detection.

I forgot to mention that I had tried disabling OpenMP as well and it didn't make any difference.

For example, the following image with Russian text has 13 character differences out of 1693 characters on Debian compared to Windows or macOS.

[attached image: russian]

Do you think these differences are also because of different implementations of floating point calculations?

If this issue is technically not considered a bug, is it possible to document these differences in terms of some metrics like expected character error?

giri-kum avatar May 09 '22 20:05 giri-kum

So indeed the results are different: on Debian a comma is detected instead of the correct period. A single difference among nearly 4000 otherwise identical characters is acceptable in my opinion.

There is also image processing with interpolation and rounding involved.

When I compare the hOCR results, I see more differences. 106 word results from a total of 636 differ, most of them having slightly different x_wconf values.

x_wconf is an approximate measure (OCR is, after all, approximate reasoning). For testing, an acceptable tolerance should be defined. Of course, a similarity measure should satisfy the triangle inequality and do so consistently across platforms.

wollmers avatar May 10 '22 11:05 wollmers

@stweil This particular example and some other images containing English text show only minor differences, but when I run on other languages such as Russian or Japanese, I see major differences in the detection.

This can also be a difference in the quality of the trained models. If the Russian model is less precise than the English one, there are more disambiguation alternatives with similar scores, and rounding/precision differences matter more often.

To isolate the cause, all other influences such as image quality should be kept the same.

For example, the following image with russian text has 13 characters difference out of the 1693 characters in debian compared to that in windows or mac.

13/1693 ≈ 0.77% difference. In which direction? Is the CER higher, lower, or the same?

0.77% is not much. In some cases with historical pages I get a CER of 0.1%, and with a small variation of one parameter the results break apart with a CER above 15%. Same model, same image, same version, same architecture.

If this issue is technically not considered a bug, is it possible to document these differences in terms of some metrics like expected character error?

Theoretically it should be possible if the precision error of the floats were known. In practice, errors can arise in many places and combine with each other, so they are only measurable statistically.
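For a quick statistical sanity check between two runs, something simple like the following works. Note this is not a true CER, which needs an edit distance; `cmp` only counts byte positions that differ, so a single insertion or deletion would throw the alignment off. The file contents here reuse the comma/period example from above.

```shell
# Count differing byte positions between two OCR outputs of equal length.
printf 'overview, Emphasis is placed' > debian.txt
printf 'overview. Emphasis is placed' > mac.txt
diffs=$(cmp -l debian.txt mac.txt | wc -l)   # one line per differing byte
total=$(wc -c < debian.txt)
echo "$diffs differing characters out of $total"
# prints: 1 differing characters out of 28
```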

wollmers avatar May 10 '22 12:05 wollmers

I'm facing the same issue: the output for certain documents looks great on Windows, but not so great on Linux. I'm using 5.2.0 on both, and I made sure to use the latest 'best' traineddata files for the languages I need when compiling on Linux. I'm not sure whether the Windows installer uses the same files.

vikrambajaj22 avatar Aug 23 '22 15:08 vikrambajaj22

The Windows installer installs the fast variant.

stweil avatar Aug 23 '22 15:08 stweil

That is good to know. I will attempt to rebuild my Linux environment with the fast variants and compare. Thanks for the quick response!

Side note / question: I know most of these models are black boxes, but why is the fast version better than the best version? Is that expected?

vikrambajaj22 avatar Aug 23 '22 16:08 vikrambajaj22

Fast is much faster than best and typically gives similar results: sometimes best is better, but sometimes fast is better, so there is no clear winner for recognition. Training requires best, so that is the main reason why best is needed.

stweil avatar Aug 23 '22 16:08 stweil

It's possible to install both variants simultaneously. Just use different directories (for example fast/eng.traineddata, best/eng.traineddata) and add the directory name in the language parameter (-l fast/eng, -l best/eng).
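Sketched out, assuming the standard tessdata layout and the GitHub model repositories (the tessdata path, the `main` branch name, and the image/output names may differ on your setup):

```shell
# Download both variants into subdirectories of the tessdata directory.
TESSDATA="/usr/local/share/tessdata"   # adjust to your installation
mkdir -p "$TESSDATA/fast" "$TESSDATA/best"
wget -O "$TESSDATA/fast/eng.traineddata" \
  https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata
wget -O "$TESSDATA/best/eng.traineddata" \
  https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata

# Select a variant per run via the language parameter:
tesseract image.png out-fast -l fast/eng
tesseract image.png out-best -l best/eng
```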

stweil avatar Aug 23 '22 16:08 stweil

Thanks! That's very helpful.

vikrambajaj22 avatar Aug 23 '22 16:08 vikrambajaj22

I switched to the fast variants in my Linux environment, but I still see the Windows installation performing better. I have the same version of Tesseract: 5.2.0.20220712 on Windows and 5.2.0 on Linux.

vikrambajaj22 avatar Aug 23 '22 19:08 vikrambajaj22