
Same Version & Different Results Ubuntu vs. Mac

Open benjaminlbz opened this issue 4 years ago • 22 comments

Hi! I'm running tesseract on the same image, on my local computer (mac) and on an ec2 instance with ubuntu 20.04.

I made sure to install the same version of tesseract, see below output of tesseract --version, on my machine (on top) vs. ec2 (bottom of screenshot)

Yet I get slightly different results when I run tesseract on the same image.

Any idea why this is happening? Is there a dependency of tesseract that has a different version on ec2 vs. my laptop?

Also, if you know of a better ec2 instance type than ubuntu 20.04 for tesseract, please let me know.

thank you!

tesseract version

benjaminlbz avatar Dec 29 '20 17:12 benjaminlbz

The obvious difference is the SIMD support in these two systems. AVX2 and FMA vs. SSE (SSE4.1).

I believe this can explain the "slightly different results".

Did you use exactly the same parameters and traineddata file on these two systems?
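To illustrate why different SIMD code paths can shift results slightly, here is a minimal Python sketch. It is not Tesseract code: `dot_sequential` and `dot_four_lanes` are hypothetical helpers that only mimic accumulating a dot product in a different order, which is one way an AVX2/FMA build and an SSE build can end up with results that differ in the last bits.

```python
# Floating-point addition is not associative, so the order in which a
# dot product is accumulated can change the last bits of the result.
import random

# Smallest well-known example: the same three numbers, grouped differently.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)


def dot_sequential(a, b):
    """Accumulate the products one by one, left to right."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_four_lanes(a, b):
    """Accumulate into four partial sums, loosely mimicking a 4-wide SIMD register."""
    lanes = [0.0, 0.0, 0.0, 0.0]
    for i, (x, y) in enumerate(zip(a, b)):
        lanes[i % 4] += x * y
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3])


random.seed(0)
a = [random.uniform(-1, 1) for _ in range(10_000)]
b = [random.uniform(-1, 1) for _ in range(10_000)]
s1 = dot_sequential(a, b)
s2 = dot_four_lanes(a, b)
# Any difference is tiny, but a recognizer that compares scores can
# occasionally pick a different character because of it.
print(s1, s2, abs(s1 - s2))
```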

amitdo avatar Dec 29 '20 21:12 amitdo

Thank you for your response. Yes, it's the same code.

Is there a way to make the SIMD support the same as on my local machine (i.e. SSE only, if I understand correctly)?

My goal is to replicate exactly the same output from my Mac on the ec2 instance with Ubuntu.

thank you!

benjaminlbz avatar Dec 29 '20 23:12 benjaminlbz

You can try to remove this block from configure.ac:

https://github.com/tesseract-ocr/tesseract/blob/2fe1532926ec3ab17715e927045e88e7ae70b316/configure.ac#L137-L153

I'm not sure whether this will break compilation.

amitdo avatar Dec 30 '20 20:12 amitdo

The difference might also be caused by multithreading: the scalar product calculations are run in 4 parallel threads by default, and the exact result can depend on the scheduling of the calculation. I'd try single-threaded runs first (set the environment variable OMP_THREAD_LIMIT=1). Tesseract uses multithreading on Ubuntu by default, while on Mac it normally runs single-threaded.
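When tesseract is driven from Python, the single-threaded run can be forced per call rather than for the whole shell. A minimal sketch, assuming the `tesseract` binary is on the PATH; `run_single_threaded` is a hypothetical helper name, not part of any library:

```python
import os
import subprocess


def run_single_threaded(cmd):
    """Run cmd with OMP_THREAD_LIMIT=1 in its environment and return its stdout."""
    # Copy the current environment and override only the OpenMP thread limit.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout


# Example call (file name is a placeholder; "stdout" as the output base
# makes tesseract print the recognized text instead of writing a file):
# text = run_single_threaded(["tesseract", "test_ocr.png", "stdout", "--psm", "6"])
```

Running the same call on both machines at least rules out thread scheduling as the source of the difference.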

stweil avatar Dec 30 '20 20:12 stweil

@benjaminlbz: please also answer the rest of amitdo's question...

zdenop avatar Dec 31 '20 11:12 zdenop

Thank you for your response. @amitdo: I'm using the same parameters and the same traineddata file. Regarding the configure.ac file, could you let me know how/where I can access this file ? @stweil: I tried setting OMP_THREAD_LIMIT to 1 but still got different results.

thank you!

benjaminlbz avatar Jan 02 '21 16:01 benjaminlbz

@benjaminlbz, can you provide a test image which shows the problem and the complete command line which you used? Ideally it should be possible for me and others to reproduce the problem.

stweil avatar Jan 02 '21 17:01 stweil

Sure. I'm using pytesseract to get the output in a data frame, but here is a simple example of an image and command line that gives slightly different output (I'm using psm 6 because that's what I use in the pytesseract command to get better results):

Command line: tesseract test_ocr.png test_ocr_ --psm 6

Text outputs: (see difference at the end of the file) test_ocr_mac.txt test_ocr_ubuntu.txt
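For anyone comparing the two text files, a minimal sketch using Python's standard `difflib` shows exactly which lines differ. `diff_outputs` is a hypothetical helper, and the file names in the commented usage are the attachments named above:

```python
import difflib


def diff_outputs(text_a, text_b, name_a="mac", name_b="ubuntu"):
    """Return the unified-diff lines between two OCR outputs (empty list if identical)."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Example usage with the two attached outputs:
# with open("test_ocr_mac.txt", encoding="utf-8") as f_mac, \
#      open("test_ocr_ubuntu.txt", encoding="utf-8") as f_ubuntu:
#     print("\n".join(diff_outputs(f_mac.read(), f_ubuntu.read())))
```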

test_ocr

Thank you for your help!

benjaminlbz avatar Jan 02 '21 17:01 benjaminlbz

Which traineddata did you use (best/fast/tessdata)?

zdenop avatar Jan 02 '21 17:01 zdenop

I used tesseract as is after installing it, so whatever is default, my guess is tessdata ?

benjaminlbz avatar Jan 02 '21 19:01 benjaminlbz

;-) AFAIK Ubuntu uses fast, no clue about Mac. Check the file size of the traineddata. Try to use best on both installations.
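File size alone can be compared by eye, but a checksum makes the comparison unambiguous. A minimal sketch, where `fingerprint` is a hypothetical helper and the path in the commented usage is only an assumption (the tessdata location differs between installations):

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) of a traineddata file."""
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    return p.stat().st_size, digest


# Example usage; run on both machines and compare the printed values.
# The path below is an assumed location, adjust it to your installation:
# print(fingerprint("/usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata"))
```

If both machines print the same pair, the model file can be ruled out as the cause.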

zdenop avatar Jan 02 '21 20:01 zdenop

I tried using the best traineddata files on both but still got different results. I also tried using the original traineddata files from the Mac on Ubuntu and unfortunately still got different results. Have you been able to replicate the example I sent?

@amitdo thank you for your response. I want to try removing this part from configure.ac. Could you let me know the instructions to modify this file ?

thank you!

benjaminlbz avatar Jan 03 '21 16:01 benjaminlbz

Regarding the configure.ac file, could you let me know how/where I can access this file ?

I want to try removing this part from configure.ac. Could you let me know the instructions to modify this file ?

  • Do you have the source code? If not, download it or use git to get it.
  • In the root source directory you downloaded, find the configure.ac file and open it. Remove the lines I mentioned and save the file.
  • Compile the code.
  • Run tesseract.

If you have more questions, please use the forum.

I have limited time, so I won't be able to help you with more newbie questions.

amitdo avatar Jan 03 '21 17:01 amitdo

I can reproduce the Ubuntu result on Debian GNU Linux and the Mac result on macOS with AVX2.

Note that git master gives a result which differs from the 4.1.1 result.

stweil avatar Jan 03 '21 19:01 stweil

@benjaminlbz, I doubt that --psm 6 is a good choice for your image. It implies a single uniform block of text. Your image has three columns with misaligned lines in the different columns.

stweil avatar Jan 03 '21 20:01 stweil

@stweil fully agree for this image. But I'm mostly working on images with tables and numbers, for which psm 6 works better. The image I uploaded is just an example to show the difference between Mac and Ubuntu.

benjaminlbz avatar Jan 05 '21 12:01 benjaminlbz

stweil commented on Jan 3

I can reproduce the Ubuntu result on Debian GNU Linux and the Mac result on macOS with AVX2.

Mac with x86-64 or ARM64 CPU?

With 1 thread on both OSes?

You can try to install GCC on macOS and compile Tesseract with it. Maybe this will produce a different result.

amitdo avatar Feb 04 '21 04:02 amitdo

I have encountered the problem of different results on the same machine. There seems to be some uncleared state in the TessBaseAPI. When processing multiple files, the same order will produce the same result, but shuffling will produce different results for the same image. It makes a difference mainly in the results of the diplopia issue.

Simple reproduction code with tesserocr:

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, tesseract_version

if __name__ == "__main__":
    TESSDATA_DIR="/home/nagadomi/dev/tesseract-git/tessdata_fast"
    test_image = Image.open("test1.png")

    print(tesseract_version(), "\n")

    print("* case1 API re-use")
    with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
        api.SetVariable("preserve_interword_spaces", "1")
        variants = set()
        for t in range(100):
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
        print(f"{len(variants)} different results")
        print("----\n".join(variants))

    print("* case2 API re-create")
    variants = set()
    for t in range(100):
        with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
            api.SetVariable("preserve_interword_spaces", "1")
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
    print(f"{len(variants)} different results")
    print("----\n".join(variants))

test1.png test1

result:

% OMP_THREAD_LIMIT=1 python3 bug_report.py
tesseract 5.0.0-alpha-20210401-130-g7a308
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0 

* case1 API re-use
2 different results
パバイボ
パバイボ
パイポの
シューリンガン
----
パバイボ
パバイボ
パバイポの
シューリンガン

* case2 API re-create
1 different results
パバイボ
パバイボ
パバイポの
シューリンガン

nagadomi avatar Jun 06 '21 07:06 nagadomi

@nagadomi,

The original issue is about different results on different operating systems. You are reporting about different results on the same operating system.

Please open a new issue and copy your report to that issue.

amitdo avatar Jun 06 '21 16:06 amitdo

Any update on this issue? I am running into the same problem where I get significantly worse results on Ubuntu compared to MacOS when testing on the same image.

Kojon74 avatar Mar 24 '22 23:03 Kojon74

@Kojon74 What's the tesseract version on each os? Which traineddata is used on each? What's the hardware configuration? Please provide complete information as well as the test image.

Shreeshrii avatar Mar 25 '22 01:03 Shreeshrii

Any update on this issue? I am running into the same problem where I get significantly worse results on Ubuntu compared to MacOS when testing on the same image.

I had the same issue. I had installed tesseract on my Mac using brew, which installed version 5.0.1, while on Linux apt-get installs version 4.1.1, which gives different results. Update tesseract to 5.x and it will give consistent results. Here is how to update it on Debian: https://techviewleo.com/install-and-use-tesseract-ocr-on-debian/
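A quick way to catch this kind of mismatch is to compare the parsed version on both machines before comparing OCR output. A minimal sketch, where `parse_tesseract_version` is a hypothetical helper that works on the text printed by `tesseract --version` (the first line is assumed to look like the outputs quoted earlier in this thread):

```python
import re


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from the first line of `tesseract --version` output."""
    first_line = version_output.strip().splitlines()[0]
    # Some builds print a leading "v" before the number, so allow it.
    match = re.match(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    return tuple(int(group) for group in match.groups())


# Example usage (requires tesseract on the PATH):
# import subprocess
# out = subprocess.run(["tesseract", "--version"], capture_output=True, text=True)
# print(parse_tesseract_version(out.stdout or out.stderr))
```

Note that this only compares the release number; two builds of the same release can still differ in SIMD and threading support, as discussed above.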

kanwarnain avatar Apr 27 '22 08:04 kanwarnain