tesseract
Same Version & Different Results Ubuntu vs. Mac
Hi! I'm running tesseract on the same image on my local computer (Mac) and on an EC2 instance with Ubuntu 20.04.
I made sure to install the same version of tesseract; see the output of tesseract --version below, on my machine (top) vs. EC2 (bottom of the screenshot).
Yet I get slightly different results when I run tesseract on the same image.
Any idea why this is happening? Is there a dependency of tesseract that has a different version on EC2 vs. my laptop?
Also, if you know of a better EC2 instance than Ubuntu 20.04 for tesseract, please let me know.
thank you!
[Image: tesseract version]
The obvious difference is the SIMD support on these two systems: AVX2 and FMA vs. SSE (SSE4.1).
I believe it can explain the "slightly different results".
Did you use exactly the same parameters and traineddata file on these two systems?
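One way to compare the two builds side by side is to parse the text printed by tesseract --version on each machine. The following is a minimal sketch, assuming the first line looks like "tesseract 4.1.1" and that SIMD features appear on lines such as " Found AVX2"; the helper name parse_version_output and the sample strings are made up for illustration and are not the output from the screenshot.

```python
import re

def parse_version_output(text):
    """Return (version, set of SIMD features) from `tesseract --version` text."""
    lines = text.strip().splitlines()
    # The first line is assumed to look like "tesseract 4.1.1".
    match = re.match(r"tesseract\s+(\S+)", lines[0]) if lines else None
    version = match.group(1) if match else None
    # Feature lines are assumed to look like " Found AVX2".
    features = {m.group(1) for line in lines
                if (m := re.match(r"\s*Found\s+(\S+)", line))}
    return version, features

# Illustrative samples only; paste the real output of each machine here.
build_a = "tesseract 4.1.1\n Found AVX2\n Found AVX\n Found FMA\n Found SSE\n"
build_b = "tesseract 4.1.1\n Found SSE\n"

version_a, simd_a = parse_version_output(build_a)
version_b, simd_b = parse_version_output(build_b)
print("same version:", version_a == version_b)
print("SIMD features present on only one side:", sorted(simd_a ^ simd_b))
```

Two builds can report the same version string and still differ in the SIMD code paths they use, which is the situation described above.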
Thank you for your response. Yes, it's the same code.
Is there a way to make the SIMD support the same as on my local machine (i.e. SSE only, if I understand correctly)?
My goal is to replicate exactly the same output from my Mac on the EC2 instance with Ubuntu.
thank you!
You can try to remove this block from configure.ac:
https://github.com/tesseract-ocr/tesseract/blob/2fe1532926ec3ab17715e927045e88e7ae70b316/configure.ac#L137-L153
I'm not sure whether this will break compilation.
The difference might also be caused by multithreading: the scalar product calculations are run in 4 parallel threads by default, and the exact result can depend on the scheduling of the calculation. I'd try single-threaded runs first (set the environment variable OMP_THREAD_LIMIT=1). Tesseract uses multithreading on Ubuntu by default, while on Mac it normally runs single-threaded.
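For anyone driving tesseract from a script, a single-threaded run can be sketched as below. This is only an illustration: the helper names single_thread_env and run_tesseract are hypothetical, the file names in the example call are placeholders, and the tesseract binary is assumed to be on PATH.

```python
import os
import subprocess

def single_thread_env(base_env=None):
    """Copy an environment and cap OpenMP at one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env

def run_tesseract(image, output_base, extra_args=()):
    """Run the tesseract CLI single-threaded; needs tesseract on PATH."""
    cmd = ["tesseract", image, output_base, *extra_args]
    return subprocess.run(cmd, env=single_thread_env(), check=True)

# Example call (placeholder file names):
# run_tesseract("input.png", "output")
```

Setting the variable in the environment passed to the child process avoids changing the settings of the calling Python process.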
@benjaminlbz : please reply also for rest of question from amitdo...
Thank you for your response. @amitdo: I'm using the same parameters and the same traineddata file. Regarding the configure.ac file, could you let me know how/where I can access this file? @stweil: I tried setting the OMP thread limit to 1 but still got different results.
thank you!
@benjaminlbz, can you provide a test image which shows the problem and the complete command line which you used? Ideally it should be possible for me and others to reproduce the problem.
Sure. I'm using pytesseract to get the output in a data frame, but here is a simple example of an image and command line that gives slightly different output (I'm using psm 6 because that's what I use in the pytesseract command to get better results):
Command line: tesseract test_ocr.png test_ocr_ --psm 6
Text outputs: (see difference at the end of the file) test_ocr_mac.txt test_ocr_ubuntu.txt
[Image: test_ocr]
Thank you for your help!
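To pinpoint where two OCR outputs diverge, a line-based diff is usually enough. The sketch below uses Python's standard difflib; the function name diff_outputs and the sample strings are made up for illustration, and in practice the two .txt files produced on each machine would be read instead.

```python
import difflib

def diff_outputs(mac_text, ubuntu_text):
    """Return unified-diff lines between two OCR text outputs."""
    return list(difflib.unified_diff(
        mac_text.splitlines(), ubuntu_text.splitlines(),
        fromfile="test_ocr_mac.txt", tofile="test_ocr_ubuntu.txt",
        lineterm=""))

# Placeholder strings standing in for the contents of the two files.
for line in diff_outputs("Total 1,234\nTax 56\n", "Total 1,234\nTax 58\n"):
    print(line)
```

An empty list means the two outputs are identical line by line.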
Which traineddata did you use (best/fast/tessdata)?
I used tesseract as is after installing it, so whatever the default is; my guess is tessdata?
;-) AFAIK Ubuntu uses fast, no clue about Mac. Check the file size of the traineddata. Try to use best on both installations.
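A quick way to check whether both machines load the same model file is to compare its size and a checksum. This is a minimal sketch; the helper name describe_traineddata is hypothetical and the path in the example is a placeholder, since the tessdata location varies by installation.

```python
import hashlib
from pathlib import Path

def describe_traineddata(path):
    """Return (size in bytes, SHA-256 hex digest) of a traineddata file."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()

# Example (placeholder path; adjust to the local tessdata directory):
# print(describe_traineddata("/path/to/tessdata/eng.traineddata"))
```

If the size and digest match on both systems, the traineddata file can be ruled out as the source of the difference.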
I tried using the best traineddata files on both but still got different results. I also tried using the original traineddata files from the Mac on Ubuntu and still got different results, unfortunately. Have you been able to replicate the example I sent?
@amitdo thank you for your response. I want to try removing this part from configure.ac. Could you let me know the instructions to modify this file?
thank you!
Regarding the configure.ac file, could you let me know how/where I can access this file ?
I want to try removing this part from configure.ac. Could you let me know the instructions to modify this file ?
- Do you have the source code? If not, download it or use git to get it.
- In the root source directory you downloaded, find the configure.ac file and open it. Remove the lines I mentioned and save the file.
- Compile the code.
- Run tesseract.
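The editing step above can be sketched as a small script. This is only an illustration: the helper drop_lines is hypothetical, and the line range 137-153 applies only to configure.ac at the commit linked earlier in this thread, so it should be checked against the file actually checked out.

```python
def drop_lines(text, first, last):
    """Remove 1-indexed lines first..last (inclusive) from text."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[:first - 1] + lines[last:])

# Usage sketch, run from the root of a tesseract source checkout:
# from pathlib import Path
# path = Path("configure.ac")
# path.write_text(drop_lines(path.read_text(), 137, 153))
# Then rebuild with the usual autotools steps: ./autogen.sh, ./configure, make.
```

Editing the file by hand in a text editor works just as well; the script only makes the change repeatable.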
If you have more questions, please use the forum.
I have limited time, so I won't be able to help you with more newbie questions.
I can reproduce the Ubuntu result on Debian GNU Linux and the Mac result on macOS with AVX2.
Note that git master gives a result which differs from the 4.1.1 result.
@benjaminlbz, I doubt that --psm 6 is a good choice for your image. It implies a single uniform block of text. Your image has three columns with misaligned lines in the different columns.
@stweil I fully agree for this image. But I'm mostly working on images with tables and numbers, for which psm 6 works better. The image I uploaded is just an example to show the difference between Mac and Ubuntu.
stweil commented on Jan 3
I can reproduce the Ubuntu result on Debian GNU Linux and the Mac result on macOS with AVX2.
Mac with x86-64 or ARM64 CPU?
With 1 thread on both OSes?
You can try to install GCC on macOS and compile Tesseract with it. Maybe this will produce a different result.
I have encountered the problem of different results on the same machine. There seems to be some uncleared state in the TessBaseAPI. When processing multiple files, the same order will produce the same result, but shuffling will produce different results for the same image. It makes a difference mainly in the results of the diplopia issue.
Simple code to reproduce with tesserocr:
from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, tesseract_version

if __name__ == "__main__":
    TESSDATA_DIR = "/home/nagadomi/dev/tesseract-git/tessdata_fast"
    test_image = Image.open("test1.png")
    print(tesseract_version(), "\n")

    # Case 1: one API object is re-used for all 100 runs.
    print("* case1 API re-use")
    with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
        api.SetVariable("preserve_interword_spaces", "1")
        variants = set()
        for t in range(100):
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
        print(f"{len(variants)} different results")
        print("----\n".join(variants))

    # Case 2: a fresh API object is created for every run.
    print("* case2 API re-create")
    variants = set()
    for t in range(100):
        with PyTessBaseAPI(path=TESSDATA_DIR, lang="jpn_vert", psm=PSM.SINGLE_BLOCK_VERT_TEXT) as api:
            api.SetVariable("preserve_interword_spaces", "1")
            api.SetImage(test_image)
            text = api.GetUTF8Text()
            variants.add(text)
    print(f"{len(variants)} different results")
    print("----\n".join(variants))
test1.png
result:
% OMP_THREAD_LIMIT=1 python3 bug_report.py
tesseract 5.0.0-alpha-20210401-130-g7a308
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
* case1 API re-use
2 different results
パバイボ
パバイボ
パイポの
シューリンガン
----
パバイボ
パバイボ
パバイポの
シューリンガン
* case2 API re-create
1 different results
パバイボ
パバイボ
パバイポの
シューリンガン
@nagadomi,
The original issue is about different results on different operating systems. You are reporting about different results on the same operating system.
Please open a new issue and copy your report to that issue.
Any update on this issue? I am running into the same problem where I get significantly worse results on Ubuntu compared to MacOS when testing on the same image.
@Kojon74 What's the tesseract version on each OS? Which traineddata is used on each? What's the hardware configuration? Please provide complete information as well as the test image.
Any update on this issue? I am running into the same problem where I get significantly worse results on Ubuntu compared to MacOS when testing on the same image.
I had the same issue. I had installed tesseract on my Mac using brew, which installed tesseract version 5.0.1, while on Linux apt-get installs version 4.1.1, which gives different results. Update tesseract to 5.x on Linux and it will give consistent results. Here is how to update it on Debian: https://techviewleo.com/install-and-use-tesseract-ocr-on-debian/