tesseract fails to read simple numbers

Open embeh opened this issue 1 year ago • 24 comments

Current Behavior

I am using pytesseract (which calls /usr/bin/tesseract) to recognize numbers of a gas meter. Unfortunately, this very often fails to read most numbers and is very unreliable.

The actual command to get the number string from the image is pytesseract.image_to_string(img, lang='eng', config='--dpi 70 --psm 8 -c tessedit_char_whitelist=,0123456789')

Here is an example image (after some image processing): 20240714-162351_08_ocr

When running this through tesseract (as described above), I just get "2734"... :-(

Any ideas how to improve this, given that there never will be anything but numbers from 0-9 in the image...?

Expected Behavior

Correctly read the numbers. For the image example, this should be "4428734"

Suggested Fix

No response

tesseract -v

tesseract 4.1.1 leptonica-1.79.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1 Found AVX2 Found AVX Found FMA Found SSE Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Operating System

No response

Other Operating System

Ubuntu 20

uname -a

Linux myhost 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

Ubuntu in WSL2

Other Information

No response

embeh avatar Jul 14 '24 14:07 embeh

OK, the page segmentation mode seems to be the issue here.

Replacing --psm 8 with --psm 7 produces much better results (so does --psm 11 but none of the others) - but I have no idea why. PSM 8 is advertised as "single word...", isn't that what we have here?
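A minimal sketch of that comparison, assuming pytesseract is installed and `image` is anything `pytesseract.image_to_string` accepts (a PIL image, a NumPy array, or a file path). The helper names `build_config` and `compare_psm` are made up for illustration, and the whitelist here is digits only:

```python
def build_config(psm, whitelist="0123456789", dpi=70):
    """Build the tesseract option string for one page segmentation mode."""
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def compare_psm(image, modes=(7, 8, 11)):
    """Run the same image through several PSM values and collect the text."""
    # Imported lazily so build_config() can be used without pytesseract.
    import pytesseract

    results = {}
    for psm in modes:
        text = pytesseract.image_to_string(
            image, lang="eng", config=build_config(psm)
        )
        results[psm] = text.strip()
    return results
```

Calling `compare_psm(img)` returns a dict keyed by PSM value, which makes it easy to see which modes agree on a given image.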

embeh avatar Jul 14 '24 15:07 embeh

Why not close the issue if it's resolved?

DominicMukilan avatar Jul 16 '24 11:07 DominicMukilan

Well, I think psm 8 should be able to handle this, too, no?

embeh avatar Jul 16 '24 11:07 embeh

It is still an issue. The Tesseract LSTM engine has a very hard time recognizing very simple numbers, while PaddlePaddleOCR recognizes them well.

OCRCut

here is the result

7% 7% 23
6 6 8

psm 8 doesn't help.

The legacy engine does better on numbers, but it's totally screwed on letters.

v3ss0n avatar Jul 18 '24 19:07 v3ss0n

Hi @embeh , what kind of image processing techniques did you use?

uttaran-das avatar Aug 05 '24 20:08 uttaran-das

Hi @embeh , what kind of image processing techniques did you use?

A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

embeh avatar Aug 06 '24 10:08 embeh

Hi @embeh , what kind of image processing techniques did you use?

A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

Try to increase the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processing you had already done.

uttaran-das avatar Aug 06 '24 19:08 uttaran-das

Try to increase the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processing you had already done.

I don't really understand the motivation. If, for the same pixels, psm 7 works fine but psm 8 does not - why would a change in the image processing make a difference?

In addition, the contrast is as big as it can be: the background is pure white, the text is fully black, i.e. it is a binary image. Any grey you might see is only due to how github renders the image.

embeh avatar Aug 06 '24 20:08 embeh

PSM 8 is advertised as "single word...", isn't that what we have here?

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

If it was 'H e l l o' then you could call it a word, but inside a text line, Tesseract will still consider any big enough horizontal white space as a word separator.

amitdo avatar Aug 23 '24 18:08 amitdo

tesseract 4.1.1 is too old and we don't support it.

You said you get a better result with psm 7, but you didn't provide the output with this psm.

amitdo avatar Aug 23 '24 19:08 amitdo

tesseract 4.1.1 is too old and we don't support it.

OK. Unfortunately that seems to be the latest offered by the default Ubuntu repository (and pytesseract?).

You said you get a better result with psm 7, but you didn't provide the output with this psm.

--psm 7 produces the output "4428734"
--psm 8 produces the output "4L2B734"

Both were run on the identical image file. You should be able to reproduce this by downloading the image above and running it through tesseract.

So the result is not completely wrong, and it seems not to force the result to multiple words or such. It just messes up the "4" and the "8".

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

OK. These numbers come from an analog counter (think old car's mileage counter), so they are rather "monospaced". I certainly could use image processing to squeeze them together some more but what makes me wonder is that psm 7 simply does the job without such hacks.

Don't get me wrong - I found a solution that works for me; now all I am trying to do is provide feedback to help make this an even better piece of software...

embeh avatar Aug 23 '24 20:08 embeh

PSM 8 is advertised as "single word...", isn't that what we have here?

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

I just did a test and manually moved the individual digits closer to each other (without changing any of the black pixels) : image

...and you are correct! Now I get this:

--psm 7: "4428734"
--psm 8: "4428734"

So both report the same correct numbers, purely because of the spacing. Interesting!
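The manual test above can be approximated in code. This is only a sketch under assumptions: the image is a binary raster given as a list of rows, the background value is 255, and `squeeze_gaps` is a made-up helper name. It removes blank columns without touching any black pixel, which is all the manual edit did:

```python
def squeeze_gaps(rows, max_gap=2, background=255):
    """Shrink every run of all-background columns to at most max_gap columns.

    rows is a list of equal-length pixel rows. Columns that contain any
    non-background pixel are always kept unchanged.
    """
    if not rows:
        return []
    width = len(rows[0])
    # A column is blank when every pixel in it equals the background value.
    blank = [all(row[x] == background for row in rows) for x in range(width)]

    keep = []
    run = 0  # length of the current run of blank columns
    for x in range(width):
        run = run + 1 if blank[x] else 0
        if run <= max_gap:
            keep.append(x)
    return [[row[x] for x in keep] for row in rows]
```

With OpenCV or NumPy images, something like `np.array(squeeze_gaps(img.tolist()), dtype=np.uint8)` should convert back and forth. Note that the left and right margins are also trimmed to `max_gap` columns, so it may be worth adding a white border again before running OCR.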

embeh avatar Aug 23 '24 20:08 embeh

For psm 8 with the first image, let's say there is room for improvement...

Tesseract is very popular open source software. We get a lot of questions, bug reports and suggestions, but the team is tiny (4 people currently) and we're all volunteers.

amitdo avatar Aug 23 '24 21:08 amitdo

Same problem... too complex a library for simple tasks.

AlexNemets avatar Dec 15 '24 09:12 AlexNemets

There is also a problem with numbers such as 5 008/6 002: every other time, the number after a space is not read, so I added a workaround:

import re

def extract_numbers(text):
    # Delete all non-numeric characters except "/"
    cleaned_text = re.sub(r'[^0-9/]', '', text)
    # Search for numbers in the form XXX/YYY
    match = re.search(r'(\d+)/(\d+)', cleaned_text)
    if match:
        current = match.group(1)
        max_val = match.group(2)
        return current, max_val
    return None, None

config='--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789/'

What could be the cause of the problem?

EvilUbi avatar Jan 12 '25 19:01 EvilUbi

Here are several images of numbers where Tesseract misinterpreted the values:

Image (returned 9050) Image (returned 9053) Image (returned 3076) Image (returned 9085) Image (returned 9088) Image (returned 59142)

The options used in this case were:

ocr.SetPageSegMode(tesseract::PSM_SINGLE_LINE);
ocr.SetVariable("tessedit_char_whitelist", "0123456789");
ocr.SetVariable("classify_bln_numeric_mode", "1");  // Enable numeric mode
ocr.SetSourceResolution(300);

$ tesseract --version
tesseract 5.5.0
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.18

You can find the source code for the project here: https://github.com/philipswan/StarshipTestflightData/blob/main/extractTelemetry.cpp

philipswan avatar Mar 14 '25 08:03 philipswan

I cannot reproduce the problem. cat run.bat:

@echo off
rem Iterate over all PNG files in the current directory
for %%i in (*.png) do (
    echo Processing %%i
    tesseract "%%i" - --psm 7 --dpi 300 -c classify_bln_numeric_mode=1 -c tessedit_char_whitelist=0123456789
)

It produces the following output (i.e. no misinterpretation):

Processing 5053.png
5053
Processing 5076.png
5076
Processing 5085.png
5085
Processing 5088.png
5088
Processing 5142.png
5142

zdenop avatar Mar 14 '25 20:03 zdenop

I also just tried it using the bash-script approach. I was able to reproduce it:

phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ ./ocr
Processing ship_speed_6298_9050_89_gray.png
9050
Processing ship_speed_6302_9053_88_gray.png
9053
Processing ship_speed_6334_3076_72_gray.png
3076
Processing ship_speed_6347_9085_82_gray.png
9085
Processing ship_speed_6352_9088_82_gray.png
9088
Processing ship_speed_6426_59142_74_gray.png
59142

(Note: All of these results are wrong - although thousands of others on similar sets of numbers are correct)

phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ cat ocr
#!/bin/bash
# Iterate over all PNG files in the current directory
for file in *.png; do
  echo "Processing $file"
  tesseract "$file" - --psm 7 --dpi 300 -c classify_bln_numeric_mode=1 -c tessedit_char_whitelist=0123456789
done

To figure out why my environment may be broken while yours apparently is not, can we compare your "version" output to mine (see above)? Also, I uploaded my history here. There may be a clue in there as to how I got my installation into a non-working state.

philipswan avatar Mar 14 '25 22:03 philipswan

What version of tesseract are you using? Where did you get the eng.traineddata file from?

amitdo avatar Mar 15 '25 06:03 amitdo

I'm on 5.5.0 - see my earlier comment above. I think I used sudo apt install --reinstall tesseract-ocr-eng -y to get the latest eng.traineddata file. A couple of other commands that may help to diagnose the issue...

phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ ldconfig -p | grep tesseract
        libtesseract.so.5 (libc6,x86-64) => /lib/x86_64-linux-gnu/libtesseract.so.5
        libtesseract.so.4 (libc6,x86-64) => /lib/x86_64-linux-gnu/libtesseract.so.4
        libtesseract.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libtesseract.so
phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ ls /usr/share/tesseract-ocr/5/tessdata/
configs  eng.traineddata  pdf.ttf  spacex.traineddata  tessconfigs

Note that after encountering these errors, I did try to create a spacex.traineddata file for the specific font, but the file that I created turned out to be incompatible with the version of Tesseract that I'm on, so I bailed on that effort.

philipswan avatar Mar 15 '25 06:03 philipswan

First of all, do not use multiple tesseract versions (libtesseract.so.5 and libtesseract.so.4). Next, I use models from the tessdata repository. AFAIK distributions prefer to use tessdata_fast.

zdenop avatar Mar 15 '25 09:03 zdenop

Can we unpack what "do not use multiple tesseract version" means a bit more? Is there a known issue where, if you have more than one version installed, they somehow interfere with each other in a way that leads to one or both underperforming? If so, is there a thread on this topic? It seems like this would be the kind of problem deserving of some attention from the developer community. As for using different models, I don't think that any model should fail on the simple sets of numbers that I posted. But I'd be happy to test out different models if you can provide instructions on how to do this.

philipswan avatar Mar 15 '25 19:03 philipswan

Using multiple versions of the Tesseract library risks inadvertently relying on outdated versions (without bug fixes and improvements). This often leads to unnecessary reports of incorrect behaviour. Developers should not be expected to address this. Users must take responsibility for managing their system environment.

I don't think that any model should fail on the simple sets of numbers that I posted

Different models use different approaches and features. This can lead to different results.
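One way to check which binary and version a script actually picks up is sketched below. It assumes the executable is called `tesseract` and is found via PATH; `parse_version` and `installed_version` are illustrative names, not part of any Tesseract tooling:

```python
import re
import shutil
import subprocess


def parse_version(banner):
    """Extract the version number from the first 'tesseract X.Y.Z' in the banner."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", banner)
    return match.group(1) if match else None


def installed_version(binary="tesseract"):
    """Return (path, version) of the tesseract found on PATH, or (None, None)."""
    path = shutil.which(binary)
    if path is None:
        return None, None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Read both streams, in case the banner is not written to stdout.
    return path, parse_version(result.stdout + result.stderr)
```

If the reported path or version is not the one you expect, an older installation is probably shadowing the newer one.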

zdenop avatar Mar 15 '25 19:03 zdenop

I was able to get the script to correctly handle these images by downloading the "best model" using

/usr/share/tesseract-ocr/5/tessdata$ sudo wget https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata -O eng_best.traineddata

and then by adding a -l eng_best option to the tesseract command.
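For anyone scripting this from Python rather than bash, here is a rough equivalent of the command with the extra option. It assumes eng_best.traineddata sits in the tessdata directory as described above; `tesseract_argv` and `run_ocr` are made-up helper names:

```python
def tesseract_argv(image_path, lang="eng_best", psm=7, dpi=300,
                   whitelist="0123456789"):
    """Build the command line used above, with '-l <lang>' added."""
    return [
        "tesseract", image_path, "-",  # "-" writes the result to stdout
        "-l", lang,
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-c", "classify_bln_numeric_mode=1",
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path, **options):
    """Run tesseract on one image and return the recognized text."""
    import subprocess

    result = subprocess.run(
        tesseract_argv(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

With pytesseract, passing `lang="eng_best"` to `image_to_string` should select the same model.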

philipswan avatar Mar 16 '25 05:03 philipswan

Hi, I have a similar problem

I have this image, and I need the four-digit number as well as the date and the time (without the seconds): 001_20250731_090047.bmp. First I crop the image into 3 boxes (number, date, time; I also tried 5, but the result was worse) and use psm 7. Since it's black on white or white on black, I didn't do any editing (except the cropping).

For the example above, the result should be (1329, 07/30, 01:28), but I get (13279, 07/30, 01228). In other instances, a 3 became a 2 or sometimes a letter, so overall it is not accurate. I use version 5.5 of Tesseract.

Lastly, I'm not a developer, so I know the basics, but this conversation went a little too far for my understanding. So please, if you answer (for which I would be really grateful), keep that in mind so I can understand it.

Thank you for your help

eunalecouffe avatar Aug 29 '25 08:08 eunalecouffe