tesseract fails to read simple numbers
Current Behavior
I am using pytesseract (which calls /usr/bin/tesseract) to recognize numbers of a gas meter.
Unfortunately, this very often fails to read most numbers and is very unreliable.
The actual command to get the number string from the image is
pytesseract.image_to_string(img, lang='eng', config='--dpi 70 --psm 8 -c tessedit_char_whitelist=,0123456789')
Here is an example image (after some image processing):
When running this through tesseract (as described above), I just get "2734"... :-(
Any ideas how to improve this, given that there never will be anything but numbers from 0-9 in the image...?
Expected Behavior
Correctly read the numbers. For the image example, this should be "4428734"
Suggested Fix
No response
tesseract -v
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Operating System
No response
Other Operating System
Ubuntu 20
uname -a
Linux myhost 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux
Compiler
No response
CPU
No response
Virtualization / Containers
Ubuntu in WSL2
Other Information
No response
OK, the page segmentation mode seems to be the issue here.
Replacing --psm 8 with --psm 7 produces much better results (so does --psm 11 but none of the others) - but I have no idea why.
PSM 8 is advertised as "single word...", isn't that what we have here?
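For anyone who wants to repeat this comparison, here is a minimal sketch that runs the same image through several page segmentation modes with otherwise identical settings. The helper names build_config and compare_psm_modes are made up for illustration; the options mirror the command from the original report, and pytesseract is assumed to be installed.

```python
def build_config(psm, whitelist=",0123456789", dpi=70):
    """Build the Tesseract config string used in the report for a given PSM."""
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def compare_psm_modes(img, modes=(7, 8, 11)):
    """Run OCR once per page segmentation mode and collect the raw text."""
    # Imported here so build_config can be used without pytesseract installed.
    import pytesseract

    results = {}
    for psm in modes:
        text = pytesseract.image_to_string(img, lang="eng", config=build_config(psm))
        results[psm] = text.strip()
    return results
```

Calling compare_psm_modes(img) on the processed image should return a dict keyed by PSM, which makes it easy to see which modes disagree on the same pixels.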
Why not close the issue if it's resolved?
Well, I think psm 8 should be able to handle this, too, no?
It is still an issue. The Tesseract LSTM engine has a very hard time recognizing very simple numbers, while PaddlePaddleOCR recognizes them well.
Here is the result:
7% 7% 23
6 6 8
psm 8 doesn't help.
The legacy engine does better on numbers, but it's totally screwed on letters.
Hi @embeh , what kind of image processing techniques did you use?
A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?
Try increasing the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processing you had already done.
I don't really understand the motivation. If, for the same pixels, psm 7 works fine but psm 8 does not - why would a change in the image processing make a difference?
In addition, the contrast is as big as it can be: the background is pure white, the text is fully black, i.e. it is a binary image. Any grey you might see is only due to how github renders the image.
PSM 8 is advertised as "single word...", isn't that what we have here?
What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.
If it was 'H e l l o' then you could call it a word, but inside a text line, Tesseract will still consider any big enough horizontal white space as a word separator.
tesseract 4.1.1 is too old and we don't support it.
You said you get a better result with psm 7, but you didn't provide the output with this psm.
tesseract 4.1.1 is too old and we don't support it.
OK. Unfortunately that seems to be the latest offered by the default Ubuntu repository (and pytesseract?).
You said you get a better result with psm 7, but you didn't provide the output with this psm.
--psm 7 produces the output "4428734"
--psm 8 produces the output "4L2B734"
Both were run on the identical image file. You should be able to reproduce this by downloading the image above and running it through tesseract.
So the result is not completely wrong, and it seems not to force the result to multiple words or such. It just messes up the "4" and the "8".
What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.
OK. These numbers come from an analog counter (think old car's mileage counter), so they are rather "monospaced". I certainly could use image processing to squeeze them together some more but what makes me wonder is that psm 7 simply does the job without such hacks.
Don't get me wrong - I found a solution that works for me; now all I am trying to do is provide feedback to help make this an even better piece of software...
PSM 8 is advertised as "single word...", isn't that what we have here?
What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.
I just did a test and manually moved the individual digits closer to each other (without changing any of the black pixels) :
...and you are correct! Now I get this:
--psm 7: "4428734"
--psm 8: "4428734"
So both report the same correct numbers, and only because of the spacing. Interesting!
For psm 8 with the first image, let's say there is room for improvement...
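The manual experiment above (moving digits closer together without touching any black pixel) could be automated roughly as follows. This is only a sketch under assumptions: the image is binary with a pure white (255) background, it is passed as a plain list of pixel rows, and the function name squeeze_gaps and the max_gap default are invented for illustration.

```python
def squeeze_gaps(rows, max_gap=2, background=255):
    """Shrink every run of fully blank columns to at most max_gap columns.

    rows is a list of equal-length pixel rows. A column counts as blank when
    all of its pixels equal `background`. Columns containing ink are kept
    unchanged, so the glyph shapes themselves are not altered.
    """
    if not rows:
        return []
    kept = []
    blank_run = 0
    for column in zip(*rows):  # iterate over the image column by column
        if all(pixel == background for pixel in column):
            blank_run += 1
            if blank_run > max_gap:
                continue  # drop surplus blank columns
        else:
            blank_run = 0
        kept.append(column)
    # Transpose back from columns to rows.
    return [list(row) for row in zip(*kept)]
```

With an OpenCV/NumPy image, something like squeeze_gaps(img.tolist()) followed by numpy.array(result, dtype=numpy.uint8) should work for converting back and forth. A suitable max_gap depends on the digit size and would need tuning per counter.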
Tesseract is very popular open source software. We get a lot of questions, bug reports and suggestions, but the team is tiny (4 people currently) and we're all volunteers.
Same problem... too complex a library for simple tasks.
There is also a problem with numbers: for example, with 5 008/6 002, every other time it cannot read a number after a space, so I added exceptions.
import re

def extract_numbers(text):
    # Delete all non-numeric characters except "/"
    cleaned_text = re.sub(r'[^0-9/]', '', text)
    # Search for numbers in XXX/YYY format
    match = re.search(r'(\d+)/(\d+)', cleaned_text)
    if match:
        current = match.group(1)
        max_val = match.group(2)
        return current, max_val
    return None, None
config='--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789/'
What could be the cause of the problem?
Here are several images of numbers where Tesseract misinterpreted the values:
(returned 9050)
(returned 9053)
(returned 3076)
(returned 9085)
(returned 9088)
(returned 59142)
The options used in this case were:
ocr.SetPageSegMode(tesseract::PSM_SINGLE_LINE);
ocr.SetVariable("tessedit_char_whitelist", "0123456789");
ocr.SetVariable("classify_bln_numeric_mode", "1"); // Enable numeric mode
ocr.SetSourceResolution(300);
$ tesseract --version
tesseract 5.5.0
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.18
You can find the source code for the project here: https://github.com/philipswan/StarshipTestflightData/blob/main/extractTelemetry.cpp
I cannot reproduce the problem.
cat run.bat:
@echo off
rem Iterate over all PNG files in the current directory
for %%i in (*.png) do (
echo Processing %%i
tesseract "%%i" - --psm 7 --dpi 300 -c classify_bln_numeric_mode=1 -c tessedit_char_whitelist=0123456789
)
It produces the following output (i.e. no misinterpretation):
Processing 5053.png
5053
Processing 5076.png
5076
Processing 5085.png
5085
Processing 5088.png
5088
Processing 5142.png
5142
I also just tried it using the bash-script approach. I was able to reproduce it:
phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ ./ocr
Processing ship_speed_6298_9050_89_gray.png
9050
Processing ship_speed_6302_9053_88_gray.png
9053
Processing ship_speed_6334_3076_72_gray.png
3076
Processing ship_speed_6347_9085_82_gray.png
9085
Processing ship_speed_6352_9088_82_gray.png
9088
Processing ship_speed_6426_59142_74_gray.png
59142
(Note: All of these results are wrong - although thousands of others on similar sets of numbers are correct)
phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ cat ocr
#!/bin/bash
# Iterate over all PNG files in the current directory
for file in *.png; do
echo "Processing $file"
tesseract "$file" - --psm 7 --dpi 300 -c classify_bln_numeric_mode=1 -c tessedit_char_whitelist=0123456789
done
To figure out why my environment may be broken while yours apparently is not, can we compare your "version" output to mine (see above)? Also, I uploaded my history here. There may be a clue in there as to how I got my installation into a non-working state.
What version of tesseract are you using? Where did you get the eng.traineddata file from?
I'm on 5.5.0 - see my earlier comment above. I think I used sudo apt install --reinstall tesseract-ocr-eng -y to get the latest eng.traineddata file.
A couple of other commands that may help to diagnose the issue...
phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ ldconfig -p | grep tesseract
libtesseract.so.5 (libc6,x86-64) => /lib/x86_64-linux-gnu/libtesseract.so.5
libtesseract.so.4 (libc6,x86-64) => /lib/x86_64-linux-gnu/libtesseract.so.4
libtesseract.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libtesseract.so
phil@LGGram3:/mnt/c/Users/phils/Documents/StarshipTestflightData/ocrFails$ ls /usr/share/tesseract-ocr/5/tessdata/
configs eng.traineddata pdf.ttf spacex.traineddata tessconfigs
Note that after encountering these errors, I did try to create a spacex.traineddata file for the specific font, but the file that I created turned out to be incompatible with the version of Tesseract that I'm on, so I bailed on that effort.
First of all, do not use multiple tesseract versions (libtesseract.so.5 and libtesseract.so.4).
Next, I use models from the tessdata repository. AFAIK, distributions prefer to use tessdata_fast.
Can we unpack what "do not use multiple tesseract version" means a bit more? Is there a known issue where, if you have more than one version installed, they somehow interfere with each other in a way that leads to one or both underperforming? If so, is there a thread on this topic? It seems like this would be the kind of problem deserving of some attention from the developer community. As for using different models, I don't think that any model should fail on the simple sets of numbers that I posted. But I'd be happy to test out different models if you can provide instructions on how to do this.
Using multiple versions of the Tesseract library risks inadvertently relying on outdated versions (without bug fixes and improvements). This often leads to unnecessary reports of incorrect behaviour. Developers should not be expected to address this. Users must take responsibility for managing their system environment.
I don't think that any model should fail on the simple sets of numbers that I posted
Different models use different approaches and features, which can lead to different results.
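On the point about mixed installations, a quick way to see which tesseract binary is actually first on PATH, and what version it reports, might look like the sketch below. The function names are invented for illustration. It only inspects the command-line tool, so it says nothing about which libtesseract.so a separately compiled C++ program links against.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(version_output):
    """Extract a version such as '5.5.0' from `tesseract --version` output."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", version_output)
    return match.group(1) if match else None


def describe_installation():
    """Return (path, version) for the first tesseract binary found on PATH."""
    path = shutil.which("tesseract")
    if path is None:
        return None, None
    proc = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some releases print the banner on stderr, so both streams are checked.
    return path, parse_tesseract_version(proc.stdout + proc.stderr)
```

If the reported version is older than the one you believe you installed, that would point to a stale binary shadowing the new one.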
I was able to get the script to correctly handle these images by downloading the "best model" using
/usr/share/tesseract-ocr/5/tessdata$ sudo wget https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata -O eng_best.traineddata
and then by adding a -l eng_best option to the tesseract command.
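For scripting this from Python rather than bash, the same command line with the extra -l eng_best option could be assembled as in this sketch. The helper names are hypothetical, and eng_best is simply the file name chosen in the wget command above, so the model is assumed to be saved as eng_best.traineddata in the tessdata directory.

```python
import subprocess


def tesseract_args(image_path, lang="eng_best", psm=7, dpi=300,
                   whitelist="0123456789"):
    """Build the argument list for the tesseract CLI, printing to stdout ('-')."""
    return [
        "tesseract", image_path, "-",
        "-l", lang,
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-c", "classify_bln_numeric_mode=1",
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def ocr_digits(image_path, **options):
    """Run tesseract on one image and return the recognized text."""
    proc = subprocess.run(tesseract_args(image_path, **options),
                          capture_output=True, text=True, check=True)
    return proc.stdout.strip()
```

Passing lang="eng" instead makes it easy to compare the distribution's default model against the downloaded one on the same files.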
Hi, I have a similar problem
I have this image, and I need the four-digit number as well as the date and the time (without the seconds).
001_20250731_090047.bmp
First I crop the image into 3 boxes (number, date, time) (I also tried 5, but the result was worse) and use psm 7. Since it's black on white or white on black, I didn't do any editing (except the cropping).
For the example above the result should be (1329, 07/30, 01:28), but I get (13279, 07/30, 01228). In other instances, I also had a 3 becoming a 2, or sometimes a letter; overall it is not accurate. I use version 5.5 of Tesseract.
And lastly, I'm not a developer, so I know the basics, but this conversation went a little too far for my understanding. So please, if you answer (for which I would be really grateful), keep that in mind so I can understand it.
Thank you for your help
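This does not fix the recognition itself, but a simple format check can at least flag reads like "13279" or "01228" so they can be retried or checked by hand. The sketch below assumes the three fields always look like a four-digit number, a MM/DD date and an HH:MM time, as in the example above. The names PATTERNS and check_fields are made up for illustration.

```python
import re

# Expected shape of each field (assumed from the example in the post).
PATTERNS = {
    "number": re.compile(r"\d{4}"),
    "date": re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])"),
    "time": re.compile(r"([01]\d|2[0-3]):[0-5]\d"),
}


def check_fields(number, date, time):
    """Return the names of the fields whose text does not fit the expected shape."""
    values = {"number": number, "date": date, "time": time}
    return [name for name, text in values.items()
            if not PATTERNS[name].fullmatch(text.strip())]
```

An empty list means all three fields look plausible. It cannot catch a wrong digit that still fits the pattern, such as a 3 read as a 2.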