tesseract
Character confusion fix suggestion
Environment
- Tesseract Version: 4.1.1
- Platform: 4.15.0-122-generic #124-Ubuntu SMP
Hello. We use Tesseract heavily in our platform, and the issue we hit most often is the following: if the image contains the sequence "2032BA065", the output is "2032BA0O65". The same happens with other characters, for example B -> B8 and 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all cases (at least in our dataset).
The duplication occurs at two consecutive time steps (t, t+1) on the same character. The confidence probabilities of the two candidate characters are very close to each other at time steps t and t+1, whereas for unambiguous characters the confidence is close to 1.0 at each time step. Unfortunately, Tesseract doesn't filter out this kind of duplication between confused characters. To fix this, let P(t) and P(t+1) be the probabilities of the recognized characters at consecutive time steps t and t+1, respectively.
D(t+1) = P(t+1) / (P(t) + P(t+1)), where D(t+1) is the confusion metric. If D(t+1) < threshold, we stop and ignore the confused character.
Suggested fix: in src/lstm/recodebeam.cpp, between lines 907 and 908, we add:
    if (prev != nullptr and code > 0 and code != 139 and prev->code != 139 and prev->code > 0)
    {
        const float sum_proba_prev_current = std::max(outputs[code], outputs[prev->code]) + std::min(outputs[code], outputs[prev->code]);
        const float ratio_scores = outputs[code] / sum_proba_prev_current;
        if (ratio_scores < 0.88f) break;
    }
The threshold 0.88 was set experimentally, but I hope this helps address the issue in future versions and that it generalizes well.
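The ratio described above can be sketched as a standalone function, which makes it easier to try different thresholds outside the beam search. This is a minimal illustration only: `ConfusionRatio` and `IsConfusedDuplicate` are hypothetical names, not Tesseract functions, and the zero-sum guard is an assumption added for safety rather than something the suggested fix does.

```cpp
#include <cassert>

// Hypothetical helper, not part of Tesseract.
// Computes D = p_current / (p_previous + p_current), the ratio from the post.
inline float ConfusionRatio(float p_previous, float p_current) {
  assert(p_previous >= 0.0f && p_current >= 0.0f);
  const float sum = p_previous + p_current;
  // Assumption: if both probabilities are zero, report 1.0 so that
  // nothing is flagged as confused (avoids a division by zero).
  if (sum <= 0.0f) return 1.0f;
  return p_current / sum;
}

// Returns true when the current character does not clearly dominate
// the previous one, i.e. the ratio falls below the threshold.
// The default 0.88 is the experimentally chosen value from the post.
inline bool IsConfusedDuplicate(float p_previous, float p_current,
                                float threshold = 0.88f) {
  return ConfusionRatio(p_previous, p_current) < threshold;
}
```

With these definitions, two probabilities of 0.50 and 0.45 give a ratio of about 0.47 and are flagged, while 0.01 and 0.98 give a ratio of about 0.99 and are not.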
Unfortunately, I cannot provide any documents because we work on sensitive data.
Thank you.
Do you want to send a pull request with the suggested fix?
Why do you check code > 0 and code != 139?
Related issues: #884, #1011, #1060, #1063, #1362, #1465, #2738.
Do you want to send a pull request with the suggested fix?
I could create a PR yes, but the threshold might not be universal
Why do you check code > 0 and code != 139?
Just want to avoid empty space and null char
Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?
I could create a PR yes, but the threshold might not be universal.
Which other values beside 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine?
Yes, we tested other values too, from 0.7 to 0.9, and found that 0.88 behaves best.
Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?
In our case, code = 0 corresponds to empty (or space). I printed the debug output for part of a string, and we get label=0 between characters:
DECODED CHARACTER LSTM 4: 4, label=63
DECODED CHARACTER LSTM 5: , label=0
DECODED CHARACTER LSTM 6: A, label=1
The code 139 is the null char for us. Does the null_char_ variable always have the same code mapping?
I believe it will be a different number in other traineddata files.
That's why I was asking.
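One way to avoid depending on a fixed number such as 139 is to pass the null code in as a parameter, so the check follows whatever the loaded traineddata uses. The sketch below is illustrative only: `BeamNode` is a hypothetical stand-in with just the one field needed here, and `ShouldCompare` is not a Tesseract function.

```cpp
#include <cassert>

// Hypothetical stand-in for a beam node; only the field used here.
struct BeamNode {
  int code;
};

// Returns true when both the previous and the current code are real
// characters, i.e. neither equals the null code of the current model.
// null_char is supplied by the caller instead of being hard-coded.
inline bool ShouldCompare(const BeamNode* prev, int code, int null_char) {
  assert(null_char >= 0);
  return prev != nullptr && code != null_char && prev->code != null_char;
}
```

Note that this variant no longer excludes code 0, which the original condition treated as space. Whether that exclusion is still needed would have to be checked against the model in use.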
@EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata?
I have just run a test on the TIFF files from test/testing using this conditional:
if (prev != nullptr && code != null_char_ && prev->code != null_char_) {
This fixed several confusions, all similar to this one:
-“I’'ve never forgotten that mo-
+“I've never forgotten that mo-
I would have expected “I’ve never forgotten that mo-.
Internally, Tesseract has two preferred choices, with ' ranking lower than ’:
<span class='ocrx_cinfo' id='choice_1_119_13' title='x_confs 75.604965'>’</span>
<span class='ocrx_cinfo' id='choice_1_119_14' title='x_confs 74.249809'>'</span>
So the new code picked the wrong choice.
https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/eng/eng.wordlist
I've i've I'VE I’ve
https://en.wikipedia.org/wiki/Apostrophe
We use the best traineddata for the French language.
https://en.wikipedia.org/wiki/Apostrophe
So, both apostrophes should be considered as OK in tesseract's output, right?
' is not wrong, but ’ is better and also detected in other lines without any confusion.
If there is a confusion with two alternatives of similar confidence, I'd normally take the one with higher confidence, even if it is only slightly higher (unless there are other rules like for example a dictionary which suggest to take the second alternative).
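The rule of keeping the alternative with the higher confidence can be sketched as follows. This is a simplified illustration, not how Tesseract selects choices internally: `Choice` and `PickHigherConfidence` are hypothetical names, and dictionary-based overrides are deliberately left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical record for one recognition alternative; not a Tesseract type.
struct Choice {
  std::string text;   // UTF-8 text of the alternative
  float confidence;   // confidence as reported, e.g. x_confs in hOCR
};

// Returns the alternative with the higher confidence.
// On a tie, the first argument is kept.
inline const Choice& PickHigherConfidence(const Choice& a, const Choice& b) {
  assert(a.confidence >= 0.0f && b.confidence >= 0.0f);
  return (b.confidence > a.confidence) ? b : a;
}
```

Applied to the two apostrophe choices quoted earlier in the thread (75.604965 for ’ and 74.249809 for '), this rule would keep ’, regardless of the order in which the two are passed.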
Just to clarify, the suggested fix removes one confused character, but it is not necessarily the correct one (like the example with the apostrophe).
One question: could you please point me to the exact code block where the null_char_ mapping happens? Thanks.
https://github.com/tesseract-ocr/tesseract/blob/5761880676639ba6845dfcfc03f9c8989c9aa23b/src/lstm/lstmrecognizer.cpp#L119
I hope it is OK for me to chime in and point out that this issue has affected many users for some years now. Even if the proposed fix does not always choose the best candidate, it is still very much an improvement over the current situation. Could someone experienced in C++ and Tesseract please open a pull request to get the process started and the change reviewed?
@stweil related to your question. "TheSeiko, do you have example images which still show this issue? We need them to test a bug fix which was suggested in #3144".
I've already posted some images to #1060. Now I've collected more images with double characters. I'm posting them below. I've marked the double characters bold.
All were tested with:
C:\Tesseract-OCR20201127>tesseract --version
tesseract v5.0.0-alpha.20201127
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0
on Windows 10 64bit
example call: C://Tesseract-OCR20201127/tesseract D:\var\ocrvideoreader\images\tmp\20210416211033138_1618341916852_bottom.png stdout --dpi 400 --oem 1 --psm 6 -l deu+lat
US-Paläontologen haben eine
Tyrannosaurus rex-,Zaáhlung" gemacht.

Online-Vortragsreihe
Beginn ist um 9.30 Uhr.
Die Teilnahme ist
kostenlos. Anmeldungen
sind per E-Mail an:
frauenbuero@magq.linz.at
erforderlich.

US-Präsident Biden schlug Kremilchef Putin einen
Gipfel zur Deeskalation in einem Drittland vor.

Österreich
In einem derzeitigen Gesetzesentwurf werden Razzien im Behördenbereich beinahe verunmöjglicht.
Nach einem Treffen mit Experten ist
Justizministerin Zadic bereit, entsprechende
Änderungen am Entwurf vorzunehmen.

Shaquille ONeal Sportskanone auf der Suche nach neuem Team!
Unser „Shagq“ ist sehr
menschenbezogen,
intelligent und brav.

Service
Im April auf
www.ibkinfo.at:
Innsbruck zu Fuf$ und am
Radl erkunden sowie
Neues zum Rad-
Masterplan.

Fußball
OFB-Legionáar Philipp Lienhart trifft beim
2:0-Sieg von Freiburg gegen Augsburg.

Politik .
Die SPO kritisiert das ,,|chaotische" Corona-
Management der Regierung scharf.

Kurzfilmfestival
Eine hochkarätige Aus-
wahl meist dystopischer
Filme, zusammengestellt
von ProgrammerlInnen
aus Cannes, Locarno,
Sarajevo und mehr.
