tesseract Character confusion fix suggestion

Environment

Tesseract Version: 4.1.1
Platform: 4.15.0-122-generic #124-Ubuntu SMP

Hello, we use Tesseract a lot in our platform, and the issue we hit most often is duplicated characters. For example, if the image contains the sequence "2032BA065", the output is "2032BA0O65". This happens with other characters too, for example B -> B8 and 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all cases, at least in our dataset.

The duplication happens at two consecutive time steps (t, t+1). At those steps the confidence probabilities of the two characters are close to each other, whereas for characters that are not confused the confidence is close to 1.0 at each time step. Unfortunately, Tesseract doesn't filter out this kind of duplication between confused characters. To fix this, let P(t) and P(t+1) be the probabilities of the recognized characters at consecutive time steps t and t+1, respectively.

D(t+1) = P(t+1) / (P(t) + P(t+1)), where D(t+1) is the confusion metric. If D(t+1) < threshold, we stop and ignore the confused character.
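As a rough illustration of the metric, here is a minimal sketch that assumes the denominator is the sum of both probabilities, as in the suggested code. The function name `ConfusionRatio` and the fallback for a zero sum are assumptions made for this example and are not part of Tesseract.

```cpp
// Confusion metric from the post: D = P(t+1) / (P(t) + P(t+1)).
// p_prev and p_cur are the probabilities of the two recognized characters.
// Hypothetical helper for illustration only.
float ConfusionRatio(float p_prev, float p_cur) {
  const float sum = p_prev + p_cur;
  // Assumption of this sketch: treat a zero sum as "not confused" (ratio 1.0)
  // instead of dividing by zero.
  return sum > 0.0f ? p_cur / sum : 1.0f;
}
```

With made-up values, a clear case such as `ConfusionRatio(0.02f, 0.97f)` gives about 0.98 and stays above the 0.88 threshold, while a confused pair such as `ConfusionRatio(0.45f, 0.50f)` gives about 0.53 and falls below it.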

In src/lstm/recodebeam.cpp, between lines 907 and 908, we add:

Suggested Fix:

      if (prev != nullptr && code > 0 && code != 139 && prev->code != 139 && prev->code > 0) {
        // Share of the current code's score in the combined score of both codes.
        const float sum_proba_prev_current = outputs[code] + outputs[prev->code];
        const float ratio_scores = outputs[code] / sum_proba_prev_current;
        // Below the threshold the two codes are considered confused: stop here.
        if (ratio_scores < 0.88f) break;
      }

The threshold of 0.88 was set experimentally. I hope this helps to address the issue in future versions and that it generalizes well.

Unfortunately, I cannot provide any documents because we work on sensitive data.

Thank you.

Oct 30 '20 07:10 EucliTs0

Do you want to send a pull request with the suggested fix?

Oct 30 '20 08:10 stweil

Why do you check code > 0 and code != 139?

Oct 30 '20 08:10 stweil

Related issues: #884, #1011, #1060, #1063, #1362, #1465, #2738.

Oct 30 '20 08:10 stweil

> Do you want to send a pull request with the suggested fix?

I could create a PR, yes, but the threshold might not be universal.

Oct 30 '20 08:10 EucliTs0

> Why do you check code > 0 and code != 139?

I just want to skip the empty space and the null char.

Oct 30 '20 09:10 EucliTs0

Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?

Oct 30 '20 09:10 stweil

> I could create a PR yes, but the threshold might not be universal.

Which other values besides 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine?

Oct 30 '20 09:10 stweil

> > I could create a PR yes, but the threshold might not be universal.
>
> Which other values besides 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine?

Yes, we tested other values too, from 0.7 to 0.9, and found that 0.88 behaves best.

Oct 30 '20 09:10 EucliTs0

> Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?

In our case, code = 0 corresponds to empty (or space). I printed the debug output for part of a string, and we get label=0 between characters:

DECODED CHARACTER LSTM 4: 4, label=63
DECODED CHARACTER LSTM 5:  , label=0
DECODED CHARACTER LSTM 6: A, label=1

For us, 139 is the null char. Does the null_char_ variable always map to the same code?

Oct 30 '20 10:10 EucliTs0

I believe it will be a different number in other traineddata files.
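Since the null character's code depends on the traineddata file, one option is to pass it in rather than hard-code 139. The following is a minimal sketch under that assumption. `IsConfusedDuplicate` and its parameters are hypothetical names, not Tesseract API, and `prev_code < 0` is used here only as a stand-in for "no previous node". The extra check for the space code (`code > 0` in the original snippet) is left out.

```cpp
#include <cassert>

// Hypothetical helper, not part of Tesseract.
// `outputs` holds one score per code, as in the snippet from the original
// post. `null_char` is supplied by the caller instead of being fixed to 139.
bool IsConfusedDuplicate(const float* outputs, int code, int prev_code,
                         int null_char, float threshold) {
  assert(threshold > 0.0f && threshold < 1.0f);
  // Assumption: a negative prev_code means there is no previous node.
  if (prev_code < 0) return false;
  // Never treat the null character as a confused duplicate.
  if (code == null_char || prev_code == null_char) return false;
  const float sum = outputs[code] + outputs[prev_code];
  if (sum <= 0.0f) return false;
  // Same ratio test as the suggested fix, with a configurable threshold.
  return outputs[code] / sum < threshold;
}
```

A caller would pass the null character code of the loaded model and a threshold such as 0.88, and skip the current code when the function returns true.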

Oct 30 '20 19:10 amitdo

That's why I was asking.

Oct 30 '20 19:10 stweil

@EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata?

I have just run a test on the TIFF files from test/testing and used this conditional:

      if (prev != nullptr && code != null_char_ && prev->code != null_char_) {

This fixed several confusions, all similar to this one:

-“I’'ve never forgotten that mo-
+“I've never forgotten that mo-

I would have expected “I’ve never forgotten that mo-.

Internally Tesseract has two preferred choices, with ' ranking less than ’:

    <span class='ocrx_cinfo' id='choice_1_119_13' title='x_confs 75.604965'>’</span>
    <span class='ocrx_cinfo' id='choice_1_119_14' title='x_confs 74.249809'>&#39;</span>

So the new code picked the wrong choice.

Oct 31 '20 17:10 stweil

https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/eng/eng.wordlist

I've i've I'VE I’ve

Oct 31 '20 20:10 amitdo

https://en.wikipedia.org/wiki/Apostrophe

Oct 31 '20 21:10 amitdo

> @EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata?
>
> I have just run a test on the TIFF files from test/testing and used this conditional:
>       if (prev != nullptr && code != null_char_ && prev->code != null_char_) {
> This fixed several confusions, all similar to this one:
> -“I’'ve never forgotten that mo-
> +“I've never forgotten that mo-
> I would have expected “I’ve never forgotten that mo-.
>
> Internally Tesseract has two preferred choices, with ' ranking less than ’:
>     <span class='ocrx_cinfo' id='choice_1_119_13' title='x_confs 75.604965'>’</span>
>     <span class='ocrx_cinfo' id='choice_1_119_14' title='x_confs 74.249809'>&#39;</span>
> So the new code picked the wrong choice.

We use the best traineddata, French language.

Nov 02 '20 08:11 EucliTs0

> https://en.wikipedia.org/wiki/Apostrophe

So, both apostrophes should be considered as OK in Tesseract's output, right?

Nov 02 '20 08:11 EucliTs0

' is not wrong, but ’ is better and also detected in other lines without any confusion.

If there is a confusion with two alternatives of similar confidence, I'd normally take the one with higher confidence, even if it is only slightly higher (unless there are other rules like for example a dictionary which suggest to take the second alternative).
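The rule described above, keeping the alternative with the higher confidence, could be sketched as follows. `Candidate` and `PickMoreConfident` are hypothetical names made up for this example. Tesseract's internal structures differ, and dictionary-based rules are not modeled here.

```cpp
#include <string>

// Hypothetical candidate type for this sketch.
struct Candidate {
  std::string text;
  float confidence;  // for example, the x_confs value from the hOCR output
};

// Returns the candidate with the higher confidence.
// On a tie, the first candidate is kept.
const Candidate& PickMoreConfident(const Candidate& a, const Candidate& b) {
  return b.confidence > a.confidence ? b : a;
}
```

Applied to the two choices listed earlier (75.604965 for ’ and 74.249809 for '), this would keep ’.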

Nov 02 '20 08:11 stweil

Just to clarify, the suggested fix removes one confused character, but it is not necessarily the correct one (like the example with the apostrophe).

One question: could you please point me to the exact code block where the null_char_ mapping happens? Thanks.

Nov 03 '20 10:11 EucliTs0

> One question, could you please provide me the exact code block where _null_char mapping is happening? Thanks.

https://github.com/tesseract-ocr/tesseract/blob/5761880676639ba6845dfcfc03f9c8989c9aa23b/src/lstm/lstmrecognizer.cpp#L119

Nov 03 '20 15:11 amitdo

I hope it is ok for me to chime in and point out that this issue affects many users for some years now. Even if the proposed fix does not choose the best candidate, it is still very much an improvement over the current situation. Could someone experienced in C++ and tesseract please add a pull request to get the process started and the change reviewed?

Feb 22 '21 00:02 mb0

@stweil, regarding your question: "TheSeiko, do you have example images which still show this issue? We need them to test a bug fix which was suggested in #3144".

I've already posted some images to #1060. Now I've collected more images with double characters. I'm posting them below. I've marked the double characters bold.

All are tested with:

      C:\Tesseract-OCR20201127>tesseract --version
      tesseract v5.0.0-alpha.20201127
      leptonica-1.78.0
      libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
      Found AVX2
      Found AVX
      Found FMA
      Found SSE
      Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
      Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0

on Windows 10 64bit

example call: C://Tesseract-OCR20201127/tesseract D:\var\ocrvideoreader\images\tmp\20210416211033138_1618341916852_bottom.png stdout --dpi 400 --oem 1 --psm 6 -l deu+lat

Apr 16 '21 19:04 TheSeiko

US-Paläontologen haben eine Tyrannosaurus rex-,Zaáhlung" gemacht. 20210416210716551_1618580096723_bottom

Apr 16 '21 19:04 TheSeiko

Online-Vortragsreihe

Beginn ist um 9.30 Uhr. Die Teilnahme ist kostenlos. Anmeldungen sind per E-Mail an: frauenbuero@magq.linz.at erforderlich. 20210416210859703_1618393387569_main

Apr 16 '21 19:04 TheSeiko

US-Präsident Biden schlug Kremilchef Putin einen Gipfel zur Deeskalation in einem Drittland vor. 20210416211033138_1618341916852_bottom

Apr 16 '21 19:04 TheSeiko

Österreich

In einem derzeitigen Gesetzesentwurf werden Razzien im Behördenbereich beinahe verunmöjglicht.

Nach einem Treffen mit Experten ist Justizministerin Zadic bereit, entsprechende Änderungen am Entwurf vorzunehmen. 20210416211335774_1618273347138_main

Apr 16 '21 19:04 TheSeiko

Shaquille ONeal Sportskanone auf der Suche nach neuem Team!

Unser „Shagq“ ist sehr menschenbezogen, intelligent und brav. 20210416211528904_1617921093632_main

Apr 16 '21 19:04 TheSeiko

Service

Im April auf www.ibkinfo.at: Innsbruck zu Fuf$ und am Radl erkunden sowie Neues zum Rad- Masterplan. 20210416211658610_1617575639408_right

Apr 16 '21 19:04 TheSeiko

Fußball OFB-Legionáar Philipp Lienhart trifft beim 2:0-Sieg von Freiburg gegen Augsburg. 20210416211825294_1616479218850_bottom

Apr 16 '21 19:04 TheSeiko

Politik . Die SPO kritisiert das ,,|chaotische" Corona- Management der Regierung scharf. 20210416212930684_1595777062447_bottom

Apr 16 '21 19:04 TheSeiko

Kurzfilmfestival

Eine hochkarätige Aus- wahl meist dystopischer Filme, zusammengestellt von ProgrammerlInnen aus Cannes, Locarno, Sarajevo und mehr. 20210416213215294_1585967118329_right

Apr 16 '21 19:04 TheSeiko
