
German recognition: word breaks at the end of lines are not handled correctly when concatenating lines

Open notYetLost opened this issue 4 years ago • 2 comments

In German, a word break at the end of a line is represented by a "-" directly after the last character of the first part of the word.

The Android app "OCR" concatenates lines by appending a space after this word-break "-". Otherwise OCR does a superb job!

The issue could easily be fixed with the following procedure when concatenating lines:

If a line ends with a "-", check the first word of the next line:

* If the word is "und" or "oder": keep the "-" and insert a space when concatenating the lines (current procedure).
* With any other first word: if the word starts with a lowercase character, drop the "-" when concatenating the lines; otherwise keep the "-". In both cases, do NOT insert an additional space.
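The procedure above could be sketched roughly as follows. This is only an illustration in Python, not code from the app: the function name `join_lines` and the constant `KEEP_WITH_SPACE` are hypothetical, and it assumes the OCR result is available as a list of text lines.

```python
# Illustrative sketch of the proposed rule; names are hypothetical, not from the app.
KEEP_WITH_SPACE = {"und", "oder"}  # first words after which "-" and a space are kept


def join_lines(lines):
    """Concatenate OCR lines, applying the proposed end-of-line "-" rule."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if not result:
            result = line
        elif result.endswith("-"):
            first_word = line.split()[0]
            if first_word in KEEP_WITH_SPACE:
                result += " " + line         # keep "-", insert a space (current procedure)
            elif first_word[0].islower():
                result = result[:-1] + line  # drop "-", no space
            else:
                result += line               # keep "-", no space
        else:
            result += " " + line             # ordinary line end: join with a space
    return result


print(join_lines(["Das ist ein Bei-", "spiel."]))  # Das ist ein Beispiel.
print(join_lines(["Ein-", "und Ausgang"]))         # Ein- und Ausgang
print(join_lines(["E-Mail-", "Adresse"]))          # E-Mail-Adresse
```

The sketch compares the first word literally, so a capitalised "Und" or a first word followed directly by punctuation would need extra handling.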

Attachments: TesseractOcrAndroid-TestPage, 2021-06-13 08_Tesseract-German-Recognition

notYetLost, Jun 13 '21 07:06

In some rare cases, as shown in the second attachment (my comments on the OCR result), my suggested procedure will incorrectly drop a space: when a "-" at the end of a line does not mark a word break but stands in for a part of a word that is only spelled out later in the sentence.

An example from the OCR result where my suggestion would fail: "Microsoft- auf ein lokales Konto", if this "-" had been at the end of the line. However, this is a very, very rare case!

notYetLost, Jun 13 '21 07:06

@notYetLost - Thank you for detailing the needed algorithm! Greetings from Berlin! :-)

pixel2user, Jan 01 '22 23:01