tessdata_best Telugu unicode ambiguities

Hi, I created a test text file made up mostly of individual characters (see attachment) and converted it to a TIFF file using 'jTessBoxEditorFX' with the font 'Noto Sans Telugu' at 8pt. I then ran it through the tessdata_best Telugu trained data. I noticed a few recognition errors, which I believe are due to ambiguous glyphs.

Ambiguity 1: Telugu has three vowels that look close enough to two consonants to be confused with them.

Vowels: 1) ఒ (pronounced as 'o' in 'so') 2) ఓ (pronounced as 'oa' in 'goal') 3) ఔ (pronounced as 'ou' in 'ounce' or 'pound')

Similar-looking consonants: 1) బ (pronounced as 'bu' in 'bus') 2) భ (the same sound, but uttered with stress and aspiration. Imagine saying 'bus' as 'bhus')

Ambiguity 2: Consonant చ (pronounced as 'ch' as in 'church') is similar to another rarely used consonant ౘ (closest transliteration 'tsa')

Ambiguity 3: Consonant ర (pronounced as 'ru' in 'run') is similar to another consonant, ఠ (hard 't', close to the 't' in 'stone')

Ambiguity 4: Consonant జ (pronounced as 'ju' in 'justice') is similar to another rarely used consonant ౙ (closest transliteration 'za') and also similar to ఙ ('jna')

Ambiguity 5: Consonant ఝ (pronounced as 'jha', a hard జ with aspiration) was interpreted as 'రు' (pronounced as 'ru' in 'rupee'), which is a combination of consonant ర ('ru') + vowel ఉ (pronounced as 'u' in 'push')

Ambiguity 6: Vowel ఇ (pronounced as 'i' in 'ink') is close to consonant ఞ (pronounced as 'inya'). The 'inya' was not recognized at all in my test data.

Ambiguity 7: కౄ ('cru' as in 'cruel') and గౄ ('grue' as in 'gruesome') were converted to క్య ('kya') and గ్యూ ('gyoo').

Ambiguity 8: ౠ ('rroo') became బూ ('boo')

Some of these could be due to my poor-quality TIFF, but I think some of the ambiguities are genuine and need to be handled.

Please help resolve these ambiguities.

tesseract-telugu.txt https://github.com/tesseract-ocr/tessdata_best/files/2377575/tesseract-telugu.txt
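
For anyone trying to reproduce this, it may help to see the exact code points behind each reported case, since several of the look-alikes are separate Unicode characters and some of the outputs are multi-code-point sequences. The following is only a sketch using Python's standard `unicodedata` module; the `CASES` table and the `describe` helper are names made up here for illustration, and the grouping simply mirrors the numbered list in this post.

```python
import unicodedata

# (ambiguity number, expected text, look-alike or OCR output), copied from the report.
# The grouping is an assumption made for illustration only.
CASES = [
    ("1", "ఒఓఔ", "బభ"),
    ("2", "చ", "ౘ"),
    ("3", "ర", "ఠ"),
    ("4", "జ", "ౙఙ"),
    ("5", "ఝ", "రు"),
    ("6", "ఇ", "ఞ"),
    ("7", "కౄ", "క్య"),
    ("7", "గౄ", "గ్యూ"),
    ("8", "ౠ", "బూ"),
]


def describe(text):
    """Return a (code point, Unicode name) tuple for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


if __name__ == "__main__":
    # Only code points and names are printed, so the output stays ASCII.
    for number, expected, confused in CASES:
        print(f"Ambiguity {number}:")
        print("  expected:", describe(expected))
        print("  confused:", describe(confused))
```

Listing the code points this way makes it easier to tell a genuine glyph confusion (two single letters) apart from a case where one letter was read as a consonant plus a vowel sign.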

Sep 13 '18 03:09 meta-forte

Please also test with tessdata_fast.
Check tel.lstm-unicharset in both tessdata_best and tessdata_fast to ensure that rarely used letters are included.
Take a look at the training source files in langdata_lstm repo under tel.
Verify that the indic/telugu validation rules are correct.
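
The unicharset check can be scripted roughly as below. This is a sketch under assumptions: it treats the unicharset as a text file whose first line is the entry count and whose remaining lines each start with the unichar followed by a space and its properties, and it assumes `tel.lstm-unicharset` has already been unpacked from the traineddata file (for example with `combine_tessdata -u`). The helper names `load_unichars` and `missing_chars`, and the sample entries, are made up for illustration.

```python
def load_unichars(lines):
    """Collect the unichar strings (first field of each entry) from unicharset lines."""
    it = iter(lines)
    next(it, None)  # assumed: the first line holds the number of entries
    unichars = set()
    for line in it:
        line = line.rstrip("\n")
        if line:
            unichars.add(line.split(" ", 1)[0])
    return unichars


def missing_chars(chars, unichars):
    """Return the characters that do not occur inside any unicharset entry."""
    return [ch for ch in chars if not any(ch in entry for entry in unichars)]


if __name__ == "__main__":
    # Simplified, made-up entries standing in for a real file; to check a real one:
    #   with open("tel.lstm-unicharset", encoding="utf-8") as f:
    #       entries = load_unichars(f)
    sample = ["3", "NULL 0", "\u0c2c 1", "\u0c1e 1"]
    entries = load_unichars(sample)
    rare = "\u0c58\u0c59\u0c19\u0c1e\u0c60\u0c44"  # rarely used letters from the report
    print([f"U+{ord(ch):04X}" for ch in missing_chars(rare, entries)])
```

Running the same check against both the tessdata_best and tessdata_fast files would show whether a rarely used letter is absent from one model, in which case it cannot be output no matter how clean the image is.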

Sep 13 '18 06:09 Shreeshrii

Please test with real text not just syllables.

Sep 13 '18 06:09 Shreeshrii

Thank you. I will try that and post an update.

Sep 13 '18 08:09 meta-forte

I created a Word document with valid text, converted it to PDF and then to TIFF using ImageMagick, and ran Tesseract with the tessdata_fast trained data. Recognition was mostly okay. A newspaper clipping had some errors, but that's fine.

That said, the ambiguities stated in items 1 and 7 are still a problem.
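
To make further testing easier to report, the substitutions can be tallied by comparing the ground-truth text with the OCR output. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the `confusions` helper is a hypothetical name, and the comparison is done per code point, so a consonant and its vowel sign are treated as separate items.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(truth, ocr):
    """Count every (ground-truth span, OCR span) pair where the two texts differ."""
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Deletions and insertions show up with an empty string on one side.
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    # Made-up two-character example: the first letter is misread, the second is not.
    truth = "\u0c12\u0c15"
    ocr = "\u0c2c\u0c15"
    for (expected, got), n in confusions(truth, ocr).items():
        print([f"U+{ord(c):04X}" for c in expected], "->",
              [f"U+{ord(c):04X}" for c in got], "x", n)
```

Feeding in the original text file and the recognized output would give a count per confusion, which shows whether items 1 and 7 fail consistently or only occasionally.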

Sep 17 '18 14:09 meta-forte

I will do more testing and update here.

Sep 17 '18 14:09 meta-forte
