tessdata_best
Telugu unicode ambiguities
Hi, I created test text data (mostly made-up individual characters; see attachment) and converted it to a TIFF file using 'jTessBoxEditorFX' with the font 'Noto Sans Telugu' at 8pt. I then ran it through the tessdata_best Telugu trained data and noticed a few recognition errors. I believe these are due to ambiguous glyphs.
Ambiguity 1: Telugu has three vowels that look close to two consonants. Vowels: 1) ఒ (pronounced as 'o' in 'so') 2) ఓ (pronounced as 'oa' in 'goal') 3) ఔ (pronounced as 'ou' in 'ounce' or 'pound')
Similar-looking consonants: 1) బ (pronounced as 'bu' in 'bus') 2) భ (the same sound as బ but uttered with stress and aspiration; imagine saying 'bus' as 'bhus')
Ambiguity 2: Consonant చ (pronounced as 'ch' as in 'church') is similar to another rarely used consonant ౘ (closest transliteration 'tsa')
Ambiguity 3: Consonant ర (pronounced as 'ru' as in 'run') is similar to another consonant ఠ ( hard 't' - close to the 't' in 'stone')
Ambiguity 4: Consonant జ (pronounced as 'ju' as in 'justice') is similar to another rarely used consonant ౙ (closest transliteration 'za') and also similar to ఙ ('jna')
Ambiguity 5: consonant ఝ (pronounced as 'jha' - hard జ with aspiration ) was interpreted as 'రు' (pronounced as 'ru' in 'rupee' ) which is a combination of Consonant ర ('ru') + vowel ఉ(pronounced as 'u' in 'push')
Ambiguity 6: vowel ఇ ( pronounced as 'i' in 'ink') is close to consonant ఞ (pronounced as 'inya'). The 'inya' did not get recognized at all in my test data.
Ambiguity 7: కౄ ('cru' as in 'cruel') and గౄ ('grue' as in 'gruesome') were converted to క్య ('kya') and గ్యూ ('gyoo').
Ambiguity 8: ౠ ('rroo') became బూ ('boo')
I guess some of these errors could be due to my poor-quality TIFF, but I think some of the ambiguities are genuine and need to be handled.
Please help resolve these ambiguities.
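To make the reported pairs easier to discuss, here is a minimal Python sketch that lists the Unicode code points and names behind each group. The grouping (`CONFUSABLE_GROUPS`) and the helper names `describe` and `report` are just illustrations built from this report, not anything that ships with Tesseract.

```python
import unicodedata

# Confusable groups exactly as reported in this issue.
# This is the reporter's list, not an official Tesseract ambiguity table.
CONFUSABLE_GROUPS = {
    "Ambiguity 1": ["ఒ", "ఓ", "ఔ", "బ", "భ"],
    "Ambiguity 2": ["చ", "ౘ"],
    "Ambiguity 3": ["ర", "ఠ"],
    "Ambiguity 4": ["జ", "ౙ", "ఙ"],
    "Ambiguity 5": ["ఝ", "రు"],
    "Ambiguity 6": ["ఇ", "ఞ"],
    "Ambiguity 7": ["కౄ", "క్య", "గౄ", "గ్యూ"],
    "Ambiguity 8": ["ౠ", "బూ"],
}


def describe(text):
    """Return a (code point, Unicode name) pair for every code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def report(groups=CONFUSABLE_GROUPS):
    """Build one line per group entry showing the code points it is made of."""
    lines = []
    for label, entries in groups.items():
        for entry in entries:
            parts = ", ".join(f"{cp} {name}" for cp, name in describe(entry))
            lines.append(f"{label}: {parts}")
    return lines
```

Printing `"\n".join(report())` shows which entries are single letters and which are consonant plus vowel-sign sequences, which matters when checking the unicharset later.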
- Please also test with tessdata_fast.
- Check tel.lstm-unicharset in both tessdata_best and tessdata_fast to ensure that rarely used letters are included.
- Take a look at the training source files in langdata_lstm repo under tel.
- Verify that the indic/telugu validation rules are correct.
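The unicharset check can be scripted. The following is a rough sketch that assumes the unicharset has already been extracted from the traineddata file as plain text (for example with `combine_tessdata -u`), that its first line is the entry count, and that each following line starts with the unichar followed by a space. The function names are hypothetical helpers, not part of Tesseract.

```python
def unicharset_code_points(unicharset_text):
    """Collect every code point that appears in the unichar column.

    Assumes the first line is the entry count and each later line begins
    with the unichar, followed by a space and its properties.
    Special entries such as NULL only add ASCII letters, which does not
    affect a check for Telugu characters.
    """
    covered = set()
    for line in unicharset_text.splitlines()[1:]:
        unichar = line.split(" ", 1)[0]
        if unichar:
            covered.update(unichar)
    return covered


def missing_code_points(targets, unicharset_text):
    """Return the sorted code points used by targets but absent from the unicharset."""
    covered = unicharset_code_points(unicharset_text)
    return sorted({ch for target in targets for ch in target if ch not in covered})
```

Usage would be along the lines of `missing_code_points(["ౘ", "ౙ", "ౠ", "కౄ"], open("tel.lstm-unicharset", encoding="utf-8").read())`, run once against the tessdata_best file and once against the tessdata_fast file. An empty result only means the code points are present; it says nothing about how well they were trained.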
On Thu 13 Sep, 2018, 8:53 AM Manas Marthi wrote:
tesseract-telugu.txt https://github.com/tesseract-ocr/tessdata_best/files/2377575/tesseract-telugu.txt
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tessdata_best/issues/32, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_ow6-Hp5u_rar7PuPyzPF2xepLL3Nks5uac-xgaJpZM4Wmghi .
Please test with real text not just syllables.
On Thu 13 Sep, 2018, 12:22 PM Shree Devi Kumar wrote:
Please also test with tessdata_fast.
Check tel.lstm-unicharset in both tessdata_best and tessdata_fast to ensure that rarely used letters are included.
Take a look at the training source files in langdata_lstm repo under tel.
Verify that the indic/telugu validation rules are correct.
Thank you. I will try this and post an update.
I created a Word document with valid text, converted it to PDF and then to TIFF using ImageMagick, and ran Tesseract with the tessdata_fast trained data. Recognition was mostly okay. A newspaper clipping had some errors, but that's fine.
That said, the ambiguities described in items 1 and 7 are still a problem.
I will do more testing and update here.
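For the follow-up testing, one way to make the comparison repeatable is to tally which substrings of the ground-truth text were turned into something else by OCR. This is only a sketch using Python's standard `difflib`; the function name `tally_confusions` is made up for illustration, and the comparison is per code point, so a consonant plus vowel sign may show up split across entries.

```python
from collections import Counter
from difflib import SequenceMatcher


def tally_confusions(expected, recognized):
    """Count (expected, recognized) substring pairs that differ.

    Pairs with an empty first element are insertions by the OCR engine;
    pairs with an empty second element are characters that were dropped.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[(expected[i1:i2], recognized[j1:j2])] += 1
    return counts
```

Feeding it the ground-truth text and the Tesseract output for the same page, then looking at `tally_confusions(...).most_common()`, should show whether the pairs from items 1 and 7 dominate the errors on real text or only appear on isolated syllables.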