LSTM: Training - explicit viraama not recognized correctly

Open Shreeshrii opened this issue 7 years ago • 2 comments

In Devanagari script, a virama (viraama, U+094D) suppresses the inherent vowel of a consonant. When the virama is followed by another consonant, the two consonants form a conjunct. Depending on the font, the conjunct is rendered either as a single ligature glyph or as the first consonant carrying a visible (explicit) viraama sign. Sometimes the font has a glyph for the conjunct but the user wants the explicit viraama anyway. ZWNJ (U+200C) and ZWJ (U+200D) are used in various Indic scripts to control this choice.
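To make the encoding side concrete, here is a minimal Python sketch of the three code point sequences involved, following the conventions described in the Unicode Devanagari chapter: consonant + virama + consonant lets the font pick a conjunct, inserting ZWNJ after the virama requests the explicit virama, and inserting ZWJ requests a half form. The choice of KA and SSA as the example pair and the helper name `describe` are mine, not from the attached files.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

KA = "\u0915"      # example consonants chosen for illustration
SSA = "\u0937"

# Font decides: usually rendered as the conjunct ligature.
CONJUNCT = KA + VIRAMA + SSA
# ZWNJ after the virama asks for the visible (explicit) virama.
EXPLICIT = KA + VIRAMA + ZWNJ + SSA
# ZWJ after the virama asks for a half form of the first consonant.
HALF_FORM = KA + VIRAMA + ZWJ + SSA


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


if __name__ == "__main__":
    for label, seq in [("conjunct", CONJUNCT),
                       ("explicit", EXPLICIT),
                       ("half form", HALF_FORM)]:
        print(label)
        for line in describe(seq):
            print("   ", line)
```

Dumping a ground-truth line this way shows whether the transcription actually contains ZWNJ or ZWJ, which is invisible in most editors.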

Tesseract outputs the viraama sign when it comes at the end of a word (followed by a space), but not when it is followed by another consonant.
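When comparing ground truth with OCR output, it can help to count how each viraama is used. The following is a rough sketch, not part of Tesseract: the function name `virama_contexts` is hypothetical, and the consonant test covers only the main Devanagari ranges U+0915..U+0939 and U+0958..U+095F, which is a simplification.

```python
VIRAMA = "\u094D"
ZWNJ = "\u200C"
ZWJ = "\u200D"


def is_devanagari_consonant(ch):
    """Simplified check: KA..HA plus the precomposed nukta consonants."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def virama_contexts(text):
    """Count viramas by what immediately follows them."""
    counts = {
        "word_final": 0,        # followed by whitespace or end of text
        "before_consonant": 0,  # font may render a conjunct or explicit virama
        "before_zwnj": 0,       # explicit virama requested
        "before_zwj": 0,        # half form requested
        "other": 0,             # punctuation, digits, etc.
    }
    for i, ch in enumerate(text):
        if ch != VIRAMA:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == "" or nxt.isspace():
            counts["word_final"] += 1
        elif nxt == ZWNJ:
            counts["before_zwnj"] += 1
        elif nxt == ZWJ:
            counts["before_zwj"] += 1
        elif is_devanagari_consonant(nxt):
            counts["before_consonant"] += 1
        else:
            counts["other"] += 1
    return counts
```

Running this on both the training text and the recognized text gives a quick view of whether the viraamas that go missing are the ones in the `before_consonant` and `before_zwnj` buckets.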

The attached text file and the associated box/tiff pairs, rendered in different fonts, can be used for testing and training this feature.

I tried 'Fine Tune' LSTM training for this, but I get a number of errors of the form "Encoding of string failed!" and "Can't encode transcription:". The training text and box/tiff files are attached:
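One possible cause, which is not confirmed in this thread, is that the transcription contains characters the model's unicharset does not cover. The sketch below checks for that under a stated assumption about the unicharset text format: the first line is the entry count, every later line begins with the symbol followed by a space, and the space character is stored as `NULL`. Both function names are hypothetical helpers, not Tesseract APIs.

```python
def chars_from_unicharset_lines(lines):
    """Collect the code points used by unicharset entries.

    Assumes lines[0] is the entry count and each later line starts
    with the symbol, then a space, then its properties.
    """
    chars = set()
    for line in lines[1:]:
        symbol = line.split(" ", 1)[0]
        if not symbol or symbol == "NULL":  # assumed placeholder for space
            continue
        chars.update(symbol)  # an entry may hold several code points
    return chars


def unencodable(text, known_chars):
    """Return the distinct non-space characters of text not in known_chars,
    in order of first appearance."""
    missing = []
    for ch in text:
        if ch.isspace() or ch in known_chars or ch in missing:
            continue
        missing.append(ch)
    return missing
```

A typical use would be to read the unicharset extracted from the base model with `open(path, encoding="utf-8").read().splitlines()`, pass the lines to `chars_from_unicharset_lines`, and then call `unencodable` on each training line. If ZWNJ (U+200C) shows up in the result, that would be consistent with the failures above, though other causes are possible.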

san.training_text_viraam.txt

san.viraama.box-tiff.zip

san shree-dv0726-ot exp0

Shreeshrii, Dec 23 '16 09:12

Some sample images from real-life Hindi and Sanskrit text, each with an explicit viraama followed by a consonant:

bg-hin-san010 bg-hin-san012

bg-hin-san003 bg-hin-san006

Shreeshrii, Dec 23 '16 09:12

Please see section 12.1 of http://www.unicode.org/versions/Unicode9.0.0/ch12.pdf for a description of the viraama.

Shreeshrii, Jan 02 '17 08:01