LSTM: Training - explicit virama not recognized correctly
In Devanagari script, a virama is used to kill the inherent vowel of a consonant. When the virama is followed by another consonant, the two consonants form a conjunct. Depending on the font used, the conjunct is rendered either as a single glyph or with the explicit virama symbol. Sometimes the font has a glyph for the conjunct but the user wants the explicit virama instead. ZWNJ (U+200C) and ZWJ (U+200D) are used in various Indic scripts to control this.
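To make the distinction concrete, here is a minimal sketch of the two code point sequences involved. The pair KA + SSA is only an illustrative choice, not taken from the attached test files, and the sketch assumes ZWNJ is the joiner used to request the explicit virama form.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Without a joiner, the font is free to render a conjunct glyph.
conjunct = KA + VIRAMA + SSA
# With ZWNJ after the virama, the virama is requested in its explicit form.
explicit = KA + VIRAMA + ZWNJ + SSA


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, sample in (("conjunct", conjunct), ("explicit virama", explicit)):
    print(label, sample)
    for line in describe(sample):
        print("   ", line)
```

The two strings look different on screen only if the font and the shaping engine honour the joiner, so comparing code points is the more reliable way to check what the ground truth actually contains.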
Tesseract outputs the virama symbol when it comes at the end of a word (followed by a space), but not when it is followed by another consonant.
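When comparing ground truth with OCR output, it can help to separate the two cases described above. The following is a rough sketch, not part of Tesseract: `classify_viramas` and `is_devanagari_consonant` are hypothetical helper names, and the consonant test only covers the basic Devanagari consonant ranges.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def is_devanagari_consonant(ch):
    # U+0915..U+0939 are KA..HA; U+0958..U+095F are the nukta consonants.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def classify_viramas(text):
    """Return (index, kind, joiner) for every virama in text.

    kind is 'word-final', 'before-consonant' or 'other';
    joiner is ZWNJ, ZWJ or None, depending on what follows the virama.
    """
    results = []
    for i, ch in enumerate(text):
        if ch != VIRAMA:
            continue
        j = i + 1
        joiner = None
        # Skip one optional joiner directly after the virama.
        if j < len(text) and text[j] in (ZWNJ, ZWJ):
            joiner = text[j]
            j += 1
        if j >= len(text) or text[j].isspace():
            kind = "word-final"
        elif is_devanagari_consonant(text[j]):
            kind = "before-consonant"
        else:
            kind = "other"
        results.append((i, kind, joiner))
    return results


# Illustrative input: one explicit virama before a consonant, one word-final.
sample = "\u0938\u0924\u094D\u200C\u0915\u093E\u0930 \u091C\u0917\u0924\u094D \u0939\u0948"
for index, kind, joiner in classify_viramas(sample):
    print(index, kind, "joiner" if joiner else "no joiner")
```

Running the same function over the ground truth and over the recognized text would show which kind of virama is being dropped.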
The attached text file and the associated box/tiff pairs in different fonts can be used for testing and training this feature.
I tried 'Fine Tune' LSTM training for this but got a number of errors of the form "Encoding of string failed!" and "Can't encode transcription:"
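One possible way to narrow this down, assuming (this is a guess, not confirmed here) that the failures come from code points in the training text that the model's character set does not cover, is to list the uncovered characters. `uncovered_chars` and `known_units` are hypothetical names; how the list of known units is obtained from the model is not shown.

```python
def uncovered_chars(text, known_units):
    """Return the sorted non-space characters of text that appear in no known unit.

    known_units is an iterable of strings, each one a unit the model can
    encode (a unit may span several code points).
    """
    known = set()
    for unit in known_units:
        known.update(unit)
    return sorted({ch for ch in text if not ch.isspace() and ch not in known})


# Illustrative values only: a character set without ZWNJ.
known_units = ["\u0915", "\u0937", "\u094D"]
line = "\u0915\u094D\u200C\u0937"
for ch in uncovered_chars(line, known_units):
    print(f"U+{ord(ch):04X}")
```

This check is deliberately coarse: it only looks at individual code points, so a line can pass it and still fail to encode if the model expects particular multi-code-point units.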
Some sample images of real-life examples, with Hindi and Sanskrit text containing an explicit virama followed by a consonant
Please see section 12.1 of http://www.unicode.org/versions/Unicode9.0.0/ch12.pdf for a description of the virama.