tessdata
Need Mongolian traineddata
I'm thinking about using Tesseract for LPR. How good is it?
Does anyone have an update on training the Mongolian language?
There are some repositories on GitHub: khangaikh/tesseract-mon, dolugen/tesseract-mnc, maybe more.
But there seems to be code missing in Tesseract for Mongolian, see ccmain/pageiterator.cpp.
http://www.alanwood.net/unicode/mongolian.html
The Mongolian range was introduced with version 3.0 of the Unicode Standard. Mongolian is the caseless script used for writing Menggu (the language of the Chinese province of Nei Menggu) and for the Manchu, Sibe and Todo languages. It was formerly used for Khalkha, the national language of Mongolia, but is now mainly restricted to religious texts, having been replaced by Cyrillic for other uses. Mongolian is written vertically from left to right.
khangaikh/tesseract-mon, dolugen/tesseract-mnc
Both of these are for Mongolian-Cyrillic.
The Tesseract repos also have mon.traineddata; I'm not sure whether it is Cyrillic or otherwise.
https://github.com/tesseract-ocr/tessdata_fast/blob/master/mon.traineddata
https://github.com/tesseract-ocr/tessdata_best/blob/master/mon.traineddata
I checked the wordlist from mon.traineddata. Here is a sample from it:
Хи
Хил
Хилчид
Хилчин
Хилчний
Хилээс
Хилээр
Хилэн
Хилэнц
Хилэнцийн
Хилэнцийнхэн
Хилари
Хилл
Хиллари
So it looks like it is Mongolian-Cyrillic.
The most recent Mongolian alphabet is based on the Cyrillic script, more specifically the Russian alphabet plus the letters Өө /ö/ and Үү /ü/. It was introduced in the 1940s and has been in use as the official writing system of Mongolia ever since.
ref: https://en.wikipedia.org/wiki/Mongolian_writing_systems
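The check above was done by eye. A minimal sketch of doing the same check in code, assuming the wordlist has already been extracted to plain text: it counts how many characters fall in the Unicode Cyrillic block (U+0400 to U+04FF) versus the Mongolian block (U+1800 to U+18AF). The function names `classify_script` and `dominant_script` are made up for illustration and are not part of Tesseract.

```python
from collections import Counter

# Unicode block ranges, as defined by the Unicode Standard.
CYRILLIC = range(0x0400, 0x0500)   # Cyrillic block
MONGOLIAN = range(0x1800, 0x18B0)  # Mongolian block (traditional script)


def classify_script(words):
    """Count characters per script across an iterable of words."""
    counts = Counter()
    for word in words:
        for ch in word:
            cp = ord(ch)
            if cp in CYRILLIC:
                counts["cyrillic"] += 1
            elif cp in MONGOLIAN:
                counts["mongolian"] += 1
            elif not ch.isspace():
                counts["other"] += 1
    return counts


def dominant_script(words):
    """Return the most frequent script label, or None for empty input."""
    counts = classify_script(words)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # A few entries copied from the wordlist sample in this thread.
    print(dominant_script(["Хил", "Хилчид", "Хилэнц"]))
```

Running the same function over a whole extracted wordlist gives a quick answer to "Cyrillic or traditional?" without reading the entries one by one.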
@Skeetfly @scubess
Were you looking for Mongolian-Cyrillic or the traditional Mongolian traineddata?
@stweil
Mongolian written in the Mongolian script runs vertically, from left to right. https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/pageiterator.cpp#L543 seems related to that.
However, mon.traineddata, which is Mongolian in Cyrillic, does not require it.
Here is a sample of a wordlist for Mongolian written in the Mongolian script, taken from http://crubadan.org/languages/mn-Mong
ᠢᠨ 3802
ᠡᠮᠦᠨᠡᠡ 2800
ᠠ 2670
ᠰᠠᠷᠠᠠ 2083
ᠢ 1830
ᠦᠭᠡᠢ 1574
ᠳᠤ 1543
ᠡ 1501
ᠦᠨ 1453
ᠨᠢ 1422
ᠭᠠᠷᠠᠭ 1388
ᠪᠠᠢᠨᠠᠠ 1315
ᠤᠨ 1220
ᠶᠢᠨ 1178
ᠤ 1058
ᠳᠦ 1026
Related Info:
http://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Mong
https://www.ethnologue.com/language/mvf
https://groups.google.com/forum/#!msg/tesseract-ocr/EjnYPwmx7UM/lmzi37oKjQsJ (how to add a new language)
tesseract mvf.baiti.exp0.tif mvf.baiti.exp0 -l mvf batch.nochop makebox
http://www.babelstone.co.uk/Mongolian/Report170.pdf http://www.babelstone.co.uk/Mongolian/Report170A.pdf http://www.babelstone.co.uk/Mongolian/Report170B.pdf
https://r12a.github.io/mongolian-variants/ https://r12a.github.io/scripts/links?script=mongolian
@Shreeshrii @stweil Hi guys,
Thanks for your replies! As you mentioned, @Shreeshrii, I am not sure either whether the tessdata_best mon.traineddata file was trained on traditional or Cyrillic script. Separately, I tried to integrate the mon.traineddata file into the iOS app I am working on. I used the tessdata_best mon.traineddata, but it crashes every time with:
actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES
I found that the traineddata version did not match the Tesseract engine version, which is a different issue from what we are talking about here. I got it working on Cyrillic text when I trained with Tesseract 3.03-rc1 and Leptonica 1.71. Thanks for your reply; based on the sample training text, I can see that Mongolian Cyrillic is recognised correctly. I put it in a repo for people who are looking for Mongolian Cyrillic trained data: https://github.com/scubess/Tesseract-Mongolian-Training
@Shreeshrii I will update the traineddata file with the wordlist too.
@Skeetfly for LPR, you can apply a regex to the recognised result from Tesseract.
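A minimal sketch of that regex post-processing step, assuming "lpr" here means license plate recognition and that the OCR result is already available as a string. The plate format used (four digits followed by three uppercase Latin or Cyrillic letters) is a hypothetical example, not an official specification, and `extract_plates` is an illustrative name; adjust the pattern to the plates you actually need to read.

```python
import re

# Hypothetical plate format, for illustration only: four digits followed by
# three uppercase letters (Latin or Cyrillic, including Ө and Ү).
# The lookarounds reject longer digit or letter runs around the match.
PLATE_PATTERN = re.compile(
    r"(?<!\d)(\d{4})\s*([A-ZА-ЯӨҮ]{3})(?![A-ZА-ЯӨҮ])"
)


def extract_plates(ocr_text):
    """Return plate-like strings found in raw OCR output."""
    # Uppercase the text, then drop stray punctuation that OCR often emits
    # (anything that is neither a word character nor whitespace).
    cleaned = re.sub(r"[^\w\s]", "", ocr_text.upper())
    return [m.group(1) + m.group(2) for m in PLATE_PATTERN.finditer(cleaned)]


if __name__ == "__main__":
    # Simulated OCR output with a stray hyphen and lowercase letters.
    print(extract_plates("12-34 уба\n"))
```

Filtering this way discards recognised text that cannot be a plate, which is often easier than trying to make the OCR output itself perfectly clean.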
I hope it's useful...
Is there any progress in the work on traditional Mongolian?
I don't know of anyone who works on it.