
Need Mongolian traineddata

Open Skeetfly opened this issue 7 years ago • 9 comments

I'm thinking about using Tesseract for LPR. How good is it?

Skeetfly avatar Nov 16 '17 07:11 Skeetfly

Does anyone have an update on training the Mongolian language?

scubess avatar Feb 08 '18 17:02 scubess

There are some repositories on GitHub: khangaikh/tesseract-mon, dolugen/tesseract-mnc, maybe more.

But there seems to be code missing in Tesseract for Mongolian, see ccmain/pageiterator.cpp.

stweil avatar Feb 17 '18 16:02 stweil

http://www.alanwood.net/unicode/mongolian.html

The Mongolian range was introduced with version 3.0 of the Unicode Standard. Mongolian is the caseless script used for writing Menggu (the language of the Chinese province of Nei Menggu) and for the Manchu, Sibe and Todo languages. It was formerly used for Khalkha, the national language of Mongolia, but is now mainly restricted to religious texts, having been replaced by Cyrillic for other uses. Mongolian is written vertically from left to right.

khangaikh/tesseract-mon, dolugen/tesseract-mnc,

Both of these are for Mongolian-Cyrillic

The Tesseract repos also have mon.traineddata; I am not sure whether it is Cyrillic or otherwise.

https://github.com/tesseract-ocr/tessdata_fast/blob/master/mon.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/master/mon.traineddata

Shreeshrii avatar Feb 17 '18 18:02 Shreeshrii

I checked the wordlist from mon.traineddata. Here is a sample from it:

Хи
Хил
Хилчид
Хилчин
Хилчний
Хилээс
Хилээр
Хилэн
Хилэнц
Хилэнцийн
Хилэнцийнхэн
Хилари
Хилл
Хиллари

So it looks like it is Mongolian-Cyrillic.
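The check above can also be done mechanically. Here is a minimal Python sketch, assuming the wordlist has already been extracted into a list of strings; the function names are made up for illustration and are not part of Tesseract. It only compares code points against the Unicode Cyrillic block (U+0400 to U+04FF) and the Mongolian block (U+1800 to U+18AF).

```python
from collections import Counter

# Unicode block boundaries (end of each range is exclusive).
CYRILLIC_BLOCK = range(0x0400, 0x0500)   # U+0400..U+04FF
MONGOLIAN_BLOCK = range(0x1800, 0x18B0)  # U+1800..U+18AF


def char_script(ch):
    """Classify one character by Unicode block (illustrative helper)."""
    cp = ord(ch)
    if cp in CYRILLIC_BLOCK:
        return "Cyrillic"
    if cp in MONGOLIAN_BLOCK:
        return "Mongolian"
    return "other"


def dominant_script(words):
    """Return the most frequent script among all non-space characters."""
    counts = Counter(
        char_script(ch) for word in words for ch in word if not ch.isspace()
    )
    if not counts:
        return "other"
    return counts.most_common(1)[0][0]


# A few entries from the mon.traineddata wordlist sample above.
sample = ["Хил", "Хилчид", "Хилэнц", "Хиллари"]
print(dominant_script(sample))  # Cyrillic
```

A wordlist in traditional Mongolian script would come out as "Mongolian" instead, since its letters sit in the U+1800 block.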

The most recent Mongolian alphabet is based on the Cyrillic script, more specifically the Russian alphabet plus the letters Өө /ö/ and Үү /ü/. It was introduced in the 1940s and has been in use as the official writing system of Mongolia ever since.

ref: https://en.wikipedia.org/wiki/Mongolian_writing_systems

@Skeetfly @scubess

Were you looking for Mongolian-Cyrillic or the traditional Mongolian traineddata?

Shreeshrii avatar Feb 18 '18 12:02 Shreeshrii

@stweil

Mongolian written in Mongolian script is written vertically from left to right. https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/pageiterator.cpp#L543 seems related to that.

However, the mon.traineddata, which is Mongolian in Cyrillic, does not require it.

Here is a sample of a wordlist for Mongolian written in Mongolian script, taken from http://crubadan.org/languages/mn-Mong

ᠢᠨ 3802
ᠡᠮᠦᠨᠡᠡ 2800
ᠠ 2670
ᠰᠠᠷᠠᠠ 2083
ᠢ 1830
ᠦᠭᠡᠢ 1574
ᠳᠤ 1543
ᠡ 1501
ᠦᠨ 1453
ᠨᠢ 1422
ᠭᠠᠷᠠᠭ 1388
ᠪᠠᠢᠨᠠᠠ 1315
ᠤᠨ 1220
ᠶᠢᠨ 1178
ᠤ 1058
ᠳᠦ 1026

Shreeshrii avatar Feb 18 '18 12:02 Shreeshrii

Related Info:

http://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Mong

https://www.ethnologue.com/language/mvf

https://groups.google.com/forum/#!msg/tesseract-ocr/EjnYPwmx7UM/lmzi37oKjQsJ (how add a new language)

tesseract mvf.baiti.exp0.tif mvf.baiti.exp0 -l mvf batch.nochop makebox

http://www.babelstone.co.uk/Mongolian/Report170.pdf

http://www.babelstone.co.uk/Mongolian/Report170A.pdf

http://www.babelstone.co.uk/Mongolian/Report170B.pdf

https://r12a.github.io/mongolian-variants/

https://r12a.github.io/scripts/links?script=mongolian

Shreeshrii avatar Feb 18 '18 12:02 Shreeshrii

@Shreeshrii @stweil Hi guys,

Thanks for your replies! As you mentioned, @Shreeshrii, I am not sure either whether the tessdata_best mon.traineddata file was trained on traditional or Cyrillic Mongolian. Separately, I tried to integrate the mon.traineddata file into the iOS app I am working on. I used the tessdata_best mon.traineddata, but it crashes every time with:

actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES

I found that the traineddata version did not match the Tesseract engine version, which is a different issue from what we are talking about here. I got it working on Cyrillic text by training with Tesseract 3.03-rc1 and Leptonica 1.71. Based on the sample training text, I can see that Mongolian Cyrillic is recognised correctly. I put it in a repo for people who are looking for Mongolian Cyrillic trained data: https://github.com/scubess/Tesseract-Mongolian-Training

@Shreeshrii I will update the traineddata file with a wordlist too.

@Skeetfly For LPR, you can apply a regex to the recognised result from Tesseract.

I hope it's useful.
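As a rough illustration of the regex suggestion, here is a minimal Python sketch that filters OCR output down to plate-like strings. The plate layout (four digits followed by three Cyrillic letters) and the names `PLATE_RE` and `extract_plates` are assumptions made for this example only; adjust the pattern to the plates you actually need to read.

```python
import re

# Assumed plate layout, for illustration only: 4 digits, an optional
# space, then 3 uppercase Cyrillic letters (А-Я plus Ө and Ү).
PLATE_RE = re.compile(r"(\d{4})\s?([А-ЯӨҮ]{3})")


def extract_plates(ocr_text):
    """Return plate-like strings found in raw OCR text, without spaces."""
    # Upper-case first so lower-case OCR output still matches.
    matches = PLATE_RE.findall(ocr_text.upper())
    return [digits + letters for digits, letters in matches]


# The input string stands in for text returned by Tesseract.
print(extract_plates("noise 1234 УБА more noise"))  # ['1234УБА']
```

Anything that does not fit the pattern is dropped, so stray characters around the plate in the OCR result are filtered out.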

scubess avatar Feb 18 '18 21:02 scubess

Is there any progress in the work on traditional Mongolian?

suyie001 avatar Mar 01 '24 12:03 suyie001

I don't know of anyone who works on it.

stweil avatar Mar 01 '24 13:03 stweil