
Would like to help for Burmese/Myanmar language training?

Open herzcthu opened this issue 10 years ago • 57 comments

Hello, I would like to help. I've already cloned all the repositories. How do I start?

herzcthu avatar Jul 03 '15 20:07 herzcthu

What issue is there with Burmese/Myanmar language?

zdenop avatar Jul 03 '15 21:07 zdenop

We have two kinds of Myanmar fonts: fonts that follow standard Unicode and fonts that use a non-standard encoding. When I check the langdata files for Burmese, most words are incorrect. I guess the data was generated from a mix of non-standard and standard Unicode content. When I try to scan an image with Burmese characters written in the Padauk font, the output is not readable. I would like to know the method you used to generate the Burmese training files. Where did you get the original data? I can check whether it is standard Unicode content or not.

herzcthu avatar Jul 04 '15 07:07 herzcthu

I think the real issue is not only standard versus non-standard Unicode, but also the wrong method of extracting data from the source. The source data needs to be segmented correctly to get correct single words. Myanmar language users do not much care about adding a 'space' character between words, so if you treat all characters between two 'space' characters as a word, two or more words are wrongly taken as a single word. I found that most word lists here, especially the bi-grams, hold Myanmar phrases that are too long. That makes the wordlists unusable and the results of applying them totally unpredictable. So I think you need to extract data from a source using a dictionary-lookup approach. Of course, you need to build your own wordlist manually or use those made by others. Also, Myanmar is a syllable-based language: one or more Myanmar letters combine to form a syllable, and one or more syllables join to form a word. So it is advisable to detect syllables first, which gives a large performance improvement in dictionary lookup.
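A minimal sketch of the syllable-then-dictionary approach described above. The syllable rule is a simplified heuristic (break before a consonant unless it is stacked or closes the previous syllable), not a complete Myanmar syllable grammar, and the names `split_syllables`, `segment_words` and the `lexicon` argument are hypothetical, chosen only for illustration.

```python
import re

# Simplified heuristic: a syllable starts at a consonant (U+1000-U+1021)
# unless that consonant is stacked (preceded by virama U+1039) or closes
# the previous syllable (followed by asat U+103A or virama U+1039).
_SYLLABLE_START = re.compile(r"(?<!\u1039)[\u1000-\u1021](?![\u103A\u1039])")


def split_syllables(text):
    """Split a run of Myanmar text into approximate syllables."""
    starts = [m.start() for m in _SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    starts.append(len(text))
    return [text[a:b] for a, b in zip(starts, starts[1:]) if a < b]


def segment_words(text, lexicon, max_syllables=6):
    """Greedy longest-match word segmentation over syllables.

    `lexicon` is a set of known words; a single syllable is emitted
    as-is when no longer dictionary entry matches.
    """
    syllables = split_syllables(text)
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words
```

Matching on syllables rather than on single code points keeps the number of lookups small, which is the performance gain mentioned above. A real wordlist would replace the toy `lexicon`.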

minthanthtoo avatar Aug 17 '15 09:08 minthanthtoo

@herzcthu @minthanthtoo

Please add some good sources of standard unicode fonts and sample texts and word frequency lists to https://github.com/tesseract-ocr/langdata/issues/46

Shreeshrii avatar Feb 04 '17 13:02 Shreeshrii

https://my.wikipedia.org/ All contents on wikipedia are in standard unicode font.

herzcthu avatar Feb 11 '17 13:02 herzcthu

@zdenop The issue is with the training data itself. The person who prepared the data does not know the Myanmar language. The majority of the training data has misspellings and is mixed with a hacked version of Myanmar Unicode, as said by @herzcthu. You can imagine rice and spaghetti mixed in a bowl. Also, it is not segmented properly, as @minthanthtoo pointed out. Any suggestions on how to prepare training data?

nengine avatar Feb 13 '17 22:02 nengine

Please see Ray's comment at https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

about how the training data is being built for the 4.0 LSTM training. I don't think they are using the training_text file in langdata.

Shreeshrii avatar Feb 14 '17 04:02 Shreeshrii

Thanks @Shreeshrii. Would ./tesstrain.sh automatically create .tif/box pairs from the langdata directory for 4.0 LSTM training?

nengine avatar Mar 06 '17 01:03 nengine

Yes. tesstrain.sh creates tif/box pairs that can be used for LSTM training. Please see the wiki pages for details. You need a large amount of training data for good training. See Ray's comments about the LSTM training process.

Shreeshrii avatar Mar 10 '17 09:03 Shreeshrii

https://github.com/tesseract-ocr/tesseract/issues/654

will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better. Myanmar in particular needs improvement, as the www data is littered with dotted circles, and the unicode book does not adequately describe the syntax for a well-formed grapheme in Myanmar (or any other language for that matter).

Shreeshrii avatar Mar 30 '17 12:03 Shreeshrii

copied from https://github.com/tesseract-ocr/langdata/issues/46

@herzcthu commented

Myanmar wordlists https://github.com/kanaung/wordlists


https://github.com/kanyawtech/myanmar-karen-word-lists/blob/master/burmese-word-list.txt?raw=true

Is this a good wordlist in standard Unicode for Myanmar?

Shreeshrii avatar Mar 30 '17 12:03 Shreeshrii

These are the most common words in Myanmar, but it is not a complete list. The definition of a word itself is tricky in the Myanmar language because there are many ways syllables can be combined to form a word. I am not so sure how Tesseract training works, but it may be better to train on syllables instead of words (clusters of syllables). That is to say, each syllable must first be detected and then classified. Classifying an entire word may be too difficult, though I may not be fully aware of Tesseract's capabilities.

nengine avatar Mar 30 '17 15:03 nengine

Manually Constructed Context-Free Grammar For Myanmar Syllable Structure http://www.aclweb.org/anthology/E12-3004

amitdo avatar Mar 30 '17 15:03 amitdo

Representing Myanmar in Unicode Details and Examples http://unicode.org/notes/tn11/myanmar_uni-v2.pdf http://www.tuninst.net/LINGUISTICS/myanmar-unicode/myanmar-unicode.htm

Creating and Supporting OpenType Fonts for Myanmar Script https://www.microsoft.com/typography/OpenTypeDev/myanmar/intro.htm

Myanmar script notes http://rishida.net/scripts/myanmar/#shaping

https://www.researchgate.net/publication/253745697_A_Rule-based_Syllable_Segmentation_of_Myanmar_Text

amitdo avatar Mar 30 '17 16:03 amitdo

@theraysmith

I used a few words from the Burmese wordlist and the landing page of Wikipedia as a small training sample to test Myanmar. Both of these are supposed to be in standard Unicode for Myanmar.

The training text and generated unicharset are attached. I got a number of errors while building the unicharset. Maybe the Myanmar.unicharset in langdata needs to be updated?


=== Phase UP: Generating unicharset and unichar properties files ===
[Fri Mar 31 16:07:02 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.OzCvDLSWBp/mya/ /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Wrote unicharset file /tmp/tmp.OzCvDLSWBp/mya//unicharset.
[Fri Mar 31 16:07:05 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -O /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -X /tmp/tmp.OzCvDLSWBp/mya/mya.xheights --script_dir=../langdata
Loaded unicharset of size 217 from file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset
Setting unichar properties
Other case È of è is not in unicharset
Other case Ë of ë is not in unicharset
Warning: properties incomplete for index 4 = ယ်
Warning: properties incomplete for index 5 = လ်
Warning: properties incomplete for index 8 = မ်
Warning: properties incomplete for index 10 = င်
Warning: properties incomplete for index 16 = မှ
Warning: properties incomplete for index 22 = ရှ
Warning: properties incomplete for index 28 = ဖွဲ့
Warning: properties incomplete for index 30 = ည်
Warning: properties incomplete for index 36 = ပ်
Warning: properties incomplete for index 37 = ဖြ
Warning: properties incomplete for index 38 = င့်
Warning: properties incomplete for index 41 = က်
Warning: properties incomplete for index 42 = နှာ
Warning: properties incomplete for index 43 = ည်း
Warning: properties incomplete for index 44 = တ်
Warning: properties incomplete for index 45 = မှု
Warning: properties incomplete for index 47 = မ်း
Warning: properties incomplete for index 50 = ခြ
Warning: properties incomplete for index 51 = င်း
Warning: properties incomplete for index 52 = ကြော
Warning: properties incomplete for index 53 = နှို
Warning: properties incomplete for index 54 = ချွ
Warning: properties incomplete for index 63 = ပွဲ
Warning: properties incomplete for index 64 = တွေ
Warning: properties incomplete for index 65 = မှာ
Warning: properties incomplete for index 66 = ဆွေး
Warning: properties incomplete for index 67 = နွေး
Warning: properties incomplete for index 73 = ထွေ
Warning: properties incomplete for index 78 = မြ
Warning: properties incomplete for index 79 = စ်
Warning: properties incomplete for index 80 = မြို့
Warning: properties incomplete for index 83 = န်
Warning: properties incomplete for index 86 = ကွ
Warning: properties incomplete for index 89 = သွ
Warning: properties incomplete for index 92 = ဖ်
Warning: properties incomplete for index 96 = ခြေ
Warning: properties incomplete for index 100 = မျှ
Warning: properties incomplete for index 101 = ဂြို
Warning: properties incomplete for index 102 = ဟ်
Warning: properties incomplete for index 103 = တွ
Warning: properties incomplete for index 110 = ရှု
Warning: properties incomplete for index 119 = ညွှ
Warning: properties incomplete for index 120 = န်း
Warning: properties incomplete for index 123 = ကြ
Warning: properties incomplete for index 124 = ည့်
Warning: properties incomplete for index 125 = နှ
Warning: properties incomplete for index 126 = ထွ
Warning: properties incomplete for index 130 = ရှိ
Warning: properties incomplete for index 132 = ကြို
Warning: properties incomplete for index 140 = ဉ်
Warning: properties incomplete for index 150 = လှ
Warning: properties incomplete for index 151 = သွား
Warning: properties incomplete for index 153 = ထွာ
Warning: properties incomplete for index 154 = ထွား
Warning: properties incomplete for index 157 = ဖွံ့
Warning: properties incomplete for index 158 = မွ
Warning: properties incomplete for index 159 = လျော်
Warning: properties incomplete for index 162 = ပြော
Warning: properties incomplete for index 163 = ထွေး
Warning: properties incomplete for index 164 = ယှ
Warning: properties incomplete for index 168 = ဘွား
Warning: properties incomplete for index 179 = လွ
Warning: properties incomplete for index 182 = န့်
Warning: properties incomplete for index 189 = စွဲ
Warning: properties incomplete for index 192 = ပြီး
Warning: properties incomplete for index 197 = မြေ
Warning: properties incomplete for index 202 = ကွာ
Warning: properties incomplete for index 210 = ရှာ
Warning: properties incomplete for index 211 = ဖွေ
Warning: properties incomplete for index 212 = တွေ့
Warning: properties incomplete for index 214 = ပြ
Warning: properties incomplete for index 215 = ကြာ
Writing unicharset to file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset

mya.Myanmar_Text.exp0.txt mya.Myanmar_Text_Bold.exp0.txt mya.unicharset.txt

Shreeshrii avatar Mar 31 '17 10:03 Shreeshrii

@herzcthu @nengine @minthanthtoo

Please take a look at https://github.com/tesseract-ocr/langdata/blob/master/Myanmar.unicharset in light of the above warning messages. Do you notice any pattern for the errors?

Tesseract does train on syllables (for Indic languages) AFAIK. Please see https://github.com/tesseract-ocr/langdata/files/885327/mya.unicharset.txt generated from the two training files - all listed in the message above.

Shreeshrii avatar Mar 31 '17 11:03 Shreeshrii

@theraysmith do zwj and zwnj also have to be part of unicharset?

also see http://archive.mmgeeks.com/index.php?p=/discussion/379/zwnj-and-zwj

Shreeshrii avatar Mar 31 '17 11:03 Shreeshrii

https://github.com/khzaw/awesome-myanmar-unicode

amitdo avatar Mar 31 '17 11:03 amitdo

Syllabification, Normalization and Lexicographic Ordering of Myanmar Texts using Formal Approaches

http://ir.nagaokaut.ac.jp/dspace/bitstream/10649/729/1/k709.pdf

Shreeshrii avatar Mar 31 '17 11:03 Shreeshrii

I do not see a consistent pattern.

  1. Warning: properties incomplete for index 4 = ယ် . ယ် by itself does not have any meaning, but when it is combined with ဘ which becomes ဘယ် it makes sense.

  2. Warning: properties incomplete for index 16 = မှ . မှ by itself does make sense and has a meaning, but not so sure why it is giving a warning.

Myanmar.unicharset clearly does not include these syllables shown in the warnings, but just consonants, vowels, etc.
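One way to look for a pattern is to pull the warned entries out of the log and print their code points. The sketch below only assumes the `Warning: properties incomplete for index N = X` line format shown in the log; the function names are hypothetical and it is not part of Tesseract's tooling.

```python
import re

# Matches the warning lines printed by set_unicharset_properties in the log.
WARNING = re.compile(r"^Warning: properties incomplete for index (\d+) = (\S+)")


def incomplete_units(log_text):
    """Return (index, unit) pairs for every 'properties incomplete' warning."""
    units = []
    for line in log_text.splitlines():
        match = WARNING.match(line.strip())
        if match:
            units.append((int(match.group(1)), match.group(2)))
    return units


def describe(unit):
    """Render a unit as its code points, e.g. 'U+101A U+103A'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in unit)


def report(log_text):
    """One line per warned unit: index, code-point count, code points."""
    return [
        f"{index}\t{len(unit)}\t{describe(unit)}"
        for index, unit in incomplete_units(log_text)
    ]
```

If every reported unit has more than one code point, that would be consistent with the observation that the warnings concern combined clusters rather than the single consonants and signs listed in Myanmar.unicharset.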

Is Myanmar.unicharset supposed to include all syllable combinations? How does it work for Telugu, for example?

nengine avatar Mar 31 '17 13:03 nengine

I don't think it is supposed to include all syllable combinations in Myanmar.unicharset but it should have all vowels, consonants, vowel signs.

I see three ranges for Myanmar: the first seems to be in the unicharset, along with part of the second and none of the third.

Can you please check whether all of these are required?

http://www.alanwood.net/unicode/myanmar.html

http://www.alanwood.net/unicode/myanmar-extended-a.html

http://www.alanwood.net/unicode/myanmar-extended-b.html

Shreeshrii avatar Mar 31 '17 14:03 Shreeshrii

There are 8 major ethnic groups in Myanmar, so I believe extended A and B are added for that reason. So, for completeness I think it should be added, but you would rarely see them on the web. Unicode range 1000 - 104F is already good.
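To see how much of a given corpus actually falls in each range, the characters can be counted per Unicode block. This is a small sketch with hypothetical helper names; the block boundaries are the standard ones for Myanmar (U+1000 to U+109F), Myanmar Extended-A (U+AA60 to U+AA7F) and Myanmar Extended-B (U+A9E0 to U+A9FF).

```python
from collections import Counter

# Unicode block boundaries for the three Myanmar ranges.
MYANMAR_BLOCKS = {
    "Myanmar": (0x1000, 0x109F),
    "Myanmar Extended-A": (0xAA60, 0xAA7F),
    "Myanmar Extended-B": (0xA9E0, 0xA9FF),
}


def block_of(ch):
    """Return the Myanmar block name for a character, or None."""
    code_point = ord(ch)
    for name, (low, high) in MYANMAR_BLOCKS.items():
        if low <= code_point <= high:
            return name
    return None


def block_counts(text):
    """Count characters per Myanmar block, ignoring everything else."""
    return Counter(name for name in map(block_of, text) if name)
```

Running `block_counts` over a sample text or wordlist gives a quick indication of whether Extended-A or Extended-B characters occur in it at all.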

nengine avatar Mar 31 '17 17:03 nengine

you would rarely see them on the web.

What about in books / documents that need to be OCRed?

Shreeshrii avatar Mar 31 '17 17:03 Shreeshrii

Yes, extended A and B should also be added for completeness as I said, but as for training samples, they are almost nonexistent on the web.



nengine avatar Mar 31 '17 17:03 nengine

Please take a look at this reference: http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf Table 16-3. The text says "Characters occur in the relative order shown in Table 16-3" which I do not believe to be completely correct. Part of the problem is that a lot of the characters are not even in this table! Although it is possible to guess which group the extensions belong to, I'm not convinced I have it correct. I have some code that implements this table plus my guesses to add the extensions, but it isn't ready for committing to github just yet.

The problem is that I need to exclude the incorrectly formatted text (that uses the non-standard fonts), but be sure that no correctly formatted text is dropped.
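As a rough illustration of that kind of filter (not the cleanup code referred to here, which is more detailed), the sketch below drops lines that contain a dotted circle (U+25CC) or a word that begins with a combining mark, for example the vowel sign E (U+1031) typed before its consonant. Both rules are assumptions made for the example, the function names are hypothetical, and a heuristic this coarse could still drop some correct text.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"


def looks_malformed(line):
    """Heuristic check for text that is probably not well-formed Unicode.

    Flags a line if it contains a dotted circle, or if any
    whitespace-separated word starts with a combining mark
    (general category Mn, Mc or Me), i.e. a sign with no base letter.
    """
    if DOTTED_CIRCLE in line:
        return True
    for word in line.split():
        if unicodedata.category(word[0]).startswith("M"):
            return True
    return False


def keep_clean_lines(lines):
    """Keep only the lines that pass the heuristic."""
    return [line for line in lines if not looks_malformed(line)]
```

A rejected-lines log would make it easier to verify that no correctly formatted text is being dropped.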


-- Ray.

theraysmith avatar Apr 14 '17 23:04 theraysmith

I've checked the characters in the Myanmar.unicharset file. All characters seem correct.

herzcthu avatar Apr 27 '17 03:04 herzcthu

Please see https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-315133403

When I have committed the new corpus cleanup code, it would be useful to have any experts in any of the following scripts review the code and make comments: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer. There are script-specific cleanup rules in there. Since I plan to commit new copies of the training data (unicharsets, wordlists, training text etc) then at that point they will match

For instance, there is a big table in the unicode standard for Myanmar, ( http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover any of the extension Myanmar characters, and isn't explicit about whether the table represents a specific valid order or not. The existence of a lot of legacy Myanmar text on the web that is designed for non-compliant fonts doesn't help make it easier to determine whether the filter is correct.

Shreeshrii avatar Jul 14 '17 05:07 Shreeshrii

Please see code at: https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp


-- Ray.

theraysmith avatar Jul 14 '17 18:07 theraysmith

@herzcthu @nengine @minthanthtoo

Please test with the new traineddata in tessdata/best directory and provide feedback.

Shreeshrii avatar Aug 08 '17 01:08 Shreeshrii

I'm testing the new traineddata. It has improved a lot and is almost 98% correct. I will test in more detail and provide detailed feedback later.

herzcthu avatar Aug 09 '17 17:08 herzcthu