Would like to help for Burmese/Myanmar language training?
Hello, I would like to help. I've already cloned all the repositories. How do I start?
What issue is there with Burmese/Myanmar language?
There are two kinds of Myanmar fonts: those that follow standard Unicode and those that use a non-standard encoding. When I check the langdata files for Burmese, most words are incorrect. My guess is that the data was generated from a mix of non-standard and standard Unicode content. When I scan an image of Burmese text set in the Padauk font, the output is not readable. I would like to know how you generated the Burmese training files. Where did the original data come from? I can check whether it is standard Unicode content or not.
I think the real issue is not only standard versus non-standard Unicode, but also the way the data was extracted from the source. The source data needs to be segmented correctly to yield single words. Myanmar writers rarely put a space between words, so if you treat everything between two spaces as one word, two or more words are mistaken for a single word. Most of the word lists here, especially the bigrams, hold Myanmar phrases that are far too long. That makes the word lists unusable and the results of applying them unpredictable. So I think you need to extract words from the source using a dictionary-lookup approach. Of course, you would need to build your own word list manually or use one made by others. Myanmar is also a syllable-based language: one or more Myanmar letters combine to form a syllable, and one or more syllables join to form a word. It is therefore advisable to detect syllables first, which should considerably improve dictionary lookup performance.
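A minimal sketch of the syllable-then-dictionary approach described above, in Python. The break rule used here (a consonant starts a new syllable unless it is stacked under U+1039 or closed by U+103A) is a common simplification and an assumption of this sketch. `split_syllables`, `segment_words` and the two-entry lexicon are hypothetical names and data, not part of Tesseract or langdata.

```python
import re

# A consonant (U+1000-U+1021) starts a new syllable unless it is stacked
# (preceded by virama U+1039) or closes the previous syllable (followed by
# asat U+103A or virama U+1039, optionally after dot-below U+1037).
# Simplified assumption: independent vowels, digits and punctuation simply
# stay attached to the preceding syllable.
SYLLABLE_START = re.compile(r"(?<!\u1039)[\u1000-\u1021](?!\u1037?[\u103A\u1039])")


def split_syllables(text):
    """Split text at every position where a new syllable starts."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends) if a < b]


def segment_words(text, lexicon, max_syllables=6):
    """Greedy longest-match lookup over syllables.

    A syllable that starts no known word is emitted on its own.
    """
    syls = split_syllables(text)
    words, i = [], 0
    while i < len(syls):
        for n in range(min(max_syllables, len(syls) - i), 0, -1):
            candidate = "".join(syls[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical lexicon with two entries, written as escapes for clarity.
lexicon = {"\u1019\u103C\u1014\u103A\u1019\u102C", "\u1005\u102C"}
print(segment_words("\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C", lexicon))
```

Working on syllables keeps the number of candidate lookups per position small (at most `max_syllables`), which is the performance gain the comment refers to. Greedy longest match is only one possible strategy and can pick the wrong split when words overlap.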
@herzcthu @minthanthtoo
Please add some good sources of standard unicode fonts and sample texts and word frequency lists to https://github.com/tesseract-ocr/langdata/issues/46
https://my.wikipedia.org/ All content on Wikipedia is in standard Unicode.
@zdenop The issue is with the training data itself. The person who prepared the data does not know the Myanmar language. Most of the training data has misspellings and is mixed with a hacked version of Myanmar Unicode, as @herzcthu said. Imagine rice and spaghetti mixed in a bowl. It is also not segmented properly, as @minthanthtoo pointed out. Any suggestions on how to prepare training data?
Please see Ray's comment at https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
about how the training data is being built for the 4.0 LSTM training. I don't think they are using the training_text file in langdata.
Thanks @Shreeshrii. Would ./tesstrain.sh automatically create .tif/box pairs from the langdata directory for 4.0 LSTM training?
Yes. tesstrain.sh creates tif/box pairs that can be used for LSTM training. Please see the wiki pages for details. You need a large amount of training data for good training. See Ray's comments about the LSTM training process.
https://github.com/tesseract-ocr/tesseract/issues/654
will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better. Myanmar in particular needs improvement, as the www data is littered with dotted circles, and the unicode book does not adequately describe the syntax for a well-formed grapheme in Myanmar (or any other language for that matter).
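The dotted circle mentioned here is U+25CC, the placeholder that renderers show when a combining mark has no valid base, and which sometimes ends up copied into web text. A rough sketch of a line filter for it follows. `split_on_dotted_circle` is a hypothetical helper written for illustration, not the cleanup code referred to in the comment.

```python
DOTTED_CIRCLE = "\u25CC"


def split_on_dotted_circle(lines):
    """Separate corpus lines into (kept, dropped) by presence of U+25CC."""
    kept, dropped = [], []
    for line in lines:
        if DOTTED_CIRCLE in line:
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped


sample = ["\u1019\u103C\u1014\u103A\u1019\u102C", "\u25CC\u103C broken line"]
kept, dropped = split_on_dotted_circle(sample)
print(len(kept), "kept,", len(dropped), "dropped")
```

Keeping the dropped lines around, instead of discarding them, makes it easier for a native speaker to review what the filter removed.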
copied from https://github.com/tesseract-ocr/langdata/issues/46
@herzcthu commented
Myanmar wordlists https://github.com/kanaung/wordlists
https://github.com/kanyawtech/myanmar-karen-word-lists/blob/master/burmese-word-list.txt?raw=true
Is this a good wordlist in standard Unicode for Myanmar?
These are the most common words in Myanmar, but the list is not complete. The definition of a word is itself tricky in the Myanmar language because syllables can be combined in many ways to form a word. I am not sure how Tesseract training works, but it may be better to train on syllables rather than on words (clusters of syllables). That is, each syllable would first be detected and then classified. Classifying entire words may be too difficult, though I may not be fully aware of Tesseract's capabilities.
Manually Constructed Context-Free Grammar For Myanmar Syllable Structure http://www.aclweb.org/anthology/E12-3004
Representing Myanmar in Unicode Details and Examples http://unicode.org/notes/tn11/myanmar_uni-v2.pdf http://www.tuninst.net/LINGUISTICS/myanmar-unicode/myanmar-unicode.htm
Creating and Supporting OpenType Fonts for Myanmar Script https://www.microsoft.com/typography/OpenTypeDev/myanmar/intro.htm
Myanmar script notes http://rishida.net/scripts/myanmar/#shaping
https://www.researchgate.net/publication/253745697_A_Rule-based_Syllable_Segmentation_of_Myanmar_Text
@theraysmith
I used a few words from the Burmese wordlist and the landing page of Wikipedia as a small training sample to test Myanmar. Both are supposed to be in standard Unicode for Myanmar.
The training text and generated unicharset are attached. I got a number of errors while building the unicharset. Maybe the Myanmar.unicharset in langdata needs to be updated?
=== Phase UP: Generating unicharset and unichar properties files ===
[Fri Mar 31 16:07:02 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.OzCvDLSWBp/mya/ /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Wrote unicharset file /tmp/tmp.OzCvDLSWBp/mya//unicharset.
[Fri Mar 31 16:07:05 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -O /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -X /tmp/tmp.OzCvDLSWBp/mya/mya.xheights --script_dir=../langdata
Loaded unicharset of size 217 from file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset
Setting unichar properties
Other case È of è is not in unicharset
Other case Ë of ë is not in unicharset
Warning: properties incomplete for index 4 = ယ်
Warning: properties incomplete for index 5 = လ်
Warning: properties incomplete for index 8 = မ်
Warning: properties incomplete for index 10 = င်
Warning: properties incomplete for index 16 = မှ
Warning: properties incomplete for index 22 = ရှ
Warning: properties incomplete for index 28 = ဖွဲ့
Warning: properties incomplete for index 30 = ည်
Warning: properties incomplete for index 36 = ပ်
Warning: properties incomplete for index 37 = ဖြ
Warning: properties incomplete for index 38 = င့်
Warning: properties incomplete for index 41 = က်
Warning: properties incomplete for index 42 = နှာ
Warning: properties incomplete for index 43 = ည်း
Warning: properties incomplete for index 44 = တ်
Warning: properties incomplete for index 45 = မှု
Warning: properties incomplete for index 47 = မ်း
Warning: properties incomplete for index 50 = ခြ
Warning: properties incomplete for index 51 = င်း
Warning: properties incomplete for index 52 = ကြော
Warning: properties incomplete for index 53 = နှို
Warning: properties incomplete for index 54 = ချွ
Warning: properties incomplete for index 63 = ပွဲ
Warning: properties incomplete for index 64 = တွေ
Warning: properties incomplete for index 65 = မှာ
Warning: properties incomplete for index 66 = ဆွေး
Warning: properties incomplete for index 67 = နွေး
Warning: properties incomplete for index 73 = ထွေ
Warning: properties incomplete for index 78 = မြ
Warning: properties incomplete for index 79 = စ်
Warning: properties incomplete for index 80 = မြို့
Warning: properties incomplete for index 83 = န်
Warning: properties incomplete for index 86 = ကွ
Warning: properties incomplete for index 89 = သွ
Warning: properties incomplete for index 92 = ဖ်
Warning: properties incomplete for index 96 = ခြေ
Warning: properties incomplete for index 100 = မျှ
Warning: properties incomplete for index 101 = ဂြို
Warning: properties incomplete for index 102 = ဟ်
Warning: properties incomplete for index 103 = တွ
Warning: properties incomplete for index 110 = ရှု
Warning: properties incomplete for index 119 = ညွှ
Warning: properties incomplete for index 120 = န်း
Warning: properties incomplete for index 123 = ကြ
Warning: properties incomplete for index 124 = ည့်
Warning: properties incomplete for index 125 = နှ
Warning: properties incomplete for index 126 = ထွ
Warning: properties incomplete for index 130 = ရှိ
Warning: properties incomplete for index 132 = ကြို
Warning: properties incomplete for index 140 = ဉ်
Warning: properties incomplete for index 150 = လှ
Warning: properties incomplete for index 151 = သွား
Warning: properties incomplete for index 153 = ထွာ
Warning: properties incomplete for index 154 = ထွား
Warning: properties incomplete for index 157 = ဖွံ့
Warning: properties incomplete for index 158 = မွ
Warning: properties incomplete for index 159 = လျော်
Warning: properties incomplete for index 162 = ပြော
Warning: properties incomplete for index 163 = ထွေး
Warning: properties incomplete for index 164 = ယှ
Warning: properties incomplete for index 168 = ဘွား
Warning: properties incomplete for index 179 = လွ
Warning: properties incomplete for index 182 = န့်
Warning: properties incomplete for index 189 = စွဲ
Warning: properties incomplete for index 192 = ပြီး
Warning: properties incomplete for index 197 = မြေ
Warning: properties incomplete for index 202 = ကွာ
Warning: properties incomplete for index 210 = ရှာ
Warning: properties incomplete for index 211 = ဖွေ
Warning: properties incomplete for index 212 = တွေ့
Warning: properties incomplete for index 214 = ပြ
Warning: properties incomplete for index 215 = ကြာ
Writing unicharset to file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset
mya.Myanmar_Text.exp0.txt mya.Myanmar_Text_Bold.exp0.txt mya.unicharset.txt
@herzcthu @nengine @minthanthtoo
Please take a look at https://github.com/tesseract-ocr/langdata/blob/master/Myanmar.unicharset in light of the above warning messages. Do you notice any pattern for the errors?
Tesseract does train on syllables (for Indic languages) AFAIK. Please see https://github.com/tesseract-ocr/langdata/files/885327/mya.unicharset.txt generated from the two training files - all listed in the message above.
@theraysmith do zwj and zwnj also have to be part of unicharset?
also see http://archive.mmgeeks.com/index.php?p=/discussion/379/zwnj-and-zwj
https://github.com/khzaw/awesome-myanmar-unicode
Syllabification, Normalization and Lexicographic Ordering of Myanmar Texts using Formal Approaches
http://ir.nagaokaut.ac.jp/dspace/bitstream/10649/729/1/k709.pdf
I do not see a consistent pattern.
- Warning: properties incomplete for index 4 = ယ် . ယ် by itself does not have any meaning, but when it is combined with ဘ it becomes ဘယ်, which does.
- Warning: properties incomplete for index 16 = မှ . မှ by itself does have a meaning, so I am not sure why it produces a warning.
Myanmar.unicharset clearly does not include these syllables shown in the warnings, but just consonants, vowels, etc.
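One way to test that observation against the log is to parse the warning lines and check whether every warned entry is a multi-code-point cluster, and which signs follow the first letter. The sketch below does that in Python. `summarize_warnings` is a hypothetical helper, and the two-line sample stands in for the full log, which would have to be pasted or read from a file.

```python
import re

# Matches lines such as "Warning: properties incomplete for index 4 = <unichar>".
WARNING = re.compile(r"properties incomplete for index (\d+) = (\S+)")


def summarize_warnings(log_text):
    """Count warnings and list the code points that follow the first letter."""
    entries = WARNING.findall(log_text)
    trailing = {f"U+{ord(ch):04X}" for _, unichar in entries for ch in unichar[1:]}
    return {
        "warnings": len(entries),
        "multi_codepoint": sum(1 for _, unichar in entries if len(unichar) > 1),
        "trailing_codepoints": sorted(trailing),
    }


# Two lines from the log above, written as escapes (indexes 4 and 16).
sample_log = (
    "Warning: properties incomplete for index 4 = \u101A\u103A\n"
    "Warning: properties incomplete for index 16 = \u1019\u103E\n"
)
print(summarize_warnings(sample_log))
```

If `warnings` equals `multi_codepoint` on the full log, the warnings are limited to clusters rather than single letters, which would fit the point that Myanmar.unicharset lists only individual consonants, vowels and signs.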
Is it supposed to include all syllable combinations in Myanmar.unicharset? How does it work for Telugu, for example?
I don't think it is supposed to include all syllable combinations in Myanmar.unicharset but it should have all vowels, consonants, vowel signs.
I see three ranges for Myanmar: the first seems to be in the unicharset, part of the second, and none of the third.
Can you please check whether all of these are required?
http://www.alanwood.net/unicode/myanmar.html
http://www.alanwood.net/unicode/myanmar-extended-a.html
http://www.alanwood.net/unicode/myanmar-extended-b.html
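A quick way to answer the coverage question for any unicharset is to count which code points from each of the three blocks appear in it. The sketch below assumes the usual unicharset layout (first line is the entry count, and each later line starts with the unichar followed by its properties). `block_coverage` is a hypothetical helper, and the sample lines are abbreviated stand-ins, not real file contents.

```python
# Unicode block boundaries for the three pages linked above.
BLOCKS = {
    "Myanmar": (0x1000, 0x109F),
    "Myanmar Extended-A": (0xAA60, 0xAA7F),
    "Myanmar Extended-B": (0xA9E0, 0xA9FF),
}


def block_coverage(unicharset_lines):
    """Return, per block, how many distinct code points occur in the unichars."""
    seen = set()
    for line in unicharset_lines[1:]:  # skip the leading entry count
        fields = line.split()
        if fields:
            seen.update(ord(ch) for ch in fields[0])
    return {
        name: sum(1 for cp in seen if lo <= cp <= hi)
        for name, (lo, hi) in BLOCKS.items()
    }


# Abbreviated stand-in lines; real entries carry more property columns.
sample_lines = ["4", "NULL 0", "\u1000 1", "\u1000\u103A 1", "\uAA60 1"]
print(block_coverage(sample_lines))
```

For a real file, the lines could be obtained with `open(path, encoding="utf-8").read().splitlines()`, where `path` points at the unicharset to inspect.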
There are 8 major ethnic groups in Myanmar, so I believe Extended-A and Extended-B were added for that reason. For completeness I think they should be added, but you would rarely see them on the web. Unicode range 1000 - 104F is already good.
you would rarely see them on the web.
What about in books / documents that need to be OCRed?
Yes, Extended-A and Extended-B should also be added for completeness, as I said, but training samples for them are almost nonexistent on the web.
Please take a look at this reference: http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf Table 16-3. The text says "Characters occur in the relative order shown in Table 16-3" which I do not believe to be completely correct. Part of the problem is that a lot of the characters are not even in this table! Although it is possible to guess which group the extensions belong to, I'm not convinced I have it correct. I have some code that implements this table plus my guesses to add the extensions, but it isn't ready for committing to github just yet.
The problem is that I need to exclude the incorrectly formatted text (that uses the non-standard fonts), but be sure that no correctly formatted text is dropped.
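To make the trade-off concrete, here is a rough Python sketch of an order-based filter. The rank table is a simplified guess at the relative order of dependent signs (medials, then vowel sign E, upper vowels, lower vowels, AA vowels, anusvara, dot below or asat, visarga). It is an assumption for illustration, not a transcription of Table 16-3 and not the cleanup code described in this thread. `is_plausibly_standard` is a hypothetical name.

```python
# Simplified rank of each dependent sign (assumption, see lead-in).
RANK = {}
for rank, chars in enumerate([
    "\u103B", "\u103C", "\u103D", "\u103E",  # medials ya, ra, wa, ha
    "\u1031",                                # vowel sign E
    "\u102D\u102E\u1032",                    # upper vowels
    "\u102F\u1030",                          # lower vowels
    "\u102B\u102C",                          # vowel signs tall AA / AA
    "\u1036",                                # anusvara
    "\u1037\u103A",                          # dot below and asat share a rank
    "\u1038",                                # visarga
]):
    for ch in chars:
        RANK[ch] = rank


def is_plausibly_standard(text):
    """Reject text where a sign has no base letter or signs go backwards."""
    last_rank = None  # None: no base letter seen for the current cluster
    for ch in text:
        if ch in RANK:
            if last_rank is None or RANK[ch] < last_rank:
                return False
            last_rank = RANK[ch]
        elif "\u1000" <= ch <= "\u102A":
            last_rank = -1  # consonant or independent vowel starts a cluster
        elif ch == "\u1039":
            continue  # stacking virama is ignored in this sketch
        else:
            last_rank = None  # space, punctuation, other scripts
    return True


print(is_plausibly_standard("\u1000\u103C\u1031\u102C"))  # letter, medial ra, E, AA
print(is_plausibly_standard("\u1031\u1000"))              # E stored before the letter
```

Text stored in visual order, with vowel sign E ahead of its consonant at the start of a word, is rejected. The sketch also shows the risk raised above: legitimate but rare sequences in which a vowel sign or medial follows asat are rejected as well, so a real filter needs explicit exceptions reviewed by native speakers.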
-- Ray.
I've checked characters in Myanmar.unicharset file. All characters seem correct.
Please see https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-315133403
When I have committed the new corpus cleanup code, it would be useful to have any experts in any of the following scripts review the code and make comments: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer. There are script-specific cleanup rules in there. Since I plan to commit new copies of the training data (unicharsets, wordlists, training text etc) then at that point they will match
For instance, there is a big table in the unicode standard for Myanmar, ( http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover any of the extension Myanmar characters, and isn't explicit about whether the table represents a specific valid order or not. The existence of a lot of legacy Myanmar text on the web that is designed for non-compliant fonts doesn't help make it easier to determine whether the filter is correct.
Please see code at: https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp
-- Ray.
@herzcthu @nengine @minthanthtoo
Please test with the new traineddata in tessdata/best directory and provide feedback.
I'm testing the new traineddata. It has improved a lot: almost 98% correct. I will test further and provide detailed feedback later.
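The comment does not say how the 98% figure was measured. One common way to put a number on OCR quality is character accuracy based on edit distance between a ground-truth transcription and the OCR output. The sketch below is a plain Python implementation under that assumption, with hypothetical function names.

```python
def levenshtein(a, b):
    """Edit distance between two strings, counted per code point."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr):
    """1 - (edit distance / length of truth), clamped at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(truth, ocr) / len(truth))


print(char_accuracy("abcd", "abxd"))
```

Because the distance is counted per code point, a single wrong or missing sign inside a Myanmar syllable counts as one error. Counting per syllable instead would give a stricter figure.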