Require Feedback Regarding "DV" or Dhivehi language. (Low translation accuracy also needs proper font)

Open Xayaan opened this issue 8 years ago • 33 comments

~~Tessdata had the Dhivehi language, but it's missing now.~~ Edit: I've tested it; thanks to @amitdo, I got links to the docs I needed. However, the accuracy of the translation is far too low, as I've read online here. I'm currently looking for helpers who can join me in translating the language up to 100%. I will also be getting in touch with the Dhivehi Academy regarding this.

Xayaan avatar Dec 28 '16 11:12 Xayaan

https://github.com/tesseract-ocr/tesseract/blob/9c7e99b041/training/language-specific.sh#L32

amitdo avatar Dec 28 '16 12:12 amitdo

div.traineddata was added to the repo https://github.com/tesseract-ocr/tessdata/blob/master/best/div.traineddata

amitdo avatar Aug 01 '17 19:08 amitdo

@Xayaan Have you tried the 'best' version? Any feedback?

Shreeshrii avatar Aug 18 '17 11:08 Shreeshrii

@Shreeshrii Yes, I have tried it, and the results are pretty bad. It needs to be trained intensively.

Xayaan avatar Aug 18 '17 15:08 Xayaan

Do not close the issue, then. You can change the title to say it is feedback regarding Dhivehi, and then add some notes about what is wrong so that it can be improved.

Also see https://github.com/tesseract-ocr/langdata/issues/52

Shreeshrii avatar Aug 18 '17 15:08 Shreeshrii

Done, thank you! 👍

Xayaan avatar Aug 18 '17 16:08 Xayaan

Are these fonts suitable for Dhivehi?

http://www.hassanhameed.com/?page_id=152

http://www.wazu.jp/gallery/Fonts_Thaana.html

Shreeshrii avatar Aug 18 '17 16:08 Shreeshrii

Yes, but I'd trust these : https://dhivehi.mv/fonts/

Xayaan avatar Aug 18 '17 16:08 Xayaan

https://dv.wikipedia.org/wiki/%DE%89%DE%A6%DE%87%DE%A8_%DE%9E%DE%A6%DE%8A%DE%B0%DE%99%DE%A7

@theraysmith This script looks similar to Arabic with accents. Have you had success in adding the accented version for next training?

Shreeshrii avatar Aug 18 '17 16:08 Shreeshrii

There is also Thaana traineddata

amitdo avatar Aug 18 '17 17:08 amitdo

@Xayaan please check with Thaana traineddata also.

If possible, provide an image and its corresponding ground truth file for testing.

Shreeshrii avatar Aug 20 '17 14:08 Shreeshrii

Yes, it is. It is similar to Sanskrit and Arabic, and it is an RTL language.

I checked with the Thaana traineddata; its accuracy is also low at the moment.

Xayaan avatar Aug 24 '17 12:08 Xayaan

Any pointers to the training data used for Thaana?

Sofwath avatar Oct 02 '17 11:10 Sofwath

The langdata repo has not been updated for 4.0x

https://github.com/tesseract-ocr/langdata/tree/master/tha has the old training files

ShreeDevi


Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

Shreeshrii avatar Oct 02 '17 11:10 Shreeshrii

It looks like that is for Thai. How about Thaana (div)?

Sofwath avatar Oct 02 '17 11:10 Sofwath

Sorry about that.

Looks like https://github.com/tesseract-ocr/langdata/tree/master/div does not have all the required files.

If it is similar to Arabic, you can copy langdata files from there and modify for Thaana.

http://crubadan.org/languages/dv could be a source for wordlists, training text.

Shreeshrii avatar Oct 02 '17 11:10 Shreeshrii

Link to Thaana (div) monogram and bigram file

https://github.com/Sofwath/thaanaOCR/tree/master/data

This is a Thaana text corpus

https://www.dropbox.com/s/04ox44rfuqm5xhw/dv_MV_1.txt?dl=0

Is there anything else we need for a basic training?

Sofwath avatar Oct 02 '17 11:10 Sofwath

You can download

https://github.com/tesseract-ocr/tessdata_best/blob/master/div.traineddata and https://github.com/tesseract-ocr/tessdata_best/blob/master/Thaana.traineddata

Then unpack the traineddata to get the files.

root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best# combine_tessdata -u div.traineddata div.
Extracting tessdata components from div.traineddata
Wrote div.lstm
Wrote div.lstm-punc-dawg
Wrote div.lstm-word-dawg
Wrote div.lstm-number-dawg
Wrote div.lstm-unicharset
Wrote div.lstm-recoder
Wrote div.version
Version string:4.00.00alpha:div:synth20170629
17:lstm:size=3218139, offset=192
18:lstm-punc-dawg:size=4506, offset=3218331
19:lstm-word-dawg:size=1342450, offset=3222837
20:lstm-number-dawg:size=426, offset=4565287
21:lstm-unicharset:size=7276, offset=4565713
22:lstm-recoder:size=1093, offset=4572989
23:version:size=30, offset=4574082

root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best# combine_tessdata -u Thaana.traineddata Thaana.
Extracting tessdata components from Thaana.traineddata
Wrote Thaana.lstm
Wrote Thaana.lstm-punc-dawg
Wrote Thaana.lstm-word-dawg
Wrote Thaana.lstm-number-dawg
Wrote Thaana.lstm-unicharset
Wrote Thaana.lstm-recoder
Wrote Thaana.version
Version string:4.00.00alpha:Thaana:synth20170629
17:lstm:size=7723707, offset=192
18:lstm-punc-dawg:size=5674, offset=7723899
19:lstm-word-dawg:size=5036906, offset=7729573
20:lstm-number-dawg:size=4762, offset=12766479
21:lstm-unicharset:size=10741, offset=12771241
22:lstm-recoder:size=1633, offset=12781982
23:version:size=33, offset=12783615

You can then recover the original wordlists by using dawg2wordlist, but the actual training_text will not be there.

Shreeshrii avatar Oct 02 '17 11:10 Shreeshrii

Great. Thanks. Will work on that.

Sofwath avatar Oct 02 '17 11:10 Sofwath

The dawg2wordlist syntax is as follows:

$ dawg2wordlist Thaana.lstm-unicharset Thaana.lstm-word-dawg Thaana.wordlist
Loading word list from Thaana.lstm-word-dawg
Reading squished dawg
Word list loaded.

Do the same for the punc and number dawgs.

You can review these files for accuracy.
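As a minimal sketch of the "same for punc and numbers" step, the loop below runs dawg2wordlist over all three dawgs that combine_tessdata -u extracted. The function name dump_dawgs and the output names ending in .punclist and .numberlist are hypothetical choices for illustration; only the argument order (unicharset, dawg, output file) is taken from the command shown above.

```shell
# Hypothetical helper: convert each extracted dawg for a language prefix
# (e.g. "Thaana" or "div") back into a plain-text list.
dump_dawgs() {
  lang="$1"
  for kind in word punc number; do
    dawg="${lang}.lstm-${kind}-dawg"
    # Skip components that were not extracted into the current directory.
    [ -f "$dawg" ] || { echo "skipping missing $dawg" >&2; continue; }
    # Argument order as in the thread: unicharset, dawg, output file.
    # Output names like Thaana.punclist are an assumption, not a convention.
    dawg2wordlist "${lang}.lstm-unicharset" "$dawg" "${lang}.${kind}list"
  done
}

dump_dawgs Thaana
```

Run it from the directory holding the unpacked components. When a dawg is missing, the function only prints a notice and moves on, so it is safe to call for either language prefix.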

I don't think tesseract uses unigrams and bigrams for training, though they may be used internally at Google to generate a representative training text.

Shreeshrii avatar Oct 02 '17 11:10 Shreeshrii

FYI Thaana files will have both English and Divehi. div files will have only Divehi.

Shreeshrii avatar Oct 02 '17 11:10 Shreeshrii

Question: do I still need to create the box files even if we are using the LSTM method?

Sofwath avatar Oct 03 '17 07:10 Sofwath

You have to use the tesstrain.sh script. Also see tesstrain_utils.sh and language-specific.sh in the training directory.

These create the box/tiff files from the training text and the specified fonts. The box/tiff files are used for creating the lstmf files and are kept only in the tmp directory.

Try the training tutorial for English and look at the log file and the tmp directory.

Shreeshrii avatar Oct 03 '17 08:10 Shreeshrii

Any help? I am getting this error

sudo /Users/sofwath/tesseract/training/tesstrain.sh --fonts_dir /Users/sofwath/dev/MLAI/tesseract/font/ --lang div --linedata_only --noextract_font_properties --langdata_dir langdata --tessdata_dir tessdata/ --output_dir divtrain/

=== Starting training for language 'div'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
[Wed Oct 4 10:49:19 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 10:49:20 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable

Sofwath avatar Oct 04 '17 05:10 Sofwath

You are getting this error:

mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix

See tesstrain_utils.sh, lines 29 and 172.

Training uses the /tmp directory for creating files.

Shreeshrii avatar Oct 04 '17 08:10 Shreeshrii

In tesstrain_utils.sh, I changed

export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)

to

export FONT_CONFIG_CACHE=$(mktemp -d -tmpdir font_tmp.XXXXXXXXXX)

The first error was fixed, but I still get:

=== Starting training for language 'div'
/Users/sofwath/tesseract/training/tesstrain_utils.sh: line 197: ${sample_path}: ambiguous redirect
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx
'--text' option is missing!

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
'--text' option is missing!
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable

Sofwath avatar Oct 04 '17 09:10 Sofwath

You have to look at your paths

'--text' option is missing!

Shreeshrii avatar Oct 04 '17 10:10 Shreeshrii

I followed this example

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

Sofwath avatar Oct 04 '17 10:10 Sofwath

There is no --text command-line option for tesstrain.sh.

Sofwath avatar Oct 04 '17 10:10 Sofwath

/Users/sofwath/tesseract/training/tesstrain_utils.sh: line 197: ${sample_path}: ambiguous redirect

If you are changing the bash script, you have to make sure it is done correctly. Please look at the error messages you get.
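One possible reading of that log, offered as an assumption rather than a confirmed diagnosis: the mktemp on macOS parses -tmpdir as -t with the prefix mpdir, then treats font_tmp.XXXXXXXXXX as a second template, so it prints two paths. The captured value then contains whitespace (mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx), which would explain both the ambiguous redirect and the '--text' option is missing! message. A minimal sketch of a replacement that drops --tmpdir and passes a full template path instead, a form that both GNU and BSD mktemp accept:

```shell
# Build the template inside the system temp directory ourselves instead of
# relying on GNU's --tmpdir flag; fall back to /tmp when TMPDIR is unset.
FONT_CONFIG_CACHE="$(mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX")"
export FONT_CONFIG_CACHE

# Print the single directory that was created.
echo "$FONT_CONFIG_CACHE"
```

The same pattern would apply to any other mktemp call in the script that uses --tmpdir. Quoting the variable wherever it is expanded also guards against temp paths that contain spaces.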

Shreeshrii avatar Oct 04 '17 10:10 Shreeshrii


=== Starting training for language 'div'
[Thu Oct 5 19:39:30 DST 2017] /usr/local/bin/text2image --fonts_dir=/mnt/c/Windows/Fonts --font=MV Typewriter --outputbase=/tmp/font_tmp.v2PwMI2E8F/sample_text.txt --text=/tmp/font_tmp.v2PwMI2E8F/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.v2PwMI2E8F
Rendered page 0 to file /tmp/font_tmp.v2PwMI2E8F/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Thu Oct 5 19:40:32 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.v2PwMI2E8F --fonts_dir=/mnt/c/Windows/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=../langdata/div/div.training_text
Rendered page 0 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif
Rendered page 1 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif
Rendered page 2 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Thu Oct 5 19:40:42 DST 2017] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.jFPtcB8yoM/div/div.unicharset --norm_mode 2 /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.box
Extracting unicharset from box file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.box
Word started with a combiner:0x7b0
Normalization failed for string 'ްނ'
Word started with a combiner:0x7aa
Normalization failed for string 'ުށ'
Word started with a combiner:0x7ac
Normalization failed for string 'ެފ'
Word started with a combiner:0x7b0
Normalization failed for string 'ްށ'
Word started with a combiner:0x7ae
Normalization failed for string 'ޮކ'
Word started with a combiner:0x7b0

I was able to run the program. But there are errors. See attached log file.

tesstrain.log.txt
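The code points in the log (0x7aa, 0x7ac, 0x7ae, 0x7b0) are Thaana vowel signs (fili), which are combining marks and normally follow a consonant. To rule out the input text as the source, a rough sketch like the one below lists words that already begin with a fili. The function names are hypothetical, and this only inspects a text file; it does not explain how the box file entries came to be ordered that way.

```shell
# Succeed if the first character of $1 is a Thaana fili, U+07A6..U+07B0,
# which UTF-8 encodes as the byte pairs DE A6 .. DE B0.
starts_with_fili() {
  hex="$(printf '%s' "$1" | od -An -tx1 -N2 | tr -d '[:space:]')"
  case "$hex" in
    dea[6-9a-f]|deb0) return 0 ;;
    *) return 1 ;;
  esac
}

# Print every whitespace-separated word of a UTF-8 text file ($1) that
# starts with a fili.
list_fili_initial_words() {
  tr -s '[:space:]' '\n' < "$1" | while IFS= read -r word; do
    if starts_with_fili "$word"; then
      printf '%s\n' "$word"
    fi
  done
}
```

For example, list_fili_initial_words ../langdata/div/div.training_text would check the training text used in the run above. If it prints nothing, the reordering more likely happens later, during rendering or box generation.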

Shreeshrii avatar Oct 05 '17 14:10 Shreeshrii

I've been trying on macOS, but the bash scripts give too many errors. I'll try to run the process on Linux.

Sofwath avatar Oct 06 '17 05:10 Sofwath

@Sofwath Is the Dhivehi training data usable now, or has it been abandoned? I have not seen any updates on this since 2017.

nashrafeeg avatar Aug 19 '22 00:08 nashrafeeg