tessdata
Require Feedback Regarding "DV" or Dhivehi language. (Low translation accuracy also needs proper font)
~~Tessdata had Dhivehi language but it's missing now.~~ Edit: I've tested it; thanks to @amitdo I got links to the docs I needed. However, the accuracy of the translation is far too low, as I've read online. I'm currently looking for helpers who can join me in translating the language up to 100%, and I will also be getting in touch with the Dhivehi Academy regarding this.
https://github.com/tesseract-ocr/tesseract/blob/9c7e99b041/training/language-specific.sh#L32
div.traineddata was added to the repo https://github.com/tesseract-ocr/tessdata/blob/master/best/div.traineddata
@Xayaan Have you tried the 'best' version? Any feedback?
@Shreeshrii Yes, I have tried it and the results are pretty bad. It needs to be trained intensively.
Do not close the issue then, you can change title to say feedback regarding Dhivehi and then add some notes regarding what is wrong so that it can be improved.
Also see https://github.com/tesseract-ocr/langdata/issues/52
Done, thank you! 👍
Are these fonts suitable for Dhivehi?
http://www.hassanhameed.com/?page_id=152
http://www.wazu.jp/gallery/Fonts_Thaana.html
Yes, but I'd trust these: https://dhivehi.mv/fonts/
https://dv.wikipedia.org/wiki/%DE%89%DE%A6%DE%87%DE%A8_%DE%9E%DE%A6%DE%8A%DE%B0%DE%99%DE%A7
@theraysmith This script looks similar to Arabic with accents. Have you had success in adding the accented version for next training?
There is also Thaana traineddata
@Xayaan please check with Thaana traineddata also.
If possible, provide an image and its corresponding ground truth file for testing.
Yes, it is. It is similar to Sanskrit and Arabic, and it is an RTL language.
I checked with the Thaana traineddata; it is not very accurate at the moment.
Any pointers to the training data used for Thaana?
The langdata repo has not been updated for 4.0x
https://github.com/tesseract-ocr/langdata/tree/master/tha has the old training files
ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Mon, Oct 2, 2017 at 4:33 PM, Sofwath [email protected] wrote:
any pointers to the training data used for Thaana?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tessdata/issues/43#issuecomment-333503606, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o7xHoqhI_E9OqhBtcoCFvEw1RJj2ks5soML3gaJpZM4LW3EA .
It looks like that is for Thai. How about Thaana (div)?
Sorry about that.
Looks like https://github.com/tesseract-ocr/langdata/tree/master/div does not have all the required files.
If it is similar to Arabic, you can copy langdata files from there and modify for Thaana.
http://crubadan.org/languages/dv could be a source for wordlists, training text.
Link to Thaana (div) monogram and bigram file
https://github.com/Sofwath/thaanaOCR/tree/master/data
This is a Thaana text corpus
https://www.dropbox.com/s/04ox44rfuqm5xhw/dv_MV_1.txt?dl=0
Is there anything else we need for basic training?
You can download
https://github.com/tesseract-ocr/tessdata_best/blob/master/div.traineddata and https://github.com/tesseract-ocr/tessdata_best/blob/master/Thaana.traineddata
Then unpack the traineddata to get the files.
root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best# combine_tessdata -u div.traineddata div.
Extracting tessdata components from div.traineddata
Wrote div.lstm
Wrote div.lstm-punc-dawg
Wrote div.lstm-word-dawg
Wrote div.lstm-number-dawg
Wrote div.lstm-unicharset
Wrote div.lstm-recoder
Wrote div.version
Version string:4.00.00alpha:div:synth20170629
17:lstm:size=3218139, offset=192
18:lstm-punc-dawg:size=4506, offset=3218331
19:lstm-word-dawg:size=1342450, offset=3222837
20:lstm-number-dawg:size=426, offset=4565287
21:lstm-unicharset:size=7276, offset=4565713
22:lstm-recoder:size=1093, offset=4572989
23:version:size=30, offset=4574082

root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best# combine_tessdata -u Thaana.traineddata Thaana.
Extracting tessdata components from Thaana.traineddata
Wrote Thaana.lstm
Wrote Thaana.lstm-punc-dawg
Wrote Thaana.lstm-word-dawg
Wrote Thaana.lstm-number-dawg
Wrote Thaana.lstm-unicharset
Wrote Thaana.lstm-recoder
Wrote Thaana.version
Version string:4.00.00alpha:Thaana:synth20170629
17:lstm:size=7723707, offset=192
18:lstm-punc-dawg:size=5674, offset=7723899
19:lstm-word-dawg:size=5036906, offset=7729573
20:lstm-number-dawg:size=4762, offset=12766479
21:lstm-unicharset:size=10741, offset=12771241
22:lstm-recoder:size=1633, offset=12781982
23:version:size=33, offset=12783615
root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best#
You can further get the original wordlists by using dawg2wordlist
But the actual training_text will not be there.
Great. Thanks. Will work on that.
dawg2wordlist syntax will be as follows
$ dawg2wordlist Thaana.lstm-unicharset Thaana.lstm-word-dawg Thaana.wordlist
Loading word list from Thaana.lstm-word-dawg
Reading squished dawg
Word list loaded.
The same applies to the punc and number dawgs.
You can review these files for accuracy.
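A minimal sketch of running the same command over all three LSTM dawgs, assuming the components were already unpacked with combine_tessdata -u into the current directory. The function name dump_wordlists and the DRY_RUN switch are hypothetical, and the output names (.wordlist, .punc, .numbers) are only an assumption modelled on the file names used in the langdata repository:

```shell
# Hypothetical helper: convert each unpacked LSTM dawg of a language back
# into a plain-text list. With DRY_RUN set, the commands are only printed.
dump_wordlists() {
  lang="$1"
  for kind in word punc number; do
    # Assumed output names, following the langdata naming convention.
    case "$kind" in
      word)   out="${lang}.wordlist" ;;
      punc)   out="${lang}.punc" ;;
      number) out="${lang}.numbers" ;;
    esac
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
    ${DRY_RUN:+echo} dawg2wordlist "${lang}.lstm-unicharset" \
      "${lang}.lstm-${kind}-dawg" "$out"
  done
}

# Print the three commands for Thaana without executing them.
DRY_RUN=1
dump_wordlists Thaana
```

Unsetting DRY_RUN and calling dump_wordlists div would run the real conversions for the div files instead of printing them.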
I don't think tesseract uses unigrams and bigrams for training, though they may be used internally at Google to generate a representative training text.
FYI Thaana files will have both English and Divehi. div files will have only Divehi.
Question: do I still need to create the box files even if we are using the LSTM method?
You have to use the tesstrain.sh script; also see tesstrain_utils.sh and language-specific.sh in the training directory.
These create the box/tiff files from the training text and specified fonts. They are used for creating the lstmf files and are kept only in the tmp directory.
Try the training tutorial for English and look at the log file and tmp directory.
Any help? I am getting this error
sudo /Users/sofwath/tesseract/training/tesstrain.sh --fonts_dir /Users/sofwath/dev/MLAI/tesseract/font/ --lang div --linedata_only --noextract_font_properties --langdata_dir langdata --tessdata_dir tessdata/ --output_dir divtrain/
=== Starting training for language 'div'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
[Wed Oct 4 10:49:19 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 10:49:20 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable
You are getting this error:
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
See tesstrain_utils.sh lines 29 and 172.
training uses the /tmp directory for creating files
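The usage text in the error lists only short flags, which suggests that the mktemp on this machine does not understand the GNU-style --tmpdir long option. A minimal sketch of a call that both GNU and BSD mktemp accept is to pass the full template path instead. The helper name make_font_cache_dir is hypothetical and not part of tesstrain_utils.sh, and falling back to /tmp when TMPDIR is unset is an assumption:

```shell
# Hypothetical helper: create a scratch directory without the GNU-only
# --tmpdir flag by spelling out the template path ourselves.
make_font_cache_dir() {
  mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX"
}

# Same variable name as in the training script; the value is one path.
FONT_CONFIG_CACHE="$(make_font_cache_dir)"
export FONT_CONFIG_CACHE
echo "font cache dir: ${FONT_CONFIG_CACHE}"
```

The directory is not cleaned up here, so a real script would remove it when training finishes.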
In tesstrain_utils.sh I changed
export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
to
export FONT_CONFIG_CACHE=$(mktemp -d -tmpdir font_tmp.XXXXXXXXXX)
The first error was fixed, but I still get:

=== Starting training for language 'div'
/Users/sofwath/tesseract/training/tesstrain_utils.sh: line 197: ${sample_path}: ambiguous redirect
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx
'--text' option is missing!

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
'--text' option is missing!
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable
You have to look at your paths
'--text' option is missing!
I followed this example:
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
For tesstrain.sh there is no --text command line option.
/Users/sofwath/tesseract/training/tesstrain_utils.sh: line 197: ${sample_path}: ambiguous redirect
If you are changing the bash script, you have to make sure it is done correctly. Please look at the error messages you get.
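One plausible reading of the log, offered as an assumption rather than a confirmed diagnosis: with -tmpdir, the mktemp on this machine takes -t with the prefix mpdir and then treats font_tmp.XXXXXXXXXX as a second template, so it prints two paths (mpdir.8n1I8qtq and font_tmp.f7jJ5RoYKx). The variable then holds two words, which would explain both the ambiguous redirect and text2image losing its --text argument. The sketch below rejects such a value before it is used; the helper name is_single_dir is hypothetical and not part of tesstrain_utils.sh:

```shell
# A literal newline, used to detect values that contain more than one path.
NL='
'

# Hypothetical helper: succeeds only when its argument is one existing
# directory with no embedded newline.
is_single_dir() {
  case "$1" in
    *"$NL"*) return 1 ;;
  esac
  [ -d "$1" ]
}

good_dir="$(mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX")"
# Simulated bad value: two paths joined by a newline, as in the log above.
bad_dir="${good_dir}${NL}font_tmp.f7jJ5RoYKx"

for candidate in "$good_dir" "$bad_dir"; do
  if is_single_dir "$candidate"; then
    sample_path="${candidate}/sample_text.txt"
    # Quoting the redirect target keeps it a single word.
    printf 'sample\n' > "${sample_path}"
    echo "ok: wrote ${sample_path}"
  else
    echo "rejected: value is not a single directory" >&2
  fi
done
```

Quoting alone would silence the redirect error but still point at a path that does not exist, so checking the value first gives a clearer failure.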
=== Starting training for language 'div'
[Thu Oct 5 19:39:30 DST 2017] /usr/local/bin/text2image --fonts_dir=/mnt/c/Windows/Fonts --font=MV Typewriter --outputbase=/tmp/font_tmp.v2PwMI2E8F/sample_text.txt --text=/tmp/font_tmp.v2PwMI2E8F/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.v2PwMI2E8F
Rendered page 0 to file /tmp/font_tmp.v2PwMI2E8F/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Thu Oct 5 19:40:32 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.v2PwMI2E8F --fonts_dir=/mnt/c/Windows/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=../langdata/div/div.training_text
Rendered page 0 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif
Rendered page 1 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif
Rendered page 2 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[Thu Oct 5 19:40:42 DST 2017] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.jFPtcB8yoM/div/div.unicharset --norm_mode 2 /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.box
Extracting unicharset from box file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.box
Word started with a combiner:0x7b0
Normalization failed for string 'ްނ'
Word started with a combiner:0x7aa
Normalization failed for string 'ުށ'
Word started with a combiner:0x7ac
Normalization failed for string 'ެފ'
Word started with a combiner:0x7b0
Normalization failed for string 'ްށ'
Word started with a combiner:0x7ae
Normalization failed for string 'ޮކ'
Word started with a combiner:0x7b0
I was able to run the program. But there are errors. See attached log file.
I've been trying on macOS, but the bash scripts give too many errors. I'll try to run the process on Linux.
@Sofwath Is the Dhivehi training data usable now, or has it been abandoned? I have not seen any updates regarding this since 2017.