
Training failed for Persian language with new font

Open mohsenomidi opened this issue 3 years ago • 22 comments

Dear All,

I am trying to train Tesseract with a new font ("B Nazanin", attached to this issue). My steps are below. I am using the langdata_lstm repository, and the tessdata comes from tessdata_best. For fas.config I used the attached file, which is the same as the Arabic one; Arabic and Persian share the same structure, with similar (but not identical) letters and words.

However, the fas.traineddata in that repository does not appear to be valid, so I am using the apt-installed file from my /usr/share/tesseract-ocr/5/tessdata directory instead. That file is fine.

With the fas.training_text from the langdata_lstm repository, running tesstrain.py gives this error:

[22:09:35] INFO - Log file location: /tmp/fas-2022-01-011bwkauqw/tesstrain.log
[22:09:35] INFO - === Starting training for language fas
[22:09:35] INFO - Testing font: B Nazanin
[22:09:37] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][22:09:37] INFO - Rendering using B Nazanin
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.36s/it]
[22:09:48] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[22:09:48] INFO - === Phase E: Generating lstmf files ===
[22:09:48] INFO - Using fas.config
[22:09:48] INFO - Using TESSDATA_PREFIX=tesseract/tessdata
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][22:09:49] ERROR - Page 1
Failed to read boxes from /tmp/fas-2022-01-011bwkauqw/fas.B_Nazanin.exp0.tif
Error during processing.

[22:09:49] CRITICAL - Program /usr/bin/tesseract failed with return code 1. Abort.
  0%|                                                                                                                                                                                 | 0/1 [00:01<?, ?it/s]
Temporary files retained at: /tmp/fas-2022-01-011bwkauqw

If I change fas.training_text to the attached file, the first step passes. In the eval (second) step, I get the errors "Can't encode transcription:" and "Encoding of string failed! Failure bytes:" for almost all texts:

fas.lstm is not a recognition model, trying training checkpoint...
Loaded 406/406 lines (1-406) of document train/fas.B_Nazanin.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Encoding of string failed! Failure bytes: d9 81 d9 82 d9 88 d8 aa d9 85 20 d8 a7 d8 b1 20 d8 b3 d9 84 d8 b7 d8 a7 20 d8 b3 d9 88 d9 86 d8 a7 db 8c d9 82 d8 a7 20 d8 b2 d8 a7 d8 b1 d9 81 20 d8 b1 d8 a8 20 d8 af d9 88 d8 ae 20 db 8c d8 a7 d9 87 d8 b2 d8 a7 d9 88 d8 b1 d9 be 20 da a9 db 8c d8 aa d9 86 d8 a7 d9 84 d8 aa d8 a2 20 d9 86 db 8c d8 ac d8 b1 db 8c d9 88 20 d9 88 20 d8 b2 db 8c d9 88 d8 b1 db 8c d8 a7 20 d8 b4 db 8c d8 aa db 8c d8 b1 d8 a8 20 d8 8c d8 b3 d9 86 d8 a7 d8 b1 d9 81 d8 b1 db 8c d8 a7 20 d8 af d9 86 d9 86 d8 a7 d9 85 20 db 8c d9 84 d9 84 d9 85 d9 84 d8 a7 20 d9 86 db 8c d8 a8 20 db 8c db 8c d8 a7 d9 85 db 8c d9 be d8 a7 d9 88 d9 87 20 db 8c d8 a7 d9 87 d8 aa da a9 d8 b1 d8 b4
Can't encode transcription: 'فقوتم ار سلطا سونایقا زارف رب دوخ یاهزاورپ کیتنالتآ نیجریو و زیوریا شیتیرب ،سنارفریا دننام یللملا نیب ییامیپاوه یاهتکرش' in language ''
Encoding of string failed! Failure bytes: d8 b2 d8 a7 20 db 8c d8 b1 d8 a7 db 8c d8 b3 d8 a8 20 d9 88 20 d8 aa d8 b3 d8 a7 20 d8 af d9 88 d8 ac d9 88 d9 85 20 d8 b9 d8 b6 d9 88 20 d8 b1 d8 a8 d8 a7 d8 b1 d8 a8 20 d9 88 d8 af 20 d8 b1 d9 88 d8 b4 da a9 20 d8 b1 d8 af 20 db 8c d8 aa d8 a7 db 8c d9 84 d8 a7 d9 85 20 d8 aa db 8c d9 81 d8 b1 d8 b8 20 d9 87 da a9 20 d8 af db 8c d9 88 da af 20 db 8c d9 85 20 db 8c d9 86 db 8c d8 a8 d9 85 d9 85 20 db 8c d8 a7 d9 82 d8 a2 2e d8 af d9 86 da a9 20 d9 85 da a9 20 d8 aa d9 84 d9 88 d8 af 20 db 8c d9 85 d9 88 d9 85 d8 b9 20 d9 87 d8 ac d8 af d9 88 d8 a8 20 d8 b1 d8 af 20 d8 a7 d8 b1
Can't encode transcription: 'لغاشم زا یرایسب و تسا دوجوم عضو ربارب ود روشک رد یتایلام تیفرظ هک دیوگ یم ینیبمم یاقآ.دنک مک تلود یمومع هجدوب رد ار' in language ''
Encoding of string failed! Failure bytes: 2e d8 af db 8c d8 b3 d8 b1 20 d8 af d9 87 d8 a7 d9 88 d8 ae 20 d8 a7 da a9 db 8c d8 b1 d9 85 d8 a2 20 d8 b1 da af db 8c d8 af 20 d8 aa d9 84 d8 a7 db 8c d8 a7 20 d9 87 d8 af d8 b2 d8 a7 d9 88 d8 af 20 d9 87 d8 a8 20 d8 8c 20 d9 87 d8 af d9 86 db 8c d8 a2 20 d8 aa d8 b9 d8 a7 d8 b3 20 db b3 db b6 20 d8 a7 d8 aa 20 db b2 db b4 20 d9 81 d8 b1 d8 b8 20 db 8c d8 af d9 86 d8 b3 20 d9 86 d8 a7 d9 81 d9 88 d8 aa 20 d8 8c d9 86 d8 a7 d8 b3 d8 a7 d9 86 d8 b4 d8 b1 d8 a7 da a9 20 db 8c d9 86 db 8c d8 a8 20 d8 b4 db 8c d9 be 20 d8 b3 d8 a7 d8 b3 d8 a7 d8 b1 d8 a8 2e d8 af db 8c d8 b3 d8 b1
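
The failure bytes in these messages are plain UTF-8, so decoding them shows which text could not be encoded against the model's unicharset. The following is a minimal Python sketch, not part of the Tesseract tooling: the helper names `decode_failure_bytes`, `load_unichars` and `chars_missing_from_unicharset` are made up for illustration, and the parsing assumes the usual text layout of a unicharset file (an entry count on the first line, then one entry per line whose first space-separated field is the character, with `NULL` standing for space). Multi-codepoint entries are only approximated by the per-character check.

```python
def decode_failure_bytes(hex_dump: str) -> str:
    """Turn 'd9 81 d9 82 ...' from the error message back into text."""
    return bytes.fromhex(hex_dump).decode("utf-8")


def load_unichars(unicharset_lines):
    """Collect the first field of every entry line (the count line is skipped)."""
    unichars = set()
    for line in unicharset_lines[1:]:
        fields = line.rstrip("\n").split(" ")
        if not fields[0]:
            continue
        # Assumption: the space character is stored as the literal word NULL.
        unichars.add(" " if fields[0] == "NULL" else fields[0])
    return unichars


def chars_missing_from_unicharset(text, unichars):
    """Return the characters of `text` that have no single-character entry."""
    return sorted({ch for ch in text if ch not in unichars})


if __name__ == "__main__":
    # First few bytes of the first failure message above.
    print(decode_failure_bytes("d9 81 d9 82 d9 88 d8 aa d9 85"))
```

One possible way to use this is to unpack the traineddata components with `combine_tessdata -u`, read the resulting lstm-unicharset file with `load_unichars`, and pass each decoded transcription to `chars_missing_from_unicharset`.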

My first step:

rm -rf train/*
../tesstrain/src/training/tesstrain.py --fonts_dir fonts \
        --fontlist 'B Nazanin' \
        --ptsize 20 \
        --lang fas \
        --linedata_only \
        --langdata_dir langdata_lstm \
        --tessdata_dir tesseract/tessdata \
        --save_box_tiff \
        --maxpages 10 \
        --output_dir train

I also tried different font sizes with the script above.

Second step:

lstmeval --model fas.lstm \
        --traineddata tesseract/tessdata/fas.traineddata \
        --eval_listfile train/fas.training_files.txt

For this step I need to extract the LSTM model from the best traineddata file:

combine_tessdata -e tesseract/tessdata/fas.traineddata fas.lstm

As I described above, extracting the LSTM model fails with the traineddata from the best repository, so I used the installed version.

Returned result:

Extracting tessdata components from tesseract/tessdata/fas.traineddata
Wrote fas.lstm
Version:5.0.0
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002

Here is my next step, to fine-tune the model, but it also returns the "Can't encode transcription" and "Encoding of string failed! Failure bytes" errors for all texts:

rm -rf output/*
OMP_THREAD_LIMIT=16 lstmtraining \
        --continue_from fas.lstm \
        --model_output output/moh \
        --traineddata tesseract/tessdata/fas.traineddata \
        --train_listfile train/fas.training_files.txt \
        --max_iterations 1000

Attached files: 1- TTF font file 2- fas.config 3- fas.training_text (a sample that works with the script; the training_text from langdata_lstm returns the error in the first step)

Are there any solutions?

IssueAttachments.zip

mohsenomidi avatar Jan 01 '22 19:01 mohsenomidi

Happy new year to everyone

I tried many times with different configurations, but didn't succeed...

Are there any ideas or solutions?

mohsenomidi avatar Jan 04 '22 07:01 mohsenomidi

as i described above the extraction lstm is failed with traineddata in best repository, and i just used the installed version.

returned result :

Extracting tessdata components from tesseract/tessdata/fas.traineddata
Wrote fas.lstm
Version:5.0.0
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002

I am not able to reproduce the above results. File from tessdata_best works fine for me.

Results from tessdata_best, tessdata_fast and tessdata below.

$ combine_tessdata -dl ~/tessdata_best/fas.traineddata

Version:4.00.00alpha:fas:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]
17:lstm:size=3177995, offset=192
18:lstm-punc-dawg:size=1362, offset=3178187
19:lstm-word-dawg:size=128986, offset=3179549
20:lstm-number-dawg:size=10810, offset=3308535
21:lstm-unicharset:size=5667, offset=3319345
22:lstm-recoder:size=859, offset=3325012
23:version:size=80, offset=3325871
LSTM: network=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1], int_mode=0, recoding=1, iteration=896400, sample_iteration=897843, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.00025, :2(Maxpool)=0.001, :3:0(Lfys64)=0.00025, :4(Lfx96)=0.00025, :5:0(Lrx96)=0.00025, :6(Lfx192)=0.00025, :7(Output)=0.00025
$ combine_tessdata -dl ~/tessdata_fast/fas.traineddata
Version:4.00.00alpha:fas:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1]
17:lstm:size=283540, offset=192
18:lstm-punc-dawg:size=1362, offset=283732
19:lstm-word-dawg:size=128986, offset=285094
20:lstm-number-dawg:size=10810, offset=414080
21:lstm-unicharset:size=5667, offset=424890
22:lstm-recoder:size=859, offset=430557
23:version:size=80, offset=431416
LSTM: network=[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1], int_mode=1, recoding=1, iteration=2762200, sample_iteration=2773866, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.000125, :2(Maxpool)=0.001, :3:0(Lfys48)=0.000125, :4(Lfx96)=0.000125, :5:0(Lrx96)=0.000125, :6(Lfx128)=0.000125, :7(Output)=0.000125
$ combine_tessdata -dl ~/tessdata/fas.traineddata
Version:4.00.00alpha:fas:best2int20180322
0:config:size=27, offset=192
17:lstm:size=413332, offset=219
18:lstm-punc-dawg:size=1362, offset=413551
19:lstm-word-dawg:size=128986, offset=414913
20:lstm-number-dawg:size=10810, offset=543899
21:lstm-unicharset:size=5667, offset=554709
22:lstm-recoder:size=859, offset=560376
23:version:size=33, offset=561235
LSTM: network=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1], int_mode=1, recoding=1, iteration=896400, sample_iteration=897843, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.00025, :2(Maxpool)=0.001, :3:0(Lfys64)=0.00025, :4(Lfx96)=0.00025, :5:0(Lrx96)=0.00025, :6(Lfx192)=0.00025, :7(Output)=0.00025
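
To compare listings like these side by side, the component lines can be parsed into a name-to-size mapping. This is a rough Python sketch under the assumption that every component line has the `N:name:size=S, offset=O` shape seen in the output above; `parse_components` and `diff_components` are illustrative names, not part of any Tesseract tool.

```python
import re

# Matches lines such as "21:lstm-unicharset:size=1978, offset=2965723".
COMPONENT_RE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_components(listing: str) -> dict:
    """Map component name -> size in bytes; other lines are ignored."""
    sizes = {}
    for line in listing.splitlines():
        match = COMPONENT_RE.match(line.strip())
        if match:
            sizes[match.group(2)] = int(match.group(3))
    return sizes


def diff_components(a: dict, b: dict) -> dict:
    """Return {name: (size_in_a, size_in_b)} for components that differ or are missing."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


if __name__ == "__main__":
    installed = """\
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002
"""
    best = """\
17:lstm:size=3177995, offset=192
21:lstm-unicharset:size=5667, offset=3319345
22:lstm-recoder:size=859, offset=3325012
23:version:size=80, offset=3325871
"""
    changes = diff_components(parse_components(installed), parse_components(best))
    for name, (old, new) in changes.items():
        print(f"{name}: installed={old} best={new}")
```

With the listing from the original report and the tessdata_best listing, the lstm-unicharset (1978 vs 5667 bytes) and lstm-recoder (301 vs 859 bytes) sizes differ. That would be consistent with the encoding failures reported earlier, although nothing in this thread confirms it.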

Shreeshrii avatar Jan 04 '22 16:01 Shreeshrii

@Shreeshrii Thank you so much for your reply. I don't understand what happened before; I just cloned the best repository again, and now the second and third steps work fine.

But the first problem still exists with the new clone:

I just copied fas.traineddata from best to my tesseract/tessdata directory

and executed the command below to generate the new TIFF file for the new font:

../tesstrain/src/training/tesstrain.py --fonts_dir fonts \
        --fontlist 'B Nazanin' \
        --ptsize 20 \
        --lang fas \
        --linedata_only \
        --langdata_dir langdata_lstm \
        --tessdata_dir tesseract/tessdata \
        --save_box_tiff \
        --maxpages 10 \
        --output_dir train

The returned error is:

[20:53:28] INFO - Log file location: /tmp/fas-2022-01-0474uqtjuu/tesstrain.log
[20:53:28] INFO - === Starting training for language fas
[20:53:28] INFO - Testing font: B Nazanin
[20:53:29] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][20:53:29] INFO - Rendering using B Nazanin
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.35s/it]
[20:53:41] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[20:53:41] INFO - === Phase E: Generating lstmf files ===
[20:53:41] INFO - Using fas.config
[20:53:41] INFO - Using TESSDATA_PREFIX=tesseract/tessdata
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][20:53:42] ERROR - Page 1
Failed to read boxes from /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Error during processing.

[20:53:42] CRITICAL - Program /usr/bin/tesseract failed with return code 1. Abort.
  0%|                                                                                                                                                                                 | 0/1 [00:01<?, ?it/s]
Temporary files retained at: /tmp/fas-2022-01-0474uqtjuu

Log file:

[2022-01-04 20:53:28,307] - INFO - root - === Starting training for language fas
[2022-01-04 20:53:28,307] - DEBUG - language_specific - ambigs_filter_denominator = 100000
[2022-01-04 20:53:28,308] - DEBUG - language_specific - bigram_dawg_factor = 0.015
[2022-01-04 20:53:28,308] - DEBUG - language_specific - exposures = [0] (was None)
[2022-01-04 20:53:28,308] - DEBUG - language_specific - filter_arguments = []
[2022-01-04 20:53:28,308] - DEBUG - language_specific - fonts = ['B Nazanin'] (set on cmdline)
[2022-01-04 20:53:28,309] - DEBUG - language_specific - fragments_disabled = y
[2022-01-04 20:53:28,309] - DEBUG - language_specific - generate_word_bigrams = None
[2022-01-04 20:53:28,309] - DEBUG - language_specific - lang_is_rtl = True
[2022-01-04 20:53:28,309] - DEBUG - language_specific - leading = 32
[2022-01-04 20:53:28,309] - DEBUG - language_specific - mean_count = 40
[2022-01-04 20:53:28,309] - DEBUG - language_specific - mix_lang = eng
[2022-01-04 20:53:28,309] - DEBUG - language_specific - norm_mode = 2
[2022-01-04 20:53:28,310] - DEBUG - language_specific - number_dawg_factor = 0.125
[2022-01-04 20:53:28,310] - DEBUG - language_specific - punc_dawg_factor = None
[2022-01-04 20:53:28,310] - DEBUG - language_specific - run_shape_clustering = False (set on cmdline)
[2022-01-04 20:53:28,310] - DEBUG - language_specific - text2image_extra_args = []
[2022-01-04 20:53:28,310] - DEBUG - language_specific - text_corpus = /fas.corpus.txt
[2022-01-04 20:53:28,310] - DEBUG - language_specific - training_data_arguments = []
[2022-01-04 20:53:28,311] - DEBUG - language_specific - word_dawg_factor = 0.05
[2022-01-04 20:53:28,311] - DEBUG - language_specific - word_dawg_size = None
[2022-01-04 20:53:28,311] - DEBUG - language_specific - wordlist2dawg_arguments =
[2022-01-04 20:53:28,312] - INFO - tesstrain_utils - Testing font: B Nazanin
[2022-01-04 20:53:28,312] - DEBUG - tesstrain_utils - Running /usr/bin/text2image
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --fonts_dir=fonts
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --font=B Nazanin
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --outputbase=/tmp/font_tmp2ur0uyxt/sample_text.txt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --text=/tmp/font_tmp2ur0uyxt/sample_text.txt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --fontconfig_tmpdir=/tmp/font_tmp2ur0uyxt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --ptsize=20
[2022-01-04 20:53:29,624] - DEBUG - /usr/bin/text2image - Stripped 1 unrenderable words
Rendered page 0 to file /tmp/font_tmp2ur0uyxt/sample_text.txt.tif

[2022-01-04 20:53:29,625] - INFO - tesstrain_utils - === Phase I: Generating training images ===
[2022-01-04 20:53:29,658] - INFO - tesstrain_utils - Rendering using B Nazanin
[2022-01-04 20:53:29,659] - DEBUG - tesstrain_utils - Running /usr/bin/text2image
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --fontconfig_tmpdir=/tmp/font_tmp2ur0uyxt
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --fonts_dir=fonts
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --strip_unrenderable_words
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --leading=32
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --char_spacing=0.0
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --exposure=0
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --outputbase=/tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --max_pages=10
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --font=B Nazanin
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --text=langdata_lstm/fas/fas.training_text
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --ptsize=20
[2022-01-04 20:53:40,998] - DEBUG - /usr/bin/text2image - Stripped 25 unrenderable words
Rendered page 0 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 18 unrenderable words
Rendered page 1 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 2 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 26 unrenderable words
Rendered page 3 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 23 unrenderable words
Error in boxCreate: y < 0 and box off +quad
Rendered page 4 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 27 unrenderable words
Rendered page 5 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 27 unrenderable words
Rendered page 6 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 21 unrenderable words
Rendered page 7 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 8 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 9 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!

[2022-01-04 20:53:41,003] - INFO - tesstrain_utils - === Phase UP: Generating unicharset and unichar properties files ===
[2022-01-04 20:53:41,005] - DEBUG - tesstrain_utils - Running /usr/bin/unicharset_extractor
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - --output_unicharset
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - --norm_mode
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - 2
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.box
[2022-01-04 20:53:41,026] - DEBUG - /usr/bin/unicharset_extractor - Failed to read data from: /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.box
Wrote unicharset file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset

[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - Running /usr/bin/set_unicharset_properties
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - -U
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - -O
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - -X
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.xheights
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - --script_dir=langdata_lstm
[2022-01-04 20:53:41,082] - DEBUG - /usr/bin/set_unicharset_properties - Loaded unicharset of size 3 from file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset

[2022-01-04 20:53:41,083] - INFO - tesstrain_utils - === Phase E: Generating lstmf files ===
[2022-01-04 20:53:41,083] - DEBUG - tesstrain_utils - [PosixPath('/tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif')]
[2022-01-04 20:53:41,084] - INFO - tesstrain_utils - Using fas.config
[2022-01-04 20:53:41,084] - INFO - tesstrain_utils - Using TESSDATA_PREFIX=tesseract/tessdata
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - Running /usr/bin/tesseract
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0
[2022-01-04 20:53:41,087] - DEBUG - tesstrain_utils - lstm.train
[2022-01-04 20:53:41,087] - DEBUG - tesstrain_utils - langdata_lstm/fas/fas.config
[2022-01-04 20:53:42,343] - ERROR - /usr/bin/tesseract - Page 1
Failed to read boxes from /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Error during processing.

[2022-01-04 20:53:42,344] - CRITICAL - tesstrain_utils - Program /usr/bin/tesseract failed with return code 1. Abort.
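
In the log above, the first signs of trouble ("Error in boxCreate", "Null box at index 0") appear during text2image rendering, well before tesseract reports "Failed to read boxes". A small Python sketch for pulling those signals out of a tesstrain.log is given here; the marker strings are copied from this log only and are not an exhaustive list, and `summarize_log` is a hypothetical helper name.

```python
import re
import sys

STRIPPED_RE = re.compile(r"Stripped (\d+) unrenderable words")

# Messages taken from the log in this issue; other failures may look different.
ERROR_MARKERS = (
    "Error in boxCreate",
    "Null box at index",
    "Call PrepareToWrite before WriteTesseractBoxFile",
    "Failed to read data from",
    "Failed to read boxes from",
)


def summarize_log(lines):
    """Return (total stripped words, [(line_number, text)] for known error markers)."""
    stripped_total = 0
    errors = []
    for number, line in enumerate(lines, start=1):
        match = STRIPPED_RE.search(line)
        if match:
            stripped_total += int(match.group(1))
        if any(marker in line for marker in ERROR_MARKERS):
            errors.append((number, line.strip()))
    return stripped_total, errors


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        total, found = summarize_log(handle)
    print(f"unrenderable words stripped: {total}")
    for number, text in found:
        print(f"line {number}: {text}")
```

The count of stripped words is included because a font that cannot render many words of the training text leaves less material for box generation; whether that contributes to the failure here is not established in this thread.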
                                                                                                                                            

Would you please check this error?

I appreciate your help 🙇‍♂️

mohsenomidi avatar Jan 04 '22 17:01 mohsenomidi

Your training_text has very long lines as well as English text.

Why don't you test with the fas.training_text from the langdata repo, which is a smaller file, and see if that works?
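
One way to check a training text for the two properties mentioned here (very long lines and embedded English) is a short Python pass over the file. This is only a sketch: the 100-character threshold is an arbitrary assumption rather than a Tesseract limit, and `audit_training_text` is an illustrative name.

```python
import re

LATIN_RE = re.compile(r"[A-Za-z]")


def audit_training_text(lines, max_chars=100):
    """Yield (line_number, length, has_latin) for lines that are long or contain Latin letters."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        has_latin = bool(LATIN_RE.search(text))
        if len(text) > max_chars or has_latin:
            yield number, len(text), has_latin


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            for number, length, has_latin in audit_training_text(handle):
                note = "contains Latin letters" if has_latin else "long line"
                print(f"line {number}: {length} chars, {note}")
```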

Shreeshrii avatar Jan 04 '22 17:01 Shreeshrii

I am just using the files from langdata_lstm; the error happens with that repository.

Do you mean using this repo: langdata?

mohsenomidi avatar Jan 04 '22 17:01 mohsenomidi

This is a problem with the text2image program. See the outstanding issue in the tesseract repo: https://github.com/tesseract-ocr/tesseract/issues/3563

Shreeshrii avatar Jan 05 '22 04:01 Shreeshrii

@Shreeshrii Thanks for your help, I will continue in that thread.

mohsenomidi avatar Jan 05 '22 10:01 mohsenomidi

Hello. I'm a little bit confused about combine_lang_model. The documentation at https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#training-text-requirements says that combine_lang_model builds its data from a unicharset file: "A new tool: combine_lang_model is provided to make a starter traineddata from a unicharset and optional wordlists." I hope this helps in some way.
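
For reference, a call to combine_lang_model along the lines of that documentation could be assembled as in the Python sketch below. The paths in the example are placeholders rather than values from this issue, `combine_lang_model_argv` is a hypothetical helper, and the command is only built and printed, not executed.

```python
import shlex


def combine_lang_model_argv(unicharset, script_dir, output_dir, lang, rtl=False):
    """Build the argument list for combine_lang_model without running it."""
    argv = [
        "combine_lang_model",
        "--input_unicharset", unicharset,
        "--script_dir", script_dir,
        "--output_dir", output_dir,
        "--lang", lang,
    ]
    if rtl:
        # Persian is written right to left (the log above shows lang_is_rtl = True).
        argv.append("--lang_is_rtl")
    return argv


if __name__ == "__main__":
    # Placeholder paths; adjust to the actual training layout.
    argv = combine_lang_model_argv(
        "train/fas.unicharset", "langdata_lstm", "output", "fas", rtl=True
    )
    print(shlex.join(argv))
    # To execute instead of printing: subprocess.run(argv, check=True)
```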

TheFattestTony avatar Jan 05 '22 16:01 TheFattestTony

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 08:04 stale[bot]

Waiting for linked issue investigation result

mohsenomidi avatar Apr 16 '22 15:04 mohsenomidi

Hi, could you please explain how you resolved the "Encoding of string failed! Failure bytes" error?

Nadiam75 avatar May 12 '22 05:05 Nadiam75

As you can see from the history of this issue, the problem is a bug in the Tesseract core. The related issue was opened in the main repository and is linked here; you can follow up in that thread.

mohsenomidi avatar May 12 '22 06:05 mohsenomidi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 13 '22 01:06 stale[bot]

Waiting for referenced issue

mohsenomidi avatar Jun 13 '22 04:06 mohsenomidi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 13 '22 09:08 stale[bot]

Still waiting

mohsenomidi avatar Aug 16 '22 18:08 mohsenomidi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 02 '22 01:11 stale[bot]

Waiting for referenced issue

mohsenomidi avatar Nov 02 '22 05:11 mohsenomidi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 08 '23 02:01 stale[bot]

Still waiting for response

mohsenomidi avatar Jan 08 '23 15:01 mohsenomidi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 22 '23 01:05 stale[bot]

Still waiting for response

mohsenomidi avatar May 22 '23 22:05 mohsenomidi