tesstrain
Training failed for Persian language with new font
Dear All,
I am trying to train Tesseract with a new font ("B Nazanin", attached to this issue). Here are my steps. I am using the langdata_lstm repository, and my tessdata comes from tessdata_best.
For fas.config I used the attached file, which is the same as the Arabic one. Arabic and Persian have the same structure, with similar (but not identical) letters and words.
However, the fas.traineddata from that repository does not seem to be valid, so I am trying to use the apt-installed file from my /usr/share/tesseract-ocr/5/tessdata directory instead. That file is fine.
With the fas.training_text from the langdata_lstm repository, I got this error while executing tesstrain.py:
[22:09:35] INFO - Log file location: /tmp/fas-2022-01-011bwkauqw/tesstrain.log
[22:09:35] INFO - === Starting training for language fas
[22:09:35] INFO - Testing font: B Nazanin
[22:09:37] INFO - === Phase I: Generating training images ===
0%| | 0/1 [00:00<?, ?it/s][22:09:37] INFO - Rendering using B Nazanin
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.36s/it]
[22:09:48] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[22:09:48] INFO - === Phase E: Generating lstmf files ===
[22:09:48] INFO - Using fas.config
[22:09:48] INFO - Using TESSDATA_PREFIX=tesseract/tessdata
0%| | 0/1 [00:00<?, ?it/s][22:09:49] ERROR - Page 1
Failed to read boxes from /tmp/fas-2022-01-011bwkauqw/fas.B_Nazanin.exp0.tif
Error during processing.
[22:09:49] CRITICAL - Program /usr/bin/tesseract failed with return code 1. Abort.
0%| | 0/1 [00:01<?, ?it/s]
Temporary files retained at: /tmp/fas-2022-01-011bwkauqw
If I replace fas.training_text with the attached file, the first step passes.
In the eval (second) step, however, I get the errors "Can't encode transcription:" and "Encoding of string failed! Failure bytes:" for almost all texts:
fas.lstm is not a recognition model, trying training checkpoint...
Loaded 406/406 lines (1-406) of document train/fas.B_Nazanin.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Encoding of string failed! Failure bytes: d9 81 d9 82 d9 88 d8 aa d9 85 20 d8 a7 d8 b1 20 d8 b3 d9 84 d8 b7 d8 a7 20 d8 b3 d9 88 d9 86 d8 a7 db 8c d9 82 d8 a7 20 d8 b2 d8 a7 d8 b1 d9 81 20 d8 b1 d8 a8 20 d8 af d9 88 d8 ae 20 db 8c d8 a7 d9 87 d8 b2 d8 a7 d9 88 d8 b1 d9 be 20 da a9 db 8c d8 aa d9 86 d8 a7 d9 84 d8 aa d8 a2 20 d9 86 db 8c d8 ac d8 b1 db 8c d9 88 20 d9 88 20 d8 b2 db 8c d9 88 d8 b1 db 8c d8 a7 20 d8 b4 db 8c d8 aa db 8c d8 b1 d8 a8 20 d8 8c d8 b3 d9 86 d8 a7 d8 b1 d9 81 d8 b1 db 8c d8 a7 20 d8 af d9 86 d9 86 d8 a7 d9 85 20 db 8c d9 84 d9 84 d9 85 d9 84 d8 a7 20 d9 86 db 8c d8 a8 20 db 8c db 8c d8 a7 d9 85 db 8c d9 be d8 a7 d9 88 d9 87 20 db 8c d8 a7 d9 87 d8 aa da a9 d8 b1 d8 b4
Can't encode transcription: 'فقوتم ار سلطا سونایقا زارف رب دوخ یاهزاورپ کیتنالتآ نیجریو و زیوریا شیتیرب ،سنارفریا دننام یللملا نیب ییامیپاوه یاهتکرش' in language ''
Encoding of string failed! Failure bytes: d8 b2 d8 a7 20 db 8c d8 b1 d8 a7 db 8c d8 b3 d8 a8 20 d9 88 20 d8 aa d8 b3 d8 a7 20 d8 af d9 88 d8 ac d9 88 d9 85 20 d8 b9 d8 b6 d9 88 20 d8 b1 d8 a8 d8 a7 d8 b1 d8 a8 20 d9 88 d8 af 20 d8 b1 d9 88 d8 b4 da a9 20 d8 b1 d8 af 20 db 8c d8 aa d8 a7 db 8c d9 84 d8 a7 d9 85 20 d8 aa db 8c d9 81 d8 b1 d8 b8 20 d9 87 da a9 20 d8 af db 8c d9 88 da af 20 db 8c d9 85 20 db 8c d9 86 db 8c d8 a8 d9 85 d9 85 20 db 8c d8 a7 d9 82 d8 a2 2e d8 af d9 86 da a9 20 d9 85 da a9 20 d8 aa d9 84 d9 88 d8 af 20 db 8c d9 85 d9 88 d9 85 d8 b9 20 d9 87 d8 ac d8 af d9 88 d8 a8 20 d8 b1 d8 af 20 d8 a7 d8 b1
Can't encode transcription: 'لغاشم زا یرایسب و تسا دوجوم عضو ربارب ود روشک رد یتایلام تیفرظ هک دیوگ یم ینیبمم یاقآ.دنک مک تلود یمومع هجدوب رد ار' in language ''
Encoding of string failed! Failure bytes: 2e d8 af db 8c d8 b3 d8 b1 20 d8 af d9 87 d8 a7 d9 88 d8 ae 20 d8 a7 da a9 db 8c d8 b1 d9 85 d8 a2 20 d8 b1 da af db 8c d8 af 20 d8 aa d9 84 d8 a7 db 8c d8 a7 20 d9 87 d8 af d8 b2 d8 a7 d9 88 d8 af 20 d9 87 d8 a8 20 d8 8c 20 d9 87 d8 af d9 86 db 8c d8 a2 20 d8 aa d8 b9 d8 a7 d8 b3 20 db b3 db b6 20 d8 a7 d8 aa 20 db b2 db b4 20 d9 81 d8 b1 d8 b8 20 db 8c d8 af d9 86 d8 b3 20 d9 86 d8 a7 d9 81 d9 88 d8 aa 20 d8 8c d9 86 d8 a7 d8 b3 d8 a7 d9 86 d8 b4 d8 b1 d8 a7 da a9 20 db 8c d9 86 db 8c d8 a8 20 d8 b4 db 8c d9 be 20 d8 b3 d8 a7 d8 b3 d8 a7 d8 b1 d8 a8 2e d8 af db 8c d8 b3 d8 b1
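The "Failure bytes" dumps above are hex-encoded UTF-8, so they can be turned back into readable text to see which characters the model is refusing to encode. The following is only a minimal sketch; the helper names `decode_failure_bytes` and `describe_code_points` are made up for illustration and are not part of Tesseract.

```python
def decode_failure_bytes(hex_dump: str) -> str:
    """Turn a space-separated hex dump from the log back into text."""
    return bytes.fromhex(hex_dump).decode("utf-8")


def describe_code_points(text: str) -> list[str]:
    """List the distinct code points (e.g. 'U+0641') in order of first appearance."""
    seen: list[str] = []
    for ch in text:
        label = f"U+{ord(ch):04X}"
        if label not in seen:
            seen.append(label)
    return seen


# The first ten bytes of the first "Failure bytes" line quoted above.
sample = "d9 81 d9 82 d9 88 d8 aa d9 85"
decoded = decode_failure_bytes(sample)
print(decoded)
print(describe_code_points(decoded))
```

Pasting a complete dump into `sample` should reproduce the string shown in the matching "Can't encode transcription" line, which makes it easier to look up individual code points.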
My first step:
rm -rf train/*
../tesstrain/src/training/tesstrain.py --fonts_dir fonts \
--fontlist 'B Nazanin' \
--ptsize 20 \
--lang fas \
--linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tesseract/tessdata \
--save_box_tiff \
--maxpages 10 \
--output_dir train
I also tried different font sizes with the above script.
Second step:
lstmeval --model fas.lstm \
--traineddata tesseract/tessdata/fas.traineddata \
--eval_listfile train/fas.training_files.txt
After this step I have to extract the LSTM model from the best traineddata file:
combine_tessdata -e tesseract/tessdata/fas.traineddata fas.lstm
As I described above, extracting the LSTM model failed with the traineddata from the best repository, so I just used the installed version.
Returned result:
Extracting tessdata components from tesseract/tessdata/fas.traineddata
Wrote fas.lstm
Version:5.0.0
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002
Here is my next step, fine-tuning the model, but it also returned the "Can't encode transcription" and "Encoding of string failed! Failure bytes" errors for all texts:
rm -rf output/*
OMP_THREAD_LIMIT=16 lstmtraining \
--continue_from fas.lstm \
--model_output output/moh \
--traineddata tesseract/tessdata/fas.traineddata \
--train_listfile train/fas.training_files.txt \
--max_iterations 1000
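A common reason for "Can't encode transcription" is that the ground-truth text contains characters that are missing from the model's lstm-unicharset component (entry 21 in the listing above). The sketch below checks that, assuming the unicharset has been unpacked to a text file whose first line is the entry count and whose following lines each start with the unichar, with `NULL` standing for the space. The function names and the sample data are hypothetical, and entries spanning several code points are not handled.

```python
def load_unichars(unicharset_text: str) -> set[str]:
    """Collect the first token of every entry line in a unicharset file."""
    lines = unicharset_text.splitlines()
    count = int(lines[0])  # first line holds the number of entries
    unichars: set[str] = set()
    for line in lines[1:1 + count]:
        token = line.split(" ", 1)[0]
        # Assumption: the space character is stored under the name NULL.
        unichars.add(" " if token == "NULL" else token)
    return unichars


def uncovered_characters(text: str, unichars: set[str]) -> list[str]:
    """Return the characters of `text` that have no single-character entry."""
    return sorted({ch for ch in text if ch not in unichars})


# Tiny made-up unicharset; the property columns are shortened placeholders.
SAMPLE_UNICHARSET = (
    "3\n"
    "NULL 0 Common 0\n"
    "\u0627 1 Arabic 1\n"  # ARABIC LETTER ALEF
    "\u0628 1 Arabic 2\n"  # ARABIC LETTER BEH
)

known = load_unichars(SAMPLE_UNICHARSET)
missing = uncovered_characters("\u0627\u0628 \u067e", known)  # U+067E is PEH
print([f"U+{ord(ch):04X}" for ch in missing])
```

If the list is non-empty for the real training text, the base model cannot represent those characters and plain fine-tuning would be expected to skip the affected lines.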
Attached files: 1. the TTF font file; 2. fas.config; 3. fas.training_text (a sample that works with the script; the training_text from langdata_lstm returned an error in the first step).
Are there any solutions?
Happy new year to everyone.
I tried many times with different configurations, but didn't succeed...
Is there any idea or solution?
> As I described above, extracting the LSTM model failed with the traineddata from the best repository, so I just used the installed version.
> Returned result:
> Extracting tessdata components from tesseract/tessdata/fas.traineddata
> Wrote fas.lstm
> Version:5.0.0
> 17:lstm:size=2965531, offset=192
> 21:lstm-unicharset:size=1978, offset=2965723
> 22:lstm-recoder:size=301, offset=2967701
> 23:version:size=5, offset=2968002
I am not able to reproduce the above results. The file from tessdata_best works fine for me.
Results from tessdata_best, tessdata_fast and tessdata below.
$ combine_tessdata -dl ~/tessdata_best/fas.traineddata
Version:4.00.00alpha:fas:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]
17:lstm:size=3177995, offset=192
18:lstm-punc-dawg:size=1362, offset=3178187
19:lstm-word-dawg:size=128986, offset=3179549
20:lstm-number-dawg:size=10810, offset=3308535
21:lstm-unicharset:size=5667, offset=3319345
22:lstm-recoder:size=859, offset=3325012
23:version:size=80, offset=3325871
LSTM: network=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1], int_mode=0, recoding=1, iteration=896400, sample_iteration=897843, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.00025, :2(Maxpool)=0.001, :3:0(Lfys64)=0.00025, :4(Lfx96)=0.00025, :5:0(Lrx96)=0.00025, :6(Lfx192)=0.00025, :7(Output)=0.00025
$ combine_tessdata -dl ~/tessdata_fast/fas.traineddata
Version:4.00.00alpha:fas:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1]
17:lstm:size=283540, offset=192
18:lstm-punc-dawg:size=1362, offset=283732
19:lstm-word-dawg:size=128986, offset=285094
20:lstm-number-dawg:size=10810, offset=414080
21:lstm-unicharset:size=5667, offset=424890
22:lstm-recoder:size=859, offset=430557
23:version:size=80, offset=431416
LSTM: network=[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1], int_mode=1, recoding=1, iteration=2762200, sample_iteration=2773866, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.000125, :2(Maxpool)=0.001, :3:0(Lfys48)=0.000125, :4(Lfx96)=0.000125, :5:0(Lrx96)=0.000125, :6(Lfx128)=0.000125, :7(Output)=0.000125
$ combine_tessdata -dl ~/tessdata/fas.traineddata
Version:4.00.00alpha:fas:best2int20180322
0:config:size=27, offset=192
17:lstm:size=413332, offset=219
18:lstm-punc-dawg:size=1362, offset=413551
19:lstm-word-dawg:size=128986, offset=414913
20:lstm-number-dawg:size=10810, offset=543899
21:lstm-unicharset:size=5667, offset=554709
22:lstm-recoder:size=859, offset=560376
23:version:size=33, offset=561235
LSTM: network=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1], int_mode=1, recoding=1, iteration=896400, sample_iteration=897843, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.00025, :2(Maxpool)=0.001, :3:0(Lfys64)=0.00025, :4(Lfx96)=0.00025, :5:0(Lrx96)=0.00025, :6(Lfx192)=0.00025, :7(Output)=0.00025
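To make the difference between the listings easier to see, the component lines can be parsed and compared side by side. This is a rough sketch: the regular expression only assumes the `index:name:size=N, offset=M` layout visible in the output above, and `parse_listing` and `compare` are illustrative names.

```python
import re

# Matches lines such as "21:lstm-unicharset:size=5667, offset=3319345".
ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(listing: str) -> dict[str, int]:
    """Map component name to size for one combine_tessdata listing."""
    sizes: dict[str, int] = {}
    for line in listing.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            sizes[match.group(2)] = int(match.group(3))
    return sizes


def compare(a: dict[str, int], b: dict[str, int]) -> list[tuple]:
    """Return (name, size_in_a, size_in_b) rows; None marks a missing component."""
    return [(name, a.get(name), b.get(name)) for name in sorted(set(a) | set(b))]


# Listing reported for the apt-installed file earlier in this thread.
INSTALLED = """17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002"""

# Listing for tessdata_best shown above.
BEST = """17:lstm:size=3177995, offset=192
18:lstm-punc-dawg:size=1362, offset=3178187
19:lstm-word-dawg:size=128986, offset=3179549
20:lstm-number-dawg:size=10810, offset=3308535
21:lstm-unicharset:size=5667, offset=3319345
22:lstm-recoder:size=859, offset=3325012
23:version:size=80, offset=3325871"""

for row in compare(parse_listing(INSTALLED), parse_listing(BEST)):
    print(row)
```

With these two inputs, the installed file shows no dawg components and a much smaller lstm-unicharset (1978 bytes against 5667), which suggests it is a different model from the one in tessdata_best.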
@Shreeshrii Thank you so much for your reply. I don't understand what happened before; I just cloned the best repository again, and now the second and third phases work fine.
But the first problem still exists with the new clone:
I just copied fas.traineddata from best to my tesseract/tessdata directory and executed the command below to generate the new tif file for the new font:
../tesstrain/src/training/tesstrain.py --fonts_dir fonts \
--fontlist 'B Nazanin' \
--ptsize 20 \
--lang fas \
--linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tesseract/tessdata \
--save_box_tiff \
--maxpages 10 \
--output_dir train
The returned error is:
[20:53:28] INFO - Log file location: /tmp/fas-2022-01-0474uqtjuu/tesstrain.log
[20:53:28] INFO - === Starting training for language fas
[20:53:28] INFO - Testing font: B Nazanin
[20:53:29] INFO - === Phase I: Generating training images ===
0%| | 0/1 [00:00<?, ?it/s][20:53:29] INFO - Rendering using B Nazanin
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.35s/it]
[20:53:41] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[20:53:41] INFO - === Phase E: Generating lstmf files ===
[20:53:41] INFO - Using fas.config
[20:53:41] INFO - Using TESSDATA_PREFIX=tesseract/tessdata
0%| | 0/1 [00:00<?, ?it/s][20:53:42] ERROR - Page 1
Failed to read boxes from /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Error during processing.
[20:53:42] CRITICAL - Program /usr/bin/tesseract failed with return code 1. Abort.
0%| | 0/1 [00:01<?, ?it/s]
Temporary files retained at: /tmp/fas-2022-01-0474uqtjuu
Log file:
[2022-01-04 20:53:28,307] - INFO - root - === Starting training for language fas
[2022-01-04 20:53:28,307] - DEBUG - language_specific - ambigs_filter_denominator = 100000
[2022-01-04 20:53:28,308] - DEBUG - language_specific - bigram_dawg_factor = 0.015
[2022-01-04 20:53:28,308] - DEBUG - language_specific - exposures = [0] (was None)
[2022-01-04 20:53:28,308] - DEBUG - language_specific - filter_arguments = []
[2022-01-04 20:53:28,308] - DEBUG - language_specific - fonts = ['B Nazanin'] (set on cmdline)
[2022-01-04 20:53:28,309] - DEBUG - language_specific - fragments_disabled = y
[2022-01-04 20:53:28,309] - DEBUG - language_specific - generate_word_bigrams = None
[2022-01-04 20:53:28,309] - DEBUG - language_specific - lang_is_rtl = True
[2022-01-04 20:53:28,309] - DEBUG - language_specific - leading = 32
[2022-01-04 20:53:28,309] - DEBUG - language_specific - mean_count = 40
[2022-01-04 20:53:28,309] - DEBUG - language_specific - mix_lang = eng
[2022-01-04 20:53:28,309] - DEBUG - language_specific - norm_mode = 2
[2022-01-04 20:53:28,310] - DEBUG - language_specific - number_dawg_factor = 0.125
[2022-01-04 20:53:28,310] - DEBUG - language_specific - punc_dawg_factor = None
[2022-01-04 20:53:28,310] - DEBUG - language_specific - run_shape_clustering = False (set on cmdline)
[2022-01-04 20:53:28,310] - DEBUG - language_specific - text2image_extra_args = []
[2022-01-04 20:53:28,310] - DEBUG - language_specific - text_corpus = /fas.corpus.txt
[2022-01-04 20:53:28,310] - DEBUG - language_specific - training_data_arguments = []
[2022-01-04 20:53:28,311] - DEBUG - language_specific - word_dawg_factor = 0.05
[2022-01-04 20:53:28,311] - DEBUG - language_specific - word_dawg_size = None
[2022-01-04 20:53:28,311] - DEBUG - language_specific - wordlist2dawg_arguments =
[2022-01-04 20:53:28,312] - INFO - tesstrain_utils - Testing font: B Nazanin
[2022-01-04 20:53:28,312] - DEBUG - tesstrain_utils - Running /usr/bin/text2image
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --fonts_dir=fonts
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --font=B Nazanin
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --outputbase=/tmp/font_tmp2ur0uyxt/sample_text.txt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --text=/tmp/font_tmp2ur0uyxt/sample_text.txt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --fontconfig_tmpdir=/tmp/font_tmp2ur0uyxt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --ptsize=20
[2022-01-04 20:53:29,624] - DEBUG - /usr/bin/text2image - Stripped 1 unrenderable words
Rendered page 0 to file /tmp/font_tmp2ur0uyxt/sample_text.txt.tif
[2022-01-04 20:53:29,625] - INFO - tesstrain_utils - === Phase I: Generating training images ===
[2022-01-04 20:53:29,658] - INFO - tesstrain_utils - Rendering using B Nazanin
[2022-01-04 20:53:29,659] - DEBUG - tesstrain_utils - Running /usr/bin/text2image
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --fontconfig_tmpdir=/tmp/font_tmp2ur0uyxt
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --fonts_dir=fonts
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --strip_unrenderable_words
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --leading=32
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --char_spacing=0.0
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --exposure=0
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --outputbase=/tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --max_pages=10
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --font=B Nazanin
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --text=langdata_lstm/fas/fas.training_text
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --ptsize=20
[2022-01-04 20:53:40,998] - DEBUG - /usr/bin/text2image - Stripped 25 unrenderable words
Rendered page 0 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 18 unrenderable words
Rendered page 1 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 2 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 26 unrenderable words
Rendered page 3 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 23 unrenderable words
Error in boxCreate: y < 0 and box off +quad
Rendered page 4 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 27 unrenderable words
Rendered page 5 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 27 unrenderable words
Rendered page 6 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 21 unrenderable words
Rendered page 7 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 8 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 9 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
[2022-01-04 20:53:41,003] - INFO - tesstrain_utils - === Phase UP: Generating unicharset and unichar properties files ===
[2022-01-04 20:53:41,005] - DEBUG - tesstrain_utils - Running /usr/bin/unicharset_extractor
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - --output_unicharset
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - --norm_mode
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - 2
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.box
[2022-01-04 20:53:41,026] - DEBUG - /usr/bin/unicharset_extractor - Failed to read data from: /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.box
Wrote unicharset file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - Running /usr/bin/set_unicharset_properties
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - -U
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - -O
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - -X
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.xheights
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - --script_dir=langdata_lstm
[2022-01-04 20:53:41,082] - DEBUG - /usr/bin/set_unicharset_properties - Loaded unicharset of size 3 from file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,083] - INFO - tesstrain_utils - === Phase E: Generating lstmf files ===
[2022-01-04 20:53:41,083] - DEBUG - tesstrain_utils - [PosixPath('/tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif')]
[2022-01-04 20:53:41,084] - INFO - tesstrain_utils - Using fas.config
[2022-01-04 20:53:41,084] - INFO - tesstrain_utils - Using TESSDATA_PREFIX=tesseract/tessdata
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - Running /usr/bin/tesseract
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0
[2022-01-04 20:53:41,087] - DEBUG - tesstrain_utils - lstm.train
[2022-01-04 20:53:41,087] - DEBUG - tesstrain_utils - langdata_lstm/fas/fas.config
[2022-01-04 20:53:42,343] - ERROR - /usr/bin/tesseract - Page 1
Failed to read boxes from /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Error during processing.
[2022-01-04 20:53:42,344] - CRITICAL - tesstrain_utils - Program /usr/bin/tesseract failed with return code 1. Abort.
Would you please check this error?
I appreciate your help 🙇‍♂️
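The log above already hints at where things go wrong: text2image strips many words on every page (224 in total over the ten pages) and then reports "Null box at index 0" before the box file is written. A small sketch for pulling those signals out of a tesstrain.log follows; the function name and the returned keys are illustrative choices, and the patterns only cover the message wording seen in this log.

```python
import re


def summarize_text2image_log(log: str) -> dict:
    """Count rendered pages, stripped words, and error lines in a log excerpt."""
    stripped = [int(n) for n in re.findall(r"Stripped (\d+) unrenderable words", log)]
    pages = re.findall(r"Rendered page (\d+) to file", log)
    errors = [
        line.strip()
        for line in log.splitlines()
        if line.startswith(("Error", "Null box"))
    ]
    return {
        "pages": len(pages),
        "stripped_total": sum(stripped),
        "errors": errors,
    }


# Short excerpt copied from the log above.
SAMPLE = """\
[2022-01-04 20:53:40,998] - DEBUG - /usr/bin/text2image - Stripped 25 unrenderable words
Rendered page 0 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 23 unrenderable words
Error in boxCreate: y < 0 and box off +quad
Rendered page 4 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
"""

summary = summarize_text2image_log(SAMPLE)
print(summary["pages"], summary["stripped_total"])
for message in summary["errors"]:
    print(message)
```

A high stripped-word count usually means the font lacks glyphs for part of the training text, which is worth checking separately from the box-file error.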
Your training_text has very long lines as well as English text.
Why don't you test by using the fas.training_text from langdata repo which will be a smaller file and see if that works.
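To check a training text for the two things mentioned here, very long lines and embedded English, something like the following sketch could be used. The 100-character threshold is an arbitrary assumption rather than a documented Tesseract limit, and `audit_training_text` is a hypothetical helper name.

```python
import re

LATIN_LETTER = re.compile(r"[A-Za-z]")


def audit_training_text(text: str, max_chars: int = 100) -> dict[str, list[int]]:
    """Return 1-based line numbers that are too long or contain Latin letters."""
    long_lines: list[int] = []
    latin_lines: list[int] = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_chars:  # assumed threshold, adjust as needed
            long_lines.append(number)
        if LATIN_LETTER.search(line):
            latin_lines.append(number)
    return {"long": long_lines, "latin": latin_lines}


# Three made-up lines: short Persian, Persian mixed with English, very long.
sample_text = "\n".join([
    "\u0633\u0644\u0627\u0645",
    "\u0633\u0644\u0627\u0645 hello",
    "\u0627" * 150,
])
print(audit_training_text(sample_text))
```

For a real file, the text would be read with `open(path, encoding="utf-8").read()` and passed to the same function.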
I am only using the files from langdata_lstm; the error happens with the repository above.
Do you mean I should use this repo instead: langdata?
The problem is with the text2image program. See the outstanding issues in the tesseract repo: https://github.com/tesseract-ocr/tesseract/issues/3563
@Shreeshrii Thanks for your help, I will continue in that thread.
Hello. I'm a little bit confused about combine_lang_model. This documentation https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#training-text-requirements says that combine_lang_model extracts data from a unicharset file: "A new tool: combine_lang_model is provided to make a starter traineddata from a unicharset and optional wordlists." I hope this can help you in some way.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Waiting for linked issue investigation result
Hi, could you please explain how you resolved the "Encoding of string failed! Failure bytes" error?
As you can see from the history of this issue, the problem is a bug in the Tesseract core. The related issue was opened in the main repository and is linked here; you can follow up in that thread.
Waiting for referenced issue
Still waiting
Waiting for referenced issue
Still waiting for response
Still waiting for response