
Cannot build the new traineddata

Open joecheung051 opened this issue 1 year ago • 0 comments

Environment

  • Tesseract Version: <Version('5.2.0.20220712')>
  • Commit Number:
  • Platform: Windows 10 Home, 64-bit

Current Behavior:

=== Starting training for language 'chi_tra'
[Tue Aug 2 16:44:08 2022] /c/Program Files/Tesseract-OCR/text2image --fonts_dir=fonts --font=PMingLiU Book --outputbase=/tmp/font_tmp.S3O7LNYjEB/sample_text.txt --text=/tmp/font_tmp.S3O7LNYjEB/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.S3O7LNYjEB
Rendered page 0 to file C:/Users/joe_c/AppData/Local/Temp/font_tmp.S3O7LNYjEB/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using PMingLiU Book
[Tue Aug 2 16:44:10 2022] /c/Program Files/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.S3O7LNYjEB --fonts_dir=fonts --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0 --max_pages=10 --font=PMingLiU Book --text=langdata_lstm/chi_tra/chi_tra.training_text
Rendered page 0 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Rendered page 2 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Rendered page 3 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Rendered page 4 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Stripped 2 unrenderable words
Rendered page 5 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Rendered page 6 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Rendered page 7 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Rendered page 8 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif
Rendered page 9 to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue Aug 2 16:44:15 2022] /c/Program Files/Tesseract-OCR/unicharset_extractor --output_unicharset /tmp/chi_tra-2022-08-02.2wo/chi_tra.unicharset --norm_mode 1 /tmp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.box
Extracting unicharset from box file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.box
Mirror 〕 of 〔 is not in unicharset
Wrote unicharset file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.unicharset
[Tue Aug 2 16:44:15 2022] /c/Program Files/Tesseract-OCR/set_unicharset_properties -U /tmp/chi_tra-2022-08-02.2wo/chi_tra.unicharset -O /tmp/chi_tra-2022-08-02.2wo/chi_tra.unicharset -X /tmp/chi_tra-2022-08-02.2wo/chi_tra.xheights --script_dir=langdata_lstm
Loaded unicharset of size 2849 from file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.unicharset
Setting unichar properties
Mirror 〕 of 〔 is not in unicharset
Setting script properties
Warning: properties incomplete for index 2515 = ,
Writing unicharset to file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=tessdata
[Tue Aug 2 16:44:16 2022] /c/Program Files/Tesseract-OCR/tesseract /tmp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif /tmp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0 --psm 6 lstm.train langdata_lstm/chi_tra/chi_tra.config
Page 1
Page 2
Loaded 36/36 lines (1-36) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 3
Loaded 71/71 lines (1-71) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 4
Loaded 105/105 lines (1-105) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 5
Loaded 140/140 lines (1-140) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 6
Loaded 175/175 lines (1-175) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 7
Loaded 210/210 lines (1-210) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 8
Loaded 246/246 lines (1-246) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 9
Loaded 281/281 lines (1-281) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf
Page 10
Loaded 315/315 lines (1-315) of document C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf

=== Constructing LSTM training data ===
[Tue Aug 2 16:44:20 2022] /c/Program Files/Tesseract-OCR/combine_lang_model --input_unicharset /tmp/chi_tra-2022-08-02.2wo/chi_tra.unicharset --script_dir langdata_lstm --words langdata_lstm/chi_tra/chi_tra.wordlist --numbers langdata_lstm/chi_tra/chi_tra.numbers --puncs langdata_lstm/chi_tra/chi_tra.punc --output_dir train --lang chi_tra
Loaded unicharset of size 2849 from file C:/Users/joe_c/AppData/Local/Temp/chi_tra-2022-08-02.2wo/chi_tra.unicharset
Setting unichar properties
Mirror 〕 of 〔 is not in unicharset
Setting script properties
Warning: properties incomplete for index 2515 = ,
Config file is optional, continuing...
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg

=== Saving box/tiff pairs for training data ===
Moving /tmp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.box to train
Moving /tmp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.tif to train

=== Moving lstmf files for training data ===
Moving /tmp/chi_tra-2022-08-02.2wo/chi_tra.PMingLiU_Book.exp0.lstmf to train

Created starter traineddata for LSTM training of language 'chi_tra'

Run 'lstmtraining' command to continue LSTM training for language 'chi_tra'

Extracting tessdata components from tessdata/chi_tra.traineddata
Wrote train/chi_tra.lstm
Version:
0:config:size=2043, offset=192
17:lstm:size=12165163, offset=2235
18:lstm-punc-dawg:size=2602, offset=12167398
19:lstm-word-dawg:size=435354, offset=12170000
20:lstm-number-dawg:size=82, offset=12605354
21:lstm-unicharset:size=295682, offset=12605436
22:lstm-recoder:size=84529, offset=12901118
23:version:size=151, offset=12985647
Must provide a --traineddata see training documentation
Must provide a --traineddata see training documentation

After the run, the output folder is empty.
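
The last two log lines suggest that `lstmtraining` was started without a `--traineddata` argument. For reference, here is a minimal sketch of how such a command could be assembled so that the flag is always present. The paths in the usage part are assumptions, not taken from a working setup: `train/chi_tra/chi_tra.traineddata` is where `combine_lang_model --output_dir train --lang chi_tra` would normally write the starter file, `train/chi_tra.training_files.txt` and `output/chi_tra` are guesses, and the helper names `build_lstmtraining_cmd` and `missing_inputs` are made up for illustration. Only `train/chi_tra.lstm` and `tessdata/chi_tra.traineddata` appear in the log above.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def build_lstmtraining_cmd(
    traineddata: Path,
    continue_from: Path,
    train_listfile: Path,
    model_output: Path,
    old_traineddata: Path | None = None,
    max_iterations: int = 400,
) -> list[str]:
    """Assemble an lstmtraining argument list that always carries --traineddata."""
    cmd = [
        "lstmtraining",
        "--traineddata", traineddata.as_posix(),
        "--continue_from", continue_from.as_posix(),
        "--train_listfile", train_listfile.as_posix(),
        "--model_output", model_output.as_posix(),
        "--max_iterations", str(max_iterations),
    ]
    # Only needed when the new unicharset differs from the one in the base model.
    if old_traineddata is not None:
        cmd += ["--old_traineddata", old_traineddata.as_posix()]
    return cmd


def missing_inputs(*paths: Path) -> list[Path]:
    """Return the input paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]


if __name__ == "__main__":
    # All paths below are assumptions; adjust them to the real training folder.
    starter = Path("train/chi_tra/chi_tra.traineddata")
    checkpoint = Path("train/chi_tra.lstm")
    listfile = Path("train/chi_tra.training_files.txt")

    cmd = build_lstmtraining_cmd(
        traineddata=starter,
        continue_from=checkpoint,
        train_listfile=listfile,
        model_output=Path("output/chi_tra"),
        old_traineddata=Path("tessdata/chi_tra.traineddata"),
    )
    # Print instead of running, so the command can be checked first.
    print(shlex.join(cmd))
    for path in missing_inputs(starter, checkpoint, listfile):
        print(f"missing input: {path}")
```

The script only prints the command and any missing input files; it does not start training itself. If the starter traineddata is reported missing, that would point at the `combine_lang_model` step rather than at `lstmtraining`.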

Expected Behavior:

The new traineddata file should be built.

Suggested Fix:

joecheung051, Aug 02 '22 08:08