tesstrain
Training with Python: how to run the training step?
As mentioned by @stefan6419846 in https://github.com/madmaze/pytesseract/issues/508, there is a Python wrapper for training in tesstrain/src/, which unfortunately is not documented in the tesseract, tessdoc, or tesstrain repositories.
From my understanding (please correct me if I'm wrong):
- It only generates lstmf files and does not perform any training. Of the steps listed in Overview of Training Process, it stops at step 5; steps 6 and 7 must be done separately. Is that correct?
- How should steps 6 and 7 be performed? With the Makefile commands? If you give me some pointers, I can help add these steps to the Python script.
- The Python script takes a TEXTFILE and generates (for each font) box/tif/lstmf files for the whole text, not line by line. So, to generate them line by line, must we run the script once per one-line file?
Thanks in advance !
Cc: @stefan6419846
tesstrain basically creates artificial training data, for example for fine-tuning with a specific font. You might find some existing examples using the old tesstrain.sh script, which should be roughly equivalent to tesstrain. The Makefile approach is for "real" data only.
Rough steps for the Python module:
- Extract LSTM file:
  combine_tessdata -e tessdata/eng.traineddata eng.lstm
- Generate files:
  tesstrain.run(
      fonts_directory=fonts_directory,
      fonts=[font_name],
      language_code='eng',
      linedata_only=True,
      langdata_directory=language_data_directory,
      tessdata_directory=tessdata_directory,
      save_box_tiff=True,
      maximum_pages=maximum_pages,
      output_directory=output_directory,
  )
- Finetune:
  lstmtraining --continue_from eng.lstm --model_output font_name --traineddata tessdata/eng.traineddata --train_listfile eng.training_files.txt --max_iterations 10
- Convert to .traineddata file:
  lstmtraining --stop_training --continue_from font_name_checkpoint --traineddata tessdata/eng.traineddata --model_output target_path
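
For anyone who wants to drive all four steps from one Python script, here is a minimal sketch that chains them with subprocess. The command lines and flags are copied from the steps above. The helper names (extract_lstm_cmd, finetune_cmd, export_cmd, run_steps) are made up for illustration and are not part of tesstrain, and the generate callable is assumed to wrap the tesstrain.run(...) call from the second step. Paths such as eng.training_files.txt will likely need adjusting to wherever your output_directory ends up.

```python
import subprocess


def extract_lstm_cmd(traineddata, lstm_out):
    # Step 1: extract the LSTM model from an existing .traineddata file.
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_cmd(lstm, model_output, traineddata, listfile, max_iterations):
    # Step 3: continue training from the extracted LSTM model.
    return [
        "lstmtraining",
        "--continue_from", lstm,
        "--model_output", model_output,
        "--traineddata", traineddata,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]


def export_cmd(checkpoint, traineddata, target_path):
    # Step 4: convert the training checkpoint into a .traineddata file.
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", target_path,
    ]


def run_steps(font_name, generate, target_path,
              traineddata="tessdata/eng.traineddata",
              lstm="eng.lstm",
              listfile="eng.training_files.txt",
              max_iterations=10,
              run_command=None):
    """Run the four rough steps in order.

    `generate` is a hypothetical zero-argument callable that performs
    step 2, e.g. a lambda wrapping the tesstrain.run(...) call.
    `run_command` can be swapped out for a dry run.
    """
    if run_command is None:
        def run_command(cmd):
            subprocess.run(cmd, check=True)

    run_command(extract_lstm_cmd(traineddata, lstm))
    generate()
    run_command(finetune_cmd(lstm, font_name, traineddata, listfile, max_iterations))
    # The checkpoint name follows the step above: <model_output>_checkpoint.
    run_command(export_cmd(font_name + "_checkpoint", traineddata, target_path))
```

Passing run_command=print (with a no-op generate) gives a dry run that only prints the commands, which is handy for checking paths before anything is executed.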