
Training with Python: run training step?

Open forzagreen opened this issue 1 year ago • 1 comments

As mentioned by @stefan6419846 in https://github.com/madmaze/pytesseract/issues/508, there is a Python wrapper for training in tesstrain/src/, which unfortunately is not documented in the tesseract, tessdoc, or tesstrain repositories.

From my understanding (please correct me if I'm wrong):

  1. It only generates lstmf files and does not perform any training. Of the steps listed in "Overview of Training Process", it stops at step 5. Steps 6 and 7 must be done separately. Is that correct?

  2. How do I perform steps 6 and 7? With Makefile commands? If you give me some pointers, I can help add these steps to the Python script.

  3. The Python script takes a TEXTFILE and generates (for each font) box/tif/lstmf files for the whole text, not line by line. So, to generate them line by line, must we run the script once per one-line file?

Thanks in advance!

Cc: @stefan6419846

forzagreen avatar Sep 18 '23 07:09 forzagreen

tesstrain basically creates artificial training data, for example for fine-tuning with a specific font. You might find existing examples that use the old tesstrain.sh script, which should be roughly equivalent to the Python module. The Makefile approach is for "real" data only.

Rough steps for the Python module:

  1. Extract LSTM file: combine_tessdata -e tessdata/eng.traineddata eng.lstm

  2. Generate files:

    import tesstrain

    # fonts_directory, font_name, etc. are placeholders you define beforehand
    tesstrain.run(
        fonts_directory=fonts_directory,
        fonts=[font_name],
        language_code='eng',
        linedata_only=True,
        langdata_directory=language_data_directory,
        tessdata_directory=tessdata_directory,
        save_box_tiff=True,
        maximum_pages=maximum_pages,
        output_directory=output_directory
    )
    
  3. Finetune: lstmtraining --continue_from eng.lstm --model_output font_name --traineddata tessdata/eng.traineddata --train_listfile eng.training_files.txt --max_iterations 10

  4. Convert to .traineddata file: lstmtraining --stop_training --continue_from font_name_checkpoint --traineddata tessdata/eng.traineddata --model_output target_path
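
The four steps above could be chained from Python roughly as follows. This is a minimal sketch, not part of tesstrain: the helper names (extract_lstm_cmd, finetune_cmd, stop_training_cmd, run_pipeline) are hypothetical, and the command-line flags are copied from the steps above. It assumes that combine_tessdata and lstmtraining are on PATH, that tesstrain.run accepts the keyword arguments shown in step 2, and that the training list file ends up in output_directory as <language_code>.training_files.txt.

```python
import subprocess
from pathlib import Path


def extract_lstm_cmd(traineddata, lstm_out):
    """Step 1: extract the LSTM model from an existing .traineddata file."""
    return ["combine_tessdata", "-e", str(traineddata), str(lstm_out)]


def finetune_cmd(lstm, model_output, traineddata, listfile, max_iterations=10):
    """Step 3: continue training from the extracted LSTM model."""
    return [
        "lstmtraining",
        "--continue_from", str(lstm),
        "--model_output", str(model_output),
        "--traineddata", str(traineddata),
        "--train_listfile", str(listfile),
        "--max_iterations", str(max_iterations),
    ]


def stop_training_cmd(checkpoint, traineddata, target_path):
    """Step 4: convert a training checkpoint into a .traineddata file."""
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", str(checkpoint),
        "--traineddata", str(traineddata),
        "--model_output", str(target_path),
    ]


def run_pipeline(fonts_directory, font_name, langdata_directory,
                 tessdata_directory, output_directory, target_path,
                 maximum_pages, language_code="eng", max_iterations=10):
    """Hypothetical driver that runs steps 1 to 4 in order."""
    # Imported here so the command builders above work without tesstrain.
    import tesstrain

    traineddata = Path(tessdata_directory) / f"{language_code}.traineddata"
    lstm = Path(f"{language_code}.lstm")  # written to the working directory

    # Step 1: extract the LSTM model.
    subprocess.run(extract_lstm_cmd(traineddata, lstm), check=True)

    # Step 2: generate the training files (same arguments as in step 2 above).
    tesstrain.run(
        fonts_directory=fonts_directory,
        fonts=[font_name],
        language_code=language_code,
        linedata_only=True,
        langdata_directory=langdata_directory,
        tessdata_directory=tessdata_directory,
        save_box_tiff=True,
        maximum_pages=maximum_pages,
        output_directory=output_directory,
    )

    # Assumption: the list of generated files is written to output_directory.
    listfile = Path(output_directory) / f"{language_code}.training_files.txt"

    # Step 3: fine-tune; the checkpoint name below follows step 4 above.
    subprocess.run(
        finetune_cmd(lstm, font_name, traineddata, listfile, max_iterations),
        check=True,
    )

    # Step 4: convert the checkpoint to a .traineddata file.
    subprocess.run(
        stop_training_cmd(f"{font_name}_checkpoint", traineddata, target_path),
        check=True,
    )
```

Keeping the command construction in separate functions makes it possible to print or inspect each command before running anything, which helps when a path turns out to be different on your system.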

stefan6419846 avatar Sep 18 '23 08:09 stefan6419846