
Continuation of an interrupted training

Open amitdo opened this issue 2 years ago • 4 comments

From issue #3560:

@stweil commented,

There are other training aspects which I consider more important.

One is continuation of an interrupted training. That should start with the line following the last one which was used for training. I'm afraid that it currently starts with the first line, so a training which was interrupted is different compared to a training which runs without any interrupt.

amitdo avatar Nov 01 '22 06:11 amitdo

Is this about interrupted trainings, as in the user hits Ctrl+Z (or Ctrl+C)? Or does the issue include all types of interruptions, such as training step-wise (first round 10,000 iterations, then the next round 100,000 iterations, etc.)?

I hope the issue affects only cases that are explicitly interrupted by the user. If it affects the latter case as well, we should have been warned, because it has been a rule of thumb to train incrementally (slowly increasing the iterations).

DesBw avatar Oct 29 '23 13:10 DesBw

It always starts with the first training data, no matter how it terminated before. Therefore I now always use epochs instead of iterations. Other OCR software always uses epochs and also saves the intermediate models after each epoch. That's what I do when training Tesseract models, too.

So if you have a large data set for training, start with the first epoch (--max_iterations -1), save the final model, restart the training for the 2nd epoch (--max_iterations -2), save the final model again and so on.
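A minimal sketch of that epoch-by-epoch loop in Python (the paths and file names below are placeholders, not part of any official workflow; it relies on the behavior described above, where a negative `--max_iterations` value counts epochs and a restarted run picks up from the existing checkpoint):

```python
import subprocess  # used by the commented-out loop below


def epoch_command(epoch, traineddata, listfile, model_output):
    """Build one lstmtraining invocation.

    A negative --max_iterations value makes lstmtraining stop after
    that many passes (epochs) over the training data, and a restarted
    run resumes from <model_output>_checkpoint if it exists.
    """
    return [
        "lstmtraining",
        f"--traineddata={traineddata}",
        f"--train_listfile={listfile}",
        f"--model_output={model_output}",
        f"--max_iterations={-epoch}",
    ]


# Run the epochs one after another, saving a copy of the checkpoint
# between runs if you want to keep each intermediate model, e.g.:
# for epoch in range(1, 4):
#     subprocess.run(epoch_command(epoch, "data/foo/foo.traineddata",
#                                  "data/foo/list.train",
#                                  "data/checkpoints/foo"), check=True)
```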

stweil avatar Oct 30 '23 13:10 stweil

Well, I didn't know that. I have been reading different forums, and people just say I can start from a small number of iterations and slowly increase it. This is very important information that anyone trying to train Tesseract needs to be told.

  • That means we have been repeatedly training on the same lines, leaving out others. I now understand why I was getting a 0% error rate before I ran as many iterations as the number of text lines.

Thank you for the clarification. I hope you will improve Tesseract in future releases so that it continues where it stopped. (Or somebody could write a simple Python script to continue from where it stopped, if that is possible. I have been using a Python script to resume text2image from where it stopped.)

DesBw avatar Oct 30 '23 14:10 DesBw

One way to do it is to add number sequences (as prefixes or suffixes) to the lstmf files; that is, make the tool that generates the lstmf (training) files add a prefix or suffix. Then we can use external scripts to control the starting and stopping files based on those numbers.

The following Python script does a similar thing, but for the generation of the box and tif files (text2image), using the line numbers in the text file.

```python
import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    os.makedirs(output_directory, exist_ok=True)

    # Default to the full file if no range is given.
    if start_line is None:
        start_line = 0
    if end_line is None:
        end_line = len(lines) - 1

    training_text_file_name = pathlib.Path(training_text_file).stem

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            # Embed the line number in the file name so an external
            # script can resume from a given line later.
            file_base_name = f'{training_text_file_name}_{line_index}_{font_name.replace(" ", "_")}'

            # Write the ground-truth text for this line.
            line_gt_text = os.path.join(output_directory, f'{file_base_name}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.write(line)

            # Render the matching tif/box pair with text2image.
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/foo.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line number (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line number (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/foo.training_text'
    output_directory = 'tesstrain/data/foo-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

```

With this script, the user can explicitly control the starting and ending lines.
The same kind of system could be implemented if we had number sequences in the lstmf (training) file names.
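As a sketch of what that could look like, assuming the .lstmf file names carry a `_<number>_` serial as proposed above (this is not something the training tools produce today), an external script could build a `--train_listfile` for lstmtraining from just a slice of the data:

```python
import pathlib
import re

def write_listfile(lstmf_dir, listfile, start, end):
    """Collect .lstmf files whose embedded serial number (the
    '_<number>_' part of e.g. foo_42_SomeFont.lstmf) lies in
    [start, end], and write them to a listfile for lstmtraining."""
    pattern = re.compile(r"_(\d+)_")
    selected = []
    for path in sorted(pathlib.Path(lstmf_dir).glob("*.lstmf")):
        match = pattern.search(path.name)
        if match and start <= int(match.group(1)) <= end:
            selected.append(str(path))
    pathlib.Path(listfile).write_text("\n".join(selected) + "\n")
    return selected
```

Training could then be stopped and resumed on an explicit range of lines, the same way the text2image script above does.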

DesBw avatar Oct 31 '23 06:10 DesBw