tesstrain How to do incremental-training on tesseract-ocr?

Open zaryabRiasat opened this issue 8 months ago • 5 comments

I'm working with tesseract-4.1.1 and trying to fine-tune a model. I followed these steps:

I downloaded eng.traineddata from tessdata_best and copied it into /usr/share/tesseract-ocr/4.00/tessdata.
Then I created image crops using craft-text-detector in Python and wrote a ground-truth file (.gt.txt) for each image crop.
Then I ran git clone https://github.com/tesseract-ocr/ocrd-train.git followed by cd ocrd-train.
Inside the ocrd-train/data folder, I created a my-model-ground-truth folder and copied the .png and .gt.txt files into it.
Then I ran make tesseract-langdata in the terminal.
Finally, I ran make training MODEL_NAME=my-model MAX_ITERATIONS=20000 PSM=7 FINETUNE_TYPE=Impact DEBUG_INTERVAL=-1 START_MODEL=eng TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata/
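Since the training step pairs each .png crop with a .gt.txt file of the same base name, it can help to check the ground-truth folder for unpaired files before running make training. The following is a minimal sketch, not part of ocrd-train; the function name find_unpaired is hypothetical, and the folder path is taken from the steps above.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Return two sorted lists of base names:
    images that have no .gt.txt, and .gt.txt files that have no image."""
    root = Path(ground_truth_dir)
    # Strip the full suffix so "line1.png" and "line1.gt.txt" both map to "line1".
    images = {p.name[: -len(".png")] for p in root.glob("*.png")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)


if __name__ == "__main__":
    # Path assumed from the steps above; run from inside ocrd-train.
    missing_text, missing_image = find_unpaired("data/my-model-ground-truth")
    print("images without transcription:", missing_text)
    print("transcriptions without image:", missing_image)
```

Both lists should be empty before training starts.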

This procedure took some time and produced a my-model.traineddata file in ocrd-train/data/. I copied that file into /usr/share/tesseract-ocr/4.00/tessdata, and it gives better results than eng.traineddata.

I used 20 images for this training. Now I want to do incremental training: I want to train the previously trained my-model.traineddata on 30 more images. I'm confused because, after the previous training completed, ocrd-train/data/ contains the following:

my-model (folder)
my-model-ground-truth (folder)
eng (folder)
langdata (folder)
my-model.traineddata (file)

Now what should I do for incremental-training?

Do I only need to remove the files in my-model-ground-truth, paste in the new .png and .gt.txt files for the 30 images, and use my-model as START_MODEL?

Or do I need to remove the other folders as well?
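To make the first option concrete, here is a rough sketch of how the second-round command might be assembled. It only reuses the variables from the original make training call with START_MODEL switched to my-model. Whether this is the correct procedure is exactly the open question, so treat it as an unverified assumption. The helper build_training_command and the output name my-model-v2 are hypothetical. A new MODEL_NAME would also imply a matching data/my-model-v2-ground-truth folder for the 30 new images.

```python
import shlex


def build_training_command(model_name, start_model, tessdata, max_iterations=20000):
    """Assemble the `make training` invocation as a list of arguments.

    The remaining variables are copied unchanged from the first training run.
    """
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
        "PSM=7",
        "FINETUNE_TYPE=Impact",
        "DEBUG_INTERVAL=-1",
    ]


if __name__ == "__main__":
    cmd = build_training_command(
        model_name="my-model-v2",  # hypothetical name for the second round
        start_model="my-model",    # assumes my-model.traineddata is in TESSDATA
        tessdata="/usr/share/tesseract-ocr/4.00/tessdata/",
    )
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

The script only prints the command so it can be inspected before anything is run.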

May 31 '24 07:05 zaryabRiasat
