How to do incremental-training on tesseract-ocr?
I'm working with tesseract-4.1.1 and trying to fine-tune a model. So far I have followed these steps:
- Downloaded eng.traineddata from tessdata_best and pasted it into /usr/share/tesseract-ocr/4.00/tessdata.
- Created image crops using craft-text-detector in Python and made a ground-truth file (.gt.txt) for each image crop.
- Cloned the repository with git clone https://github.com/tesseract-ocr/ocrd-train.git and then ran cd ocrd-train.
- Inside the ocrd-train/data folder, created a my-model-ground-truth folder and pasted the .png and .gt.txt files into it.
- Ran the command make tesseract-langdata in the terminal.
- Finally, ran the command:
  make training MODEL_NAME=my-model MAX_ITERATIONS=20000 PSM=7 FINETUNE_TYPE=Impact DEBUG_INTERVAL=-1 START_MODEL=eng TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata/
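Since the training step relies on every image crop having a matching transcription, a quick check of the ground-truth folder can help before running make. The following is only a minimal sketch: the function name find_unpaired is hypothetical, and it assumes the naming used above, where a crop called x.png is paired with x.gt.txt in the same folder.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Return (image stems without a .gt.txt, .gt.txt stems without a .png).

    Assumes the pairing convention x.png <-> x.gt.txt in one flat folder.
    """
    root = Path(ground_truth_dir)
    # Strip the full suffix by length, because ".gt.txt" is a double extension.
    image_stems = {p.name[: -len(".png")] for p in root.glob("*.png")}
    text_stems = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    missing_text = sorted(image_stems - text_stems)
    missing_image = sorted(text_stems - image_stems)
    return missing_text, missing_image
```

Calling find_unpaired("data/my-model-ground-truth") from inside ocrd-train would return two lists, both of which should be empty if every crop has a transcription and vice versa.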
The above procedure took some time, and I got a my-model.traineddata file in ocrd-train/data/. I pasted that file into /usr/share/tesseract-ocr/4.00/tessdata, and it gives better results than eng.traineddata.
For the above training I used 20 images. Now I want to do incremental training: I want to train on 30 more images, starting from the previously trained my-model.traineddata. Here I'm confused, because after the previous training completed there are several folders and a file in ocrd-train/data/:
- my-model (folder)
- my-model-ground-truth (folder)
- eng (folder)
- langdata (folder)
- my-model.traineddata (file)
Now, what should I do for incremental training?
Do I only need to remove the files in my-model-ground-truth, paste in the new .png and .gt.txt files for the 30 images, and use my-model as START_MODEL?
Or do I need to remove the other folders as well?
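To make the first option concrete, here is a small sketch that assembles the same make invocation as before with only START_MODEL changed. The helper name build_training_command is hypothetical, and the sketch assumes my-model.traineddata is already present in the TESSDATA directory (as described above). Whether this alone is enough, or whether the other folders must be cleaned first, is exactly what I'm unsure about.

```python
import shlex


def build_training_command(model_name, start_model, tessdata, max_iterations=20000):
    """Assemble the make invocation from the question with a chosen START_MODEL."""
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"MAX_ITERATIONS={max_iterations}",
        "PSM=7",
        "FINETUNE_TYPE=Impact",
        "DEBUG_INTERVAL=-1",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
    ]


# Proposed incremental run: start from the previously trained my-model.
command = build_training_command(
    "my-model", "my-model", "/usr/share/tesseract-ocr/4.00/tessdata/"
)
print(shlex.join(command))
```

The printed line can be compared against the original command to confirm that START_MODEL is the only difference.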