tesstrain
Bad box coordinates in boxfile string!
I have prepared the following ground truth files:
../tesstrain/data/Chechen-ground-truth
|-- 1.box
|-- 1.gt.txt
|-- 1.png
|-- 10.box
|-- 10.gt.txt
|-- 10.png
|-- 11.box
|-- 11.gt.txt
|-- 11.png
|-- 12.box
|-- 12.gt.txt
|-- 12.png
The box files use the WordStr format. Here, for example, is the content of 1.box:
WordStr 65 61 1556 254 0 #НЕКЪАШ А
65 61 1556 254 0
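For readers unfamiliar with the format: each box file line holds a symbol followed by left, bottom, right, top and a page number, with the origin at the bottom-left corner of the image. In WordStr files the transcription of the whole line follows a `#`. The sketch below parses such lines in Python. `BoxLine` and `parse_box_line` are hypothetical names used only for illustration, and this is not Tesseract's own parser, so it will not necessarily accept or reject the same input that Tesseract does.

```python
import re
from typing import NamedTuple


class BoxLine(NamedTuple):
    symbol: str   # a glyph, "WordStr", or "\t" (end-of-line marker)
    left: int
    bottom: int
    right: int
    top: int
    page: int
    text: str     # transcription after '#' for WordStr lines, else ""


# Symbol, separator, then five non-negative integers.
_BOX_RE = re.compile(r"^(?P<symbol>.+?)[ \t]+(?P<nums>\d+ \d+ \d+ \d+ \d+)\s*$")


def parse_box_line(line: str) -> BoxLine:
    """Parse one box file line (illustrative, not Tesseract's parser)."""
    line = line.rstrip("\r\n")
    text = ""
    if line.startswith("WordStr "):
        # Everything after the first '#' is the line transcription.
        line, _, text = line.partition("#")
    match = _BOX_RE.match(line)
    if match is None:
        raise ValueError(f"not a box line: {line!r}")
    left, bottom, right, top, page = map(int, match["nums"].split())
    return BoxLine(match["symbol"], left, bottom, right, top, page, text)


box = parse_box_line("WordStr 65 61 1556 254 0 #НЕКЪАШ А")
print(box.text, box.right - box.left, box.top - box.bottom)
```

For the line above this reports a box 1491 pixels wide and 193 pixels high. Comparing such values against the actual image size is one possible sanity check, although nothing in this thread confirms that this is what triggers the error.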
The file 1.gt.txt then contains the corresponding text:
НЕКЪАШ А
And here is the image:
Running the command make training MODEL_NAME=Chechen START_MODEL=rus TESSDATA=../tesseract/tessdata gives me an error:
set -x; \
tesseract "data/Chechen-ground-truth/1.png" data/Chechen-ground-truth/1 --psm 13 lstm.train
+ tesseract data/Chechen-ground-truth/1.png data/Chechen-ground-truth/1 --psm 13 lstm.train
Bad box coordinates in boxfile string! 65 61 1556 254 0
No block overlapping textline: НЕКЪАШ А
Failed to read pages from data/Chechen-ground-truth/1.png
Error during processing.
make: *** [Makefile:258: data/Chechen-ground-truth/1.lstmf] Error 1
I'm using Tesseract version 5.3.0.
Please have a look at https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip to see how to prepare custom data for training.
@zdenop Thanks for your reply. This data does not provide any box files at all, so how does Tesseract know which character is which?
Did you try to follow the instructions on https://github.com/tesseract-ocr/tesstrain/? As far as I can see, there is no instruction about creating box files ;-)
@zdenop Thanks. After I removed the *.box files from the ground truth folder, the training could start, but the first stage of the training (the tesstrain script) was to create the box files, and the coordinates look weird to me. Here is an example of a box file that tesstrain generated for me:
Н 0 0 209 43 0
Е 0 0 209 43 0
К 0 0 209 43 0
Ъ 0 0 209 43 0
А 0 0 209 43 0
Ш 0 0 209 43 0
0 0 209 43 0
А 0 0 209 43 0
0 0 209 43 0
This was generated for the following image:
I only put the *.png and *.gt.txt files in the ground truth folder. The content of my 1.gt.txt was:
НЕКЪАШ А
I just wonder how it works and whether there is an article about this process. I have not found anything about version 5, which seems relatively new, right? There are a lot of tutorials and examples for version 4, but they are different, and the process is different too.
P.S. The model created by the training was able to recognize characters it did not recognize before the training (I used the rus.traineddata model as the starting point and trained it further).
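On the question of how the generated box file comes about: its content is consistent with a line-level scheme in which every character of the transcription is given the bounding box of the whole image, followed by a tab entry marking the end of the line. As far as I understand, LSTM training works on whole text lines, so per-character positions are not needed. The Python sketch below reproduces that pattern under these assumptions. `line_boxes` is a hypothetical name, the 209x43 image size is inferred from the coordinates shown above, and the helper script that tesstrain actually runs may differ (for example in how it treats combining characters).

```python
from typing import List


def line_boxes(text: str, width: int, height: int, page: int = 0) -> List[str]:
    """Return box lines that give every character the full-image box.

    Illustrative only: one entry per character of the transcription,
    then a tab entry that marks the end of the text line.
    """
    lines = [f"{ch} 0 0 {width} {height} {page}" for ch in text]
    lines.append(f"\t 0 0 {width} {height} {page}")
    return lines


# Assumed image size of 209x43 pixels, taken from the generated box file.
for entry in line_boxes("НЕКЪАШ А", 209, 43):
    print(repr(entry))
```

For the transcription "НЕКЪАШ А" this yields nine entries: eight for the characters (including the space) and one for the trailing tab, which matches the number of lines in the generated file.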
Did you read and follow https://github.com/tesseract-ocr/tesstrain? Where is it written that the first stage is to create box files?
Did you read and follow https://github.com/tesseract-ocr/tesstrain?
Yes.
Where is it written that the first stage is to create box files?
@zdenop Nowhere. tesstrain created the *.box files itself as its first step, and this is not mentioned in tesstrain's README.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.