BCER eval displayed during lstmtraining and that from lstmeval are different
While trying to plot the error rates for training, I have come across an anomaly.
I use the LOG file generated from the messages output during the lstmtraining run, which also reports BCER eval
on completion of each evaluation over the eval list. This message displays the error rates as well as the learning iteration.
I separately run lstmeval on various checkpoint traineddata files to get the error rates.
I have found that the BCER eval displayed during lstmtraining and that from lstmeval are different for the same learning iteration.
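For anyone who wants to extract the same numbers, here is a minimal sketch of parsing the eval messages out of the lstmtraining LOG file. It assumes eval lines of the form `At iteration N, stage S, Eval Char error rate=X, Word error rate=Y` (the exact wording may differ between Tesseract versions), and the log path is a placeholder.

```python
import re

# Assumed format of the eval message printed during lstmtraining; adjust the
# pattern if your Tesseract version words it differently.
EVAL_RE = re.compile(
    r"At iteration (\d+), stage (\d+), "
    r"Eval Char error rate=([\d.]+), Word error rate=([\d.]+)"
)

def parse_eval_lines(log_path):
    """Yield (learning_iteration, stage, char_error, word_error) per eval message."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = EVAL_RE.search(line)
            if match:
                iteration, stage, cer, wer = match.groups()
                yield int(iteration), int(stage), float(cer), float(wer)

if __name__ == "__main__":
    # "training.LOG" is a placeholder path.
    for iteration, stage, cer, wer in parse_eval_lines("training.LOG"):
        print(f"{iteration}\t{cer:.6f}\t{wer:.6f}")
```

Note that these messages only carry the learning iteration, not the training iteration, which matters for the comparison below.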
Below are plots from a recent training, trying to add superscripts to the English traineddata.
The first is the chart generated from the lstmtraining log file, plotting the BCER every 100 iterations, every checkpoint and every eval. Since BCER eval
only reports the learning iteration value, the main x axis plots learning iterations.
The second chart plots the CER values calculated with lstmeval, ISRI OCR evaluation and OCRevaluation for fast traineddata files built from checkpoints with BCER less than 1%, in addition to the BCER from lstmtraining at every 100 iterations and at every checkpoint. It uses training iterations as the main x axis and therefore does not include the evals done during training (which are included in the chart above).
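For illustration, a minimal matplotlib sketch of this kind of chart, assuming the per-100-iteration BCER and the eval BCER have already been collected into parallel lists (the function name and labels are mine, not from any tesstrain plotting script):

```python
import matplotlib.pyplot as plt

def plot_bcer(train_iters, train_bcer, eval_iters, eval_bcer, out_png="bcer.png"):
    """Plot the training BCER (reported every 100 iterations) as a line and the
    BCER eval values as points, with learning iterations on the x axis."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(train_iters, train_bcer, linewidth=0.8, label="BCER every 100 iterations")
    ax.scatter(eval_iters, eval_bcer, color="red", s=12, label="BCER eval during lstmtraining")
    ax.set_xlabel("learning iterations")
    ax.set_ylabel("BCER (%)")
    ax.legend()
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```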
Here is the tsv file for eval during lstmtraining:
Name | CheckpointCER | LearningIteration | TrainingIteration | EvalCER | IterationCER | SubtrainerCER |
---|---|---|---|---|---|---|
197735 | | | | 0.750366 | | |
199063 | | | | 0.687183 | | |
200061 | | | | 0.628007 | | |
201356 | | | | 0.794562 | | |
202510 | | | | 0.916657 | | |
204432 | | | | 0.882159 | | |
210088 | | | | 0.780537 | | |
211706 | | | | 0.798433 | | |
212529 | | | | 0.816408 | | |
238558 | | | | 0.703124 | | |
240123 | | | | 0.788989 | | |
Edited to show a subset of the tsv file.
Here is the tsv file with lstmeval BCER for checkpoints with BCER less than 1%.
Name | CheckpointCER | LearningIteration | TrainingIteration | EvalCER | IterationCER | SubtrainerCER |
---|---|---|---|---|---|---|
145368 | | | 766900 | 1.017711 | | |
161436 | | | 921200 | 0.722449 | | |
161990 | | | 926800 | 0.744561 | | |
161997 | | | 926900 | 0.746143 | | |
170477 | | | 1012400 | 0.676673 | | |
201343 | | | 1356000 | 0.821650 | | |
201356 | | | 1356200 | 0.811972 | | |
Edited to show a subset of the tsv file.
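To compare the two sources directly, the two tsv files can be merged on the learning iteration. A hedged pandas sketch, assuming both files use the column layout shown above (learning iteration in the `Name` column, measured CER in the `EvalCER` column); the file names are placeholders:

```python
import pandas as pd

# Placeholder file names for the two tsv files shown above.
training_eval = pd.read_csv("eval_during_training.tsv", sep="\t")
lstmeval_runs = pd.read_csv("lstmeval_checkpoints.tsv", sep="\t")

# Join on the learning iteration (the Name column) and compare the two BCERs.
merged = training_eval.merge(lstmeval_runs, on="Name",
                             suffixes=("_training_eval", "_lstmeval"))
merged["BCER_difference"] = merged["EvalCER_lstmeval"] - merged["EvalCER_training_eval"]
print(merged[["Name", "EvalCER_training_eval", "EvalCER_lstmeval", "BCER_difference"]])
```

On the subsets shown above, the only learning iteration present in both files is 201356, where the difference is 0.811972 - 0.794562 ≈ 0.017 percentage points.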
So, for the checkpoint with minimal BCER during training, the values are:
- checkpoint: training iterations = 1356200, learning iterations = 201356, BCER = 0.110
- lowest BCER eval during training: learning iterations = 200061, BCER = 0.628007
- BCER eval during training at the same learning iterations: learning iterations = 201356, BCER = 0.794562
- lstmeval: training iterations = 1356200, learning iterations = 201356, BCER = 0.811972
As shown above, the lstmeval BCER (0.811972) differs from the BCER eval reported during lstmtraining for the same number of learning iterations (0.794562). In my opinion, both should give the same result.
So there seems to be some error in the reporting of the learning iteration number for BCER eval. Maybe it would help if both training iterations and learning iterations were reported for the BCER eval done during lstmtraining.
Do both lstmtraining and lstmeval evaluate exactly the same subset of images+ground truth and in the same order in each evaluation cycle?
Both evaluate the same subset of images+ground truth, the ones listed in list.eval. Training is done on list.train. lstmeval run on checkpoints uses list.eval and goes through it sequentially. I do not know how the order of files for eval is decided during lstmtraining.
I also do not know what the learning iteration number reported by eval during lstmtraining refers to, i.e. whether it is the current learning iteration number at the time the eval is reported or the saved iteration number from when the eval was started.
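For reference, a minimal sketch of driving lstmeval from a script on a traineddata file built from a checkpoint; the paths are placeholders, the flags (`--model`, `--eval_listfile`) should be checked against `lstmeval --help` for your build, and the summary line is assumed to use the same `Eval Char error rate=` wording as during training:

```python
import re
import subprocess

def run_lstmeval(traineddata, eval_listfile):
    """Run lstmeval on one traineddata file over the files listed in list.eval
    and return the reported character error rate, or None if the summary line
    could not be found."""
    result = subprocess.run(
        ["lstmeval", "--model", traineddata, "--eval_listfile", eval_listfile],
        capture_output=True, text=True, check=True,
    )
    # The summary may be written to stderr or stdout depending on the build.
    output = result.stderr + result.stdout
    match = re.search(r"Eval Char error rate=([\d.]+)", output)
    return float(match.group(1)) if match else None

# Placeholder usage:
# print(run_lstmeval("eng_checkpoint_fast.traineddata", "list.eval"))
```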
I'm having the same issue. Might this be related to https://github.com/tesseract-ocr/tesstrain/issues/110?
I can confirm this is still an issue. In my case, the difference is much worse:
(plots: plot_log, plot_cer)
As you can see, the lstmeval BCER is close to 100%, while the lstmtraining BCER is around 11%.
If I replace `fast` models with `best` models for checkpoint extraction in the rules for `make plot`, then the difference becomes benign:
(plots: plot_log, plot_cer)
So for me the observation that `convert_to_int` is the culprit seems true.
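To make the two extraction paths explicit, here is a hedged sketch of building both a float ('best') and an int ('fast') traineddata from the same checkpoint. The paths are placeholders; the only difference between the two is the `--convert_to_int` flag added on top of `--stop_training`:

```python
import subprocess

def extract_traineddata(checkpoint, starter_traineddata, output, fast=False):
    """Turn a training checkpoint into a standalone traineddata file.
    With fast=True the float weights are additionally converted to int,
    which is the 'fast' extraction discussed above."""
    cmd = [
        "lstmtraining", "--stop_training",
        "--continue_from", checkpoint,         # path to the .checkpoint file
        "--traineddata", starter_traineddata,  # the traineddata training started from
        "--model_output", output,
    ]
    if fast:
        cmd.append("--convert_to_int")         # integer ('fast') model
    subprocess.run(cmd, check=True)

# Placeholder usage: build both variants from one checkpoint, then compare
# them with lstmeval on the same list.eval.
# extract_traineddata("foo.checkpoint", "foo/foo.traineddata", "foo_best.traineddata")
# extract_traineddata("foo.checkpoint", "foo/foo.traineddata", "foo_fast.traineddata", fast=True)
```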
Another example pointing in a similar direction (different Tesseract/Tesstrain installation, different data):
(plots: plot_log, plot_cer)
And again with best instead of fast extraction:
(plots: plot_log, plot_cer)
Note that if I apply the models directly with the tesseract CLI, I can reproduce the behaviour shown in the plots – results are gibberish with the fast models, but ok with the best models.
The net spec for 'best' and 'fast' is not the same.
Every 'fast' model was converted to int from a float model (but not from 'best'). The float models that were the origin of the 'fast' models were never released publicly.
https://github.com/tesseract-ocr/tessdoc/blob/441f1ea328421e/Data-Files-in-tessdata_best.md#version-string--40000alpha--network-specification-for-tessdata_best
https://github.com/tesseract-ocr/tessdoc/blob/441f1ea328421e/Data-Files-in-tessdata_fast.md
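To check the version string (which records the network spec used at training time) in any given traineddata file, combine_tessdata can list the file's contents; a small sketch, assuming its `-d` (list content) option:

```python
import subprocess

def list_traineddata(traineddata):
    """Print the component listing and version string of a traineddata file
    using combine_tessdata's -d option."""
    result = subprocess.run(
        ["combine_tessdata", "-d", traineddata],
        capture_output=True, text=True, check=True,
    )
    # The listing may go to stdout or stderr depending on the build.
    print(result.stdout or result.stderr)

# Placeholder usage:
# list_traineddata("eng.traineddata")
```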
@amitdo
> The net spec for 'best' and 'fast' is not the same.

That's not true. The VGSL spec / net_mode is the same; only the extraction method differs.
Anyway, that's irrelevant here, since the problem appears independent of where the training started (pretrained models in the tessdata repos or from scratch). The relevant difference is the checkpoint extraction method.
BTW, I am not saying `fast` always behaves like this; it's still somewhat surprising. I guess it depends on the course taken during lstmtraining – perhaps subtrainer feedback or other events. But users should avoid the `fast` method for now to be on the safe side, IMO.
@Shreeshrii have you by any chance noticed any particular event in the training log which we can use to track this down?