BCER eval displayed during lstmtraining and that from lstmeval are different
While trying to plot the error rates for training, I have come across an anomaly.
I use the LOG file generated from the messages output during the lstmtraining run, which also reports BCER eval
on completion of each evaluation over the eval list. This message displays the error rates as well as the learning iteration.
I separately run lstmeval on various checkpoint traineddata files to get the error rates.
I have found that the BCER eval displayed during lstmtraining and that from lstmeval are different for the same learning iteration.
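For anyone who wants to extract the same numbers, here is a minimal sketch of parsing the eval messages out of the lstmtraining LOG file. It assumes eval lines of the form `At iteration N, stage S, Eval Char error rate=X, Word error rate=Y` (the exact wording may differ between Tesseract versions), and the log path is a placeholder.

```python
import re

# Assumed format of the eval message printed during lstmtraining; adjust the
# pattern if your Tesseract version words it differently.
EVAL_RE = re.compile(
    r"At iteration (\d+), stage (\d+), "
    r"Eval Char error rate=([\d.]+), Word error rate=([\d.]+)"
)

def parse_eval_lines(log_path):
    """Yield (learning_iteration, stage, char_error, word_error) per eval message."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = EVAL_RE.search(line)
            if match:
                iteration, stage, cer, wer = match.groups()
                yield int(iteration), int(stage), float(cer), float(wer)

if __name__ == "__main__":
    # "training.LOG" is a placeholder path.
    for iteration, stage, cer, wer in parse_eval_lines("training.LOG"):
        print(f"{iteration}\t{cer:.6f}\t{wer:.6f}")
```

Note that these messages only carry the learning iteration, not the training iteration, which matters for the comparison below.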
Below are plots from a recent training, trying to add superscripts to the English traineddata.
The first is the chart generated from the lstmtraining log file, plotting the BCER every 100 iterations, every checkpoint and every eval. Since BCER eval
only reports the learning iteration value, the main x axis plots learning iterations.
The second chart plots the CER values calculated with lstmeval, ISRI OCR evaluation and OCRevaluation for fast traineddata files built from checkpoints with BCER less than 1%, in addition to the BCER from lstmtraining at every 100 iterations and at every checkpoint. It uses training iterations as the main x axis and therefore does not include the evals done during training (which are included in the chart above).
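For illustration, a minimal matplotlib sketch of this kind of chart, assuming the per-100-iteration BCER and the eval BCER have already been collected into parallel lists (the function name and labels are mine, not from any tesstrain plotting script):

```python
import matplotlib.pyplot as plt

def plot_bcer(train_iters, train_bcer, eval_iters, eval_bcer, out_png="bcer.png"):
    """Plot the training BCER (reported every 100 iterations) as a line and the
    BCER eval values as points, with learning iterations on the x axis."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(train_iters, train_bcer, linewidth=0.8, label="BCER every 100 iterations")
    ax.scatter(eval_iters, eval_bcer, color="red", s=12, label="BCER eval during lstmtraining")
    ax.set_xlabel("learning iterations")
    ax.set_ylabel("BCER (%)")
    ax.legend()
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```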
Here is the tsv file for eval during lstmtraining:
Name | CheckpointCER | LearningIteration | TrainingIteration | EvalCER | IterationCER | SubtrainerCER |
---|---|---|---|---|---|---|
197735 | | | | 0.750366 | | |
199063 | | | | 0.687183 | | |
200061 | | | | 0.628007 | | |
201356 | | | | 0.794562 | | |
202510 | | | | 0.916657 | | |
204432 | | | | 0.882159 | | |
210088 | | | | 0.780537 | | |
211706 | | | | 0.798433 | | |
212529 | | | | 0.816408 | | |
238558 | | | | 0.703124 | | |
240123 | | | | 0.788989 | | |
Edited to show a subset of the tsv file.
Here is the tsv file with lstmeval BCER for checkpoints with BCER less than 1%.
Name | CheckpointCER | LearningIteration | TrainingIteration | EvalCER | IterationCER | SubtrainerCER |
---|---|---|---|---|---|---|
145368 | | | 766900 | 1.017711 | | |
161436 | | | 921200 | 0.722449 | | |
161990 | | | 926800 | 0.744561 | | |
161997 | | | 926900 | 0.746143 | | |
170477 | | | 1012400 | 0.676673 | | |
201343 | | | 1356000 | 0.821650 | | |
201356 | | | 1356200 | 0.811972 | | |
Edited to show a subset of the tsv file.
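To compare the two sources directly, the two tsv files can be merged on the learning iteration. A hedged pandas sketch, assuming both files use the column layout shown above (learning iteration in the `Name` column, measured CER in the `EvalCER` column); the file names are placeholders:

```python
import pandas as pd

# Placeholder file names for the two tsv files shown above.
training_eval = pd.read_csv("eval_during_training.tsv", sep="\t")
lstmeval_runs = pd.read_csv("lstmeval_checkpoints.tsv", sep="\t")

# Join on the learning iteration (the Name column) and compare the two BCERs.
merged = training_eval.merge(lstmeval_runs, on="Name",
                             suffixes=("_training_eval", "_lstmeval"))
merged["BCER_difference"] = merged["EvalCER_lstmeval"] - merged["EvalCER_training_eval"]
print(merged[["Name", "EvalCER_training_eval", "EvalCER_lstmeval", "BCER_difference"]])
```

On the subsets shown above, the only learning iteration present in both files is 201356, where the difference is 0.811972 - 0.794562 ≈ 0.017 percentage points.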
So, for the checkpoint with minimal BCER during training, the values are:
- checkpoint: training iterations = 1356200, learning iterations = 201356, BCER = 0.110
- lowest BCER eval during training: learning iterations = 200061, BCER = 0.628007
- BCER eval during training at the same learning iterations: learning iterations = 201356, BCER = 0.794562
- lstmeval: training iterations = 1356200, learning iterations = 201356, BCER = 0.811972
As shown above, the lstmeval BCER (0.811972) differs from the BCER eval reported during lstmtraining for the same number of learning iterations (0.794562). In my opinion, both should give the same result.
So there seems to be some error in the reporting of the learning iteration number for BCER eval. Maybe it would help if both training iterations and learning iterations were reported for the BCER eval done during lstmtraining.
Do both lstmtraining and lstmeval evaluate exactly the same subset of images+ground truth and in the same order in each evaluation cycle?
Both evaluate the same subset of images+ground truth, the ones listed in list.eval. Training is done on list.train. lstmeval run on checkpoints uses list.eval and goes through it sequentially. I do not know how the order of files for eval is decided during lstmtraining.
I also do not know what the learning iteration number reported by eval during lstmtraining refers to, i.e. whether it is the current learning iteration number at the time the eval is reported or the saved iteration number from when the eval was started.
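For reference, a minimal sketch of driving lstmeval from a script on a traineddata file built from a checkpoint; the paths are placeholders, the flags (`--model`, `--eval_listfile`) should be checked against `lstmeval --help` for your build, and the summary line is assumed to use the same `Eval Char error rate=` wording as during training:

```python
import re
import subprocess

def run_lstmeval(traineddata, eval_listfile):
    """Run lstmeval on one traineddata file over the files listed in list.eval
    and return the reported character error rate, or None if the summary line
    could not be found."""
    result = subprocess.run(
        ["lstmeval", "--model", traineddata, "--eval_listfile", eval_listfile],
        capture_output=True, text=True, check=True,
    )
    # The summary may be written to stderr or stdout depending on the build.
    output = result.stderr + result.stdout
    match = re.search(r"Eval Char error rate=([\d.]+)", output)
    return float(match.group(1)) if match else None

# Placeholder usage:
# print(run_lstmeval("eng_checkpoint_fast.traineddata", "list.eval"))
```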
I'm having the same issue. Might this be related to https://github.com/tesseract-ocr/tesstrain/issues/110?
I can confirm this is still an issue. In my case, the difference is much worse:
(plots: plot_log, plot_cer)
As you can see, the lstmeval BCER is close to 100%, while the lstmtraining BCER is around 11%.
If I replace `fast` models with `best` models for checkpoint extraction in the rules for `make plot`, then the difference becomes benign:
(plots: plot_log, plot_cer)
So for me the observation that `convert_to_int` is the culprit seems true.
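To make the two extraction paths explicit, here is a hedged sketch of building both a float ('best') and an int ('fast') traineddata from the same checkpoint. The paths are placeholders; the only difference between the two is the `--convert_to_int` flag added on top of `--stop_training`:

```python
import subprocess

def extract_traineddata(checkpoint, starter_traineddata, output, fast=False):
    """Turn a training checkpoint into a standalone traineddata file.
    With fast=True the float weights are additionally converted to int,
    which is the 'fast' extraction discussed above."""
    cmd = [
        "lstmtraining", "--stop_training",
        "--continue_from", checkpoint,         # path to the .checkpoint file
        "--traineddata", starter_traineddata,  # the traineddata training started from
        "--model_output", output,
    ]
    if fast:
        cmd.append("--convert_to_int")         # integer ('fast') model
    subprocess.run(cmd, check=True)

# Placeholder usage: build both variants from one checkpoint, then compare
# them with lstmeval on the same list.eval.
# extract_traineddata("foo.checkpoint", "foo/foo.traineddata", "foo_best.traineddata")
# extract_traineddata("foo.checkpoint", "foo/foo.traineddata", "foo_fast.traineddata", fast=True)
```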
Another example pointing in a similar direction (different Tesseract/Tesstrain installation, different data):
(plots: plot_log, plot_cer)
And again with best instead of fast extraction:
(plots: plot_log, plot_cer)
Note that if I apply the models directly with the tesseract CLI, I can reproduce the behaviour shown in the plots – results are gibberish with the fast models, but ok with the best models.
The net spec for 'best' and 'fast' is not the same.
Every 'fast' model was converted to int from a float model (but not from 'best'). The float models that were the origin of the 'fast' models were never released publicly.
https://github.com/tesseract-ocr/tessdoc/blob/441f1ea328421e/Data-Files-in-tessdata_best.md#version-string--40000alpha--network-specification-for-tessdata_best
https://github.com/tesseract-ocr/tessdoc/blob/441f1ea328421e/Data-Files-in-tessdata_fast.md
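To check the version string (which records the network spec used at training time) in any given traineddata file, combine_tessdata can list the file's contents; a small sketch, assuming its `-d` (list content) option:

```python
import subprocess

def list_traineddata(traineddata):
    """Print the component listing and version string of a traineddata file
    using combine_tessdata's -d option."""
    result = subprocess.run(
        ["combine_tessdata", "-d", traineddata],
        capture_output=True, text=True, check=True,
    )
    # The listing may go to stdout or stderr depending on the build.
    print(result.stdout or result.stderr)

# Placeholder usage:
# list_traineddata("eng.traineddata")
```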
@amitdo
> The net spec for 'best' and 'fast' is not the same.

That's not true. The VGSL spec / net_mode is the same; only the extraction method differs.
Anyway, that's irrelevant here, since the problem appears independent of where the training started (pretrained models in the tessdata repos or from scratch). The relevant difference is the checkpoint extraction method.
BTW, I am not saying `fast` always behaves like this; it's still somewhat surprising. I guess it depends on the course taken during lstmtraining – perhaps subtrainer feedback or other events. But users should avoid the `fast` method for now to be on the safe side, IMO.
@Shreeshrii have you by any chance noticed any particular event in the training log which we can use to track this down?