tesstrain
Reproducible training and evaluation
Currently the list is created from all `lstmf` files by using `sort -R`, which produces a randomly ordered list. That solution gives results which cannot be reproduced, which is undesirable for several reasons.
Using a pseudo-random sort with a known seed, which can optionally be set by the user, would fix this.
https://github.com/stweil/ocrd-train/commit/32b63c5c28173d7f7ac2cf23da95714e23045190 implements a possible solution. It is also faster.
From @wrznr (see #72):

> There is a way to pass seeds to `sort -R`:
>
> ```
> seq 1 100 | sort -R --random-source=<(openssl enc -aes-256-ctr -pass pass:"42" -nosalt </dev/zero 2>/dev/null)
> ```
>
> where 42 is the seed. This may make the additional script superfluous.
Thank you for the hint. As far as I can see, there is no need for `openssl`. This works, too:

```
seq 1 100 | sort -R --random-source=<(echo 42)
```
Even better. Although Python may be faster, I'd like to stick with coreutils.
It only makes a difference of about 30 s on a large data set with 300000 lines, so sticking to coreutils is reasonable. Done in PR #76.
There remains a problem: the code currently shuffles the result lines from `find`. Those lines are unsorted, so they must be sorted before they are shuffled to get a reproducible result.
Pull request #77 addresses that additional problem.
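The combination of both steps can be sketched as follows (the `data` directory and the seed 42 are illustrative values, not taken from the actual Makefile):

```shell
# Sketch: make the shuffled list reproducible by first sorting the
# unordered `find` output, then shuffling it with a seeded random source.
# The directory name and seed are illustrative.
find data -name '*.lstmf' \
  | sort \
  | sort -R --random-source=<(openssl enc -aes-256-ctr \
      -pass pass:"42" -nosalt </dev/zero 2>/dev/null)
```

The initial `sort` removes the file-system-dependent ordering of `find`, so the seeded shuffle always starts from the same input and therefore always produces the same output.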
> This works, too

The simple code without `openssl` works on macOS (where I had run that test), but fails on Linux. I added a commit to PR #77 to fix that.
The lists for training and evaluation are now reproducible. First tests with the latest code show that the training results are nevertheless not reproducible: running the same training twice on one machine with the latest Tesseract gives different results. This needs more examination. I therefore updated the subject of this issue to cover that.
My current hypothesis is that the different results might be caused by differences in the calculation of the dot product. To test this, the same training must be run with identical Tesseract versions on different machines with similar hardware (AVX support). Maybe Tesseract must also be restricted to a single thread.
Very, very interesting. You can try with `--sequential_training` to check if that helps.
The rule which builds the `unicharset` from the box files also uses `find`. As that command does not sort its findings, the entries in the resulting `unicharset` have a more or less random order.
With a sorted `unicharset` and OpenMP disabled, the training process seems to be reproducible, at least on the same machine. I still have to see whether enabling OpenMP would also give reproducible results.
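One way to disable the OpenMP parallelism at run time is the standard OpenMP environment variable, which Tesseract respects (a sketch; the training invocation in the comment is illustrative):

```shell
# Restrict Tesseract / OpenMP to a single thread for the whole session,
# so floating-point summation order stays fixed between runs.
# Afterwards run the training as usual, e.g. `make training ...`
# (illustrative invocation).
export OMP_THREAD_LIMIT=1
```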
I tried the same training also on a different host. While it worked nicely on the Intel server with a CER of 14.47 % after 10000 iterations, it did not on a 64-bit ARM system, where the CER did not drop below 100 %. That's really strange and needs more examination.