tesstrain
Reproducible training and evaluation
Currently the list is created from all `lstmf` files by using `sort -R`, which produces a randomly ordered list. That solution gives results which cannot be reproduced, which is undesirable for several reasons.
Using a pseudo-random sort with a known seed, which can optionally be set by the user, would fix this.
https://github.com/stweil/ocrd-train/commit/32b63c5c28173d7f7ac2cf23da95714e23045190 implements a possible solution. It is also faster.
From @wrznr (see #72):

> There is a way to pass seeds to `sort -R`:
>
> ```
> seq 1 100 | sort -R --random-source=<(openssl enc -aes-256-ctr -pass pass:"42" -nosalt </dev/zero 2>/dev/null)
> ```
>
> where 42 is the seed. This may make the additional script superfluous.
Thank you for the hint. As far as I can see, there is no need for `openssl`. This works, too:

```
seq 1 100 | sort -R --random-source=<(echo 42)
```
Even better. Although Python may be faster, I'd like to stick with coreutils.
It only makes a difference of about 30 s on a large data set with 300000 lines, so sticking to coreutils is reasonable. Done in PR #76.
There remains a problem: the code currently shuffles the result lines from `find`. Those lines are unsorted, so they must be sorted before they are shuffled to get a reproducible result.
Pull request #77 addresses that additional problem.
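The combination of both steps can be sketched as follows (the `data` directory and the seed 42 are illustrative values, not taken from the actual Makefile):

```shell
# Sketch: make the shuffled list reproducible by first sorting the
# unordered `find` output, then shuffling it with a seeded random source.
# The directory name and seed are illustrative.
find data -name '*.lstmf' \
  | sort \
  | sort -R --random-source=<(openssl enc -aes-256-ctr \
      -pass pass:"42" -nosalt </dev/zero 2>/dev/null)
```

The initial `sort` removes the file-system-dependent ordering of `find`, so the seeded shuffle always starts from the same input and therefore always produces the same output.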
> This works, too

The simple code without `openssl` works on macOS (where I had run that test), but fails on Linux. I added a commit to PR #77 to fix that.
The lists for training and evaluation are now reproducible. First tests with the latest code show that the training results are nevertheless not reproducible: running the same training twice on one machine with the latest Tesseract gives different results. This needs more examination. I therefore updated the subject of this issue to cover that.
My current hypothesis is that the different results might be caused by differences in the calculation of the dot product. To test this, the same training must be run with identical Tesseract versions on different machines with similar hardware (AVX support). Maybe Tesseract must also be restricted to a single thread.
Very, very interesting. You can try with `--sequential_training` to check if that helps.
The rule which builds the `unicharset` from the box files also uses `find`. As that command does not sort its findings, the entries in the resulting `unicharset` have a more or less random order.
With a sorted `unicharset` and OpenMP disabled, the training process seems to be reproducible, at least on the same machine. I still have to see whether enabling OpenMP would also give reproducible results.
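One way to disable the OpenMP parallelism at run time is the standard OpenMP environment variable, which Tesseract respects (a sketch; the training invocation in the comment is illustrative):

```shell
# Restrict Tesseract / OpenMP to a single thread for the whole session,
# so floating-point summation order stays fixed between runs.
# Afterwards run the training as usual, e.g. `make training ...`
# (illustrative invocation).
export OMP_THREAD_LIMIT=1
```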
I tried the same training also on a different host. While it worked nicely on the Intel server with a CER of 14.47 % after 10000 iterations, it did not on a 64-bit ARM system, where the CER did not drop below 100 %. That's really strange and needs more examination.