tesstrain

Reproducible training and evaluation

stweil opened this issue 4 years ago • 12 comments

Currently the list is created from all lstmf files using sort -R, which produces a randomly ordered list. That approach gives results which cannot be reproduced, which is undesirable for several reasons.

Using a pseudo-random sort with a known seed, which could optionally be set by the user, would fix this.
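
For illustration, the current rule boils down to something like the following sketch (placeholder paths, not the literal Makefile recipe):

find data/foo-ground-truth -name '*.lstmf' | sort -R > data/foo/all-lstmf

Every run of sort -R produces a different order, so the subsequent split into training and evaluation lists differs from run to run.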

stweil avatar Aug 13 '19 09:08 stweil

https://github.com/stweil/ocrd-train/commit/32b63c5c28173d7f7ac2cf23da95714e23045190 implements a possible solution. It is also faster.

stweil avatar Aug 15 '19 14:08 stweil

From @wrznr (see #72):

There is a way to pass seeds to sort -R:

seq 1 100 | sort -R --random-source=<(openssl enc -aes-256-ctr -pass pass:"42" -nosalt </dev/zero 2>/dev/null)

where 42 is the seed. This may make the additional script superfluous.

Thank you for the hint. As far as I can see, there is no need for openssl. This works, too:

seq 1 100 | sort -R --random-source=<(echo 42)

stweil avatar Aug 15 '19 15:08 stweil

Even better. Although Python may be faster, I'd like to stick with coreutils.

wrznr avatar Aug 15 '19 15:08 wrznr

It only makes a difference of about 30 s on a large data set with 300000 lines, so sticking to coreutils is reasonable. Done in PR #76.

stweil avatar Aug 15 '19 15:08 stweil

There remains a problem:

The code currently shuffles the lines found by find, but find does not return them in any defined order. They must therefore be sorted before they are shuffled to get a reproducible result.
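
Put together, the pipeline needs a deterministic sort before the seeded shuffle; roughly like this (a sketch with placeholder paths, using the seed source discussed above):

find data/foo-ground-truth -name '*.lstmf' | sort | sort -R --random-source=<(echo 42)

The first sort fixes the order of find's output; the second one then applies the reproducible shuffle.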

stweil avatar Aug 16 '19 08:08 stweil

Pull request #77 addresses that additional problem.

stweil avatar Aug 16 '19 08:08 stweil

This works, too

The simple code without openssl works on macOS (where I had run that test), but fails on Linux, presumably because GNU sort reads more bytes from the random source than echo 42 provides. I added a commit to PR #77 to fix that.
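
A coreutils-only seed source that provides enough bytes on both platforms (a sketch; not necessarily the fix applied in the PR) is an endless repetition of the seed:

seq 1 100 | sort -R --random-source=<(yes 42)

yes never runs dry, and sort only reads the few bytes it needs before closing the pipe, so the result stays deterministic.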

stweil avatar Aug 17 '19 14:08 stweil

The lists for training and evaluation are now reproducible. First tests with the latest code show that the training results are nevertheless not reproducible: a comparison of repeated runs of the same training on one machine with the latest Tesseract gave different results. This needs more examination. I therefore updated the subject of this issue to cover that.

My current hypothesis is that the different results might be caused by differences in the calculation of the dot product. To test this, the same training must be run with identical Tesseract versions on different machines with similar hardware (AVX support). Maybe Tesseract must also be restricted to a single thread.
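
Restricting a run to a single thread does not require rebuilding anything; Tesseract honours the standard OpenMP environment variables, e.g. when invoking the tesstrain Makefile (MODEL_NAME=foo is a placeholder):

OMP_THREAD_LIMIT=1 make training MODEL_NAME=foo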

stweil avatar Aug 23 '19 08:08 stweil

Very, very interesting.

wrznr avatar Aug 23 '19 08:08 wrznr

You can try with --sequential_training to check if that helps.
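
For reference, --sequential_training is a flag of lstmtraining, so a direct invocation would look roughly like this (all file names are placeholders, further network options omitted):

lstmtraining --sequential_training --traineddata data/foo/foo.traineddata --train_listfile data/foo/list.train --model_output data/foo/checkpoints/foo ...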

Shreeshrii avatar Aug 28 '19 12:08 Shreeshrii

The rule which builds the unicharset from the box files also uses find. As find does not sort its results, the entries in the resulting unicharset end up in a more or less random order.
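
Piping find through sort would fix that here as well (a sketch; the paths are placeholders and the Makefile's actual extraction command may differ):

find data/foo-ground-truth -name '*.box' | sort | xargs unicharset_extractor --output_unicharset data/foo/unicharset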

stweil avatar Oct 03 '19 12:10 stweil

With a sorted unicharset and OpenMP disabled, the training process seems to be reproducible, at least on the same machine. I still have to check whether enabled OpenMP also gives reproducible results.
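
For reference, OpenMP can be disabled when building Tesseract with a standard configure option, in addition to the runtime variable mentioned earlier:

./configure --disable-openmp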

I tried the same training on a different host as well. While it worked nicely on the Intel server with a CER of 14.47 % after 10000 iterations, it did not on a 64-bit ARM system, where the CER never dropped below 100 %. That is really strange and needs more examination.

stweil avatar Oct 03 '19 20:10 stweil