tesstrain Feature Request: list.train and list.eval from different folders

The current implementation creates all-lstmf from the foo-ground-truth directory and splits it in two at the specified ratio using the head and tail commands.
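To make the behaviour concrete, here is a minimal sketch of a ratio-based head/tail split. It is not the Makefile's actual recipe: the temporary directory, the ten placeholder entries, and the `ratio_percent` variable are assumptions made only for illustration.

```shell
#!/bin/sh
# Sketch of a ratio-based head/tail split (illustrative, not the tesstrain recipe).
set -eu

workdir=$(mktemp -d)
all_lstmf="$workdir/all-lstmf"

# Stand-in for the real list of .lstmf files: ten hypothetical entries.
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "data/foo-ground-truth/line$i.lstmf"
done > "$all_lstmf"

# Hypothetical ratio, expressed as an integer percentage to keep the arithmetic exact.
ratio_percent=90

total=$(grep -c '' "$all_lstmf")
train_count=$((total * ratio_percent / 100))
eval_count=$((total - train_count))

# The first train_count lines become the training list, the remaining lines the eval list.
head -n "$train_count" "$all_lstmf" > "$workdir/list.train"
tail -n "$eval_count" "$all_lstmf" > "$workdir/list.eval"
```

Because the cut is purely positional, which samples end up in the eval list depends only on the order of the lines in all-lstmf, not on which characters they contain.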

The disadvantage of this approach is that when some characters have only a limited number of samples in the training data, there is no way to ensure that they are evenly divided between the training and eval groups. So it is quite possible that some characters are not used for training at all.

I suggest letting the user specify two directories, one with training data and one with testing data.
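As a rough sketch of what this could look like, the two lists could be built directly from two directories instead of from one shuffled file. The directory names `foo-train` and `foo-eval` and the placeholder files are hypothetical, and nothing here is an existing tesstrain option.

```shell
#!/bin/sh
# Sketch: one list per directory (directory names are hypothetical).
set -eu

workdir=$(mktemp -d)
train_dir="$workdir/foo-train"
eval_dir="$workdir/foo-eval"
mkdir -p "$train_dir" "$eval_dir"

# Placeholder .lstmf files standing in for real training and testing data.
touch "$train_dir/a.lstmf" "$train_dir/b.lstmf" "$eval_dir/c.lstmf"

# Every .lstmf under the training directory goes into list.train,
# every .lstmf under the testing directory goes into list.eval.
find "$train_dir" -name '*.lstmf' | sort > "$workdir/list.train"
find "$eval_dir" -name '*.lstmf' | sort > "$workdir/list.eval"
```

With this layout the user decides which samples are used for training simply by choosing the directory each file is placed in.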

Additionally, it would be great to split the testing data further into two groups for eval and validation. One of the changes in PR#207 does this split with the existing head and tail approach. EDIT: see https://github.com/tesseract-ocr/tesstrain/pull/217

Dec 14 '20 10:12 Shreeshrii

Hi, could you specify which command does this: "Current implementation creates all-lstmf from the foo-ground-truth directory"? Thanks!

Dec 20 '20 23:12 becZzZhao

could you specify which command does this?

make lists --trace should show you all the commands executed for making the lists.

Dec 21 '20 02:12 Shreeshrii

Thanks!

Dec 22 '20 12:12 becZzZhao

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jan 23 '21 05:01 stale[bot]

https://groups.google.com/g/tesseract-ocr/c/HFpYH5i7VRw/m/72tnGgCmDAAJ

Question regarding use of custom list.train and list.eval

Jan 23 '21 05:01 Shreeshrii

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Feb 22 '21 14:02 stale[bot]

It is always possible to create custom list.train and list.eval and use those instead of the ones created by the Makefile.

Feb 22 '21 14:02 stweil

It is always possible to create custom list.train and list.eval and use those instead of the ones created by the Makefile.

It could be documented, though.

However, there's a big catch: the timestamp is important; if your manual list.train and list.eval are older than any of the *.gt.txt (or derived *.lstmf), then they will be overwritten by the next make. So perhaps we should offer some explicit manual override?
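Until such an override exists, the timestamp condition can be checked and worked around by hand. The sketch below assumes only make's general rule that a target is rebuilt when a prerequisite is newer; the paths and dates are hypothetical, and it also assumes the other prerequisites are themselves up to date.

```shell
#!/bin/sh
# Sketch: detect ground-truth files newer than a hand-made list, then refresh the list.
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/gt"

# A hand-made list with an old timestamp and a derived file with a newer one.
: > "$workdir/list.train"
touch -t 202001010000 "$workdir/list.train"
touch -t 202101010000 "$workdir/gt/a.lstmf"

# Any file printed here is newer than the list, so make would regenerate the list.
newer_before=$(find "$workdir/gt" -name '*.lstmf' -newer "$workdir/list.train")

# Updating the list's timestamp makes it newer than the files it depends on.
touch "$workdir/list.train"
newer_after=$(find "$workdir/gt" -name '*.lstmf' -newer "$workdir/list.train")
```

The same `touch` would have to be repeated for list.eval, and again whenever ground-truth files are added or changed.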

Jun 03 '21 11:06 bertsky
