Feature Request: list.train and list.eval from different folders

Open Shreeshrii opened this issue 4 years ago • 8 comments

The current implementation creates all-lstmf from the foo-ground-truth directory and splits it in two at the specified ratio using the head and tail commands.

The disadvantage of this approach is that when the training data contains only a few samples of some characters, there is no way to ensure that they are evenly divided between the training and eval groups. So it is quite possible that some characters are not used for training at all.
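To illustrate the problem, here is a minimal sketch. It is not taken from the tesstrain Makefile; it only assumes that the split takes the first part of the list for training and the remainder for eval, as a head and tail split would. The function names `ratio_split` and `chars_missing_from_training` are hypothetical.

```python
def ratio_split(items, ratio):
    """Mimic a head/tail split: the first part goes to training, the rest to eval."""
    n_train = int(len(items) * ratio)
    return items[:n_train], items[n_train:]


def chars_missing_from_training(train_texts, eval_texts):
    """Return characters that occur in the eval texts but never in the training texts."""
    train_chars = set("".join(train_texts))
    eval_chars = set("".join(eval_texts))
    return eval_chars - train_chars


# Hypothetical ground-truth lines; "q" occurs in only one sample.
samples = ["abc", "abd", "abe", "abq"]
train, evaluation = ratio_split(samples, 0.75)
print(chars_missing_from_training(train, evaluation))
```

With a rare character like this, whether it is ever trained on depends entirely on where its sample happens to fall in the list.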

I suggest letting the user specify two directories, one with training data and one with testing data.
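A rough sketch of what this could look like when done by hand, assuming each list file simply contains one .lstmf path per line. The directory names in the usage comment and the helper `write_list` are hypothetical, not part of tesstrain.

```python
from pathlib import Path


def write_list(lstmf_dir, list_file):
    """Write the sorted paths of all *.lstmf files in lstmf_dir to list_file, one per line."""
    paths = sorted(str(p) for p in Path(lstmf_dir).glob("*.lstmf"))
    Path(list_file).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths


# Hypothetical usage with two separate directories:
# write_list("foo-train-ground-truth", "list.train")
# write_list("foo-eval-ground-truth", "list.eval")
```

Keeping the two directories separate lets the user decide which samples of a rare character go to training and which to evaluation.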

Additionally, it would be great to split the testing data further into two groups for eval and validation. One of the changes in PR #207 does this split with the existing head and tail approach. EDIT: see https://github.com/tesseract-ocr/tesstrain/pull/217

Shreeshrii avatar Dec 14 '20 10:12 Shreeshrii

Hi, could you specify which command does this: "Current implementation creates all-lstmf from the foo-ground-truth directory"? Thanks!

becZzZhao avatar Dec 20 '20 23:12 becZzZhao

could you specify which command does this?

make lists --trace should show you all the commands executed for making the lists.

Shreeshrii avatar Dec 21 '20 02:12 Shreeshrii

Thanks!

becZzZhao avatar Dec 22 '20 12:12 becZzZhao

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 23 '21 05:01 stale[bot]

https://groups.google.com/g/tesseract-ocr/c/HFpYH5i7VRw/m/72tnGgCmDAAJ

Question regarding use of custom list.train and list.eval

Shreeshrii avatar Jan 23 '21 05:01 Shreeshrii

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 22 '21 14:02 stale[bot]

It is always possible to create custom list.train and list.eval and use those instead of the ones created by the Makefile.

stweil avatar Feb 22 '21 14:02 stweil

It is always possible to create custom list.train and list.eval and use those instead of the ones created by the Makefile.

It could be documented, though.

However, there's a big catch: the timestamp is important. If your manual list.train and list.eval are older than any of the *.gt.txt files (or the *.lstmf files derived from them), they will be overwritten by the next make run. So perhaps we should offer some explicit manual override?
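As a rough way to check for this before running make, the sketch below compares modification times. It assumes the list would be regenerated whenever any *.gt.txt or *.lstmf file in the ground-truth directory is newer than it; the helper names `list_is_stale` and `touch_list` are hypothetical and not part of tesstrain.

```python
import os
from pathlib import Path


def list_is_stale(list_file, data_dir, patterns=("*.gt.txt", "*.lstmf")):
    """Return True if any matching file in data_dir is newer than list_file."""
    list_mtime = Path(list_file).stat().st_mtime
    for pattern in patterns:
        for path in Path(data_dir).glob(pattern):
            if path.stat().st_mtime > list_mtime:
                return True
    return False


def touch_list(list_file):
    """Set the list file's access and modification times to now."""
    os.utime(list_file, None)
```

Calling `touch_list` on a hand-made list.train or list.eval after the ground truth has been prepared is one possible workaround under this assumption, though an explicit override in the Makefile would be more robust.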

bertsky avatar Jun 03 '21 11:06 bertsky