
use norm_mode 1 as default

bertsky opened this issue 3 years ago • 9 comments

Not sure if this is related to #53: why is the current default NORM_MODE set to 2 for non-Indic, non-RTL languages? Shouldn't it be 1?

Also, the decision tree looks quite different from the corresponding one in tesseract/src/training/language-specific.sh. Does anyone know how that came to be?

bertsky avatar Jun 03 '21 11:06 bertsky

Plus (just to be sure): Am I correct in assuming that under 2, combining characters get recoded as an extra symbol, whereas under 1 they are merged with the base character?
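
For illustration (not Tesseract's actual recoder code, just an analogy in terms of Unicode normalization), the distinction I have in mind looks like this:

```python
# Analogy only: mimic the two readings of norm_mode with Python's unicodedata.
# Under mode 1 a base character plus combining mark would form ONE symbol,
# under mode 2 the combining mark would be recoded as an EXTRA symbol.
import unicodedata

text = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# mode-1-like view: merge the combining mark into its base character
merged = unicodedata.normalize("NFC", text)
print([hex(ord(c)) for c in merged])  # ['0xe9'] -> one precomposed symbol

# mode-2-like view: keep base character and combining mark separate
split = unicodedata.normalize("NFD", text)
print([hex(ord(c)) for c in split])   # ['0x65', '0x301'] -> two symbols
```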

bertsky avatar Jun 03 '21 11:06 bertsky

Decision seems to derive from here: c90cd3f27acbacc8d30db1b44d1c017aecc7bf20

@wrznr could you please elaborate on the kind of feedback you gave (or link to it)?

bertsky avatar Jun 04 '21 08:06 bertsky

> @wrznr could you please elaborate on the kind of feedback you gave (or link to it)?

Answer (on another channel): here – a simple question.

IMHO the response should have been to rethink the old default in tesstrain, not immediately revert to it.

As stated above, Tesseract's own default used to be 1 in src/training/language-specific.sh. But that file (along with all other shell scripts) has very recently been removed from tesseract by @stweil. It now resides here:

https://github.com/tesseract-ocr/tesstrain/blob/1d8238684fe81e600431e5bdfe7dd24fbeaaf9f9/src/training/language_specific.py#L1373

So, again, why not 1 by default, and is my interpretation regarding combining characters correct?
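
To make the disagreement concrete, here is a hypothetical paraphrase of the disputed branch (a sketch of the behavior described in this thread, not the actual contents of either file; `is_indic` and `is_rtl` are made-up stand-ins for the language lists):

```python
def default_norm_mode(lang: str, is_indic, is_rtl) -> int:
    """Hypothetical paraphrase of the default questioned in this issue."""
    if is_indic(lang) or is_rtl(lang):
        # both files handle these scripts in their own branches
        raise NotImplementedError("out of scope for this sketch")
    # tesstrain currently chooses 2 (split graphemes) here; Tesseract's old
    # language-specific.sh chose 1 (combine graphemes), which this issue
    # argues should be the default again.
    return 2
```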

bertsky avatar Jun 04 '21 09:06 bertsky

@Shreeshrii it seems the original deviation regarding the --norm_mode default came from changes proposed by you (introducing finetuning here). Could you please elaborate on your choice?

bertsky avatar Jun 04 '21 09:06 bertsky

@bertsky It looks like the initial PR by @kba and myself was not prepared carefully enough. We took norm_mode for granted where we should have inspected its semantics in greater detail. When @Shreeshrii later tried to correct this with the option for setting the norm_mode in a sensible way, some misunderstanding occurred, leading to the current suboptimal setting. It would be great if we could fix this now that you have had the chance to dive deeper into the consequences of this parameter setting.

wrznr avatar Jun 04 '21 11:06 wrznr

Sure, I'll prepare a PR, but will first do some (training and) testing.

bertsky avatar Jun 04 '21 11:06 bertsky

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 08 '21 02:07 stale[bot]

> Sure, I'll prepare a PR, but will first do some (training and) testing.

@bertsky Did you find anything different in testing? I see that NORM_MODE is still 2 for non-Indic, non-RTL languages one year after this conversation.

giri-kum avatar Jul 24 '22 23:07 giri-kum

@giri-kum sorry, I don't remember anymore. I think I had some results, but they were inconclusive due to other problems. An isolated experiment should not be difficult to set up (run 10-20 trainings each with mode 1 and mode 2, evaluate the validation set with an external, true CER measurement, compare averages, and perhaps repeat with different GT sets or languages), but I don't have the time right now.
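
A minimal sketch of such a comparison, assuming a made-up directory layout (`runs/norm_mode_1/`, `runs/norm_mode_2/`, one subdirectory per training run) and a made-up `*.gt.txt` / `*.ocr.txt` pairing of ground truth and recognition output:

```python
# Sketch: compare mean CER between norm_mode 1 and norm_mode 2 training runs.
# Directory layout and file naming are assumptions, not tesstrain conventions.
from pathlib import Path
from statistics import mean

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(recognized: str, truth: str) -> float:
    """Character error rate against the ground truth."""
    return levenshtein(recognized, truth) / max(len(truth), 1)

def mean_cer(run_dir: Path) -> float:
    """Average CER over all *.gt.txt / *.ocr.txt pairs of one training run."""
    scores = []
    for gt in run_dir.glob("*.gt.txt"):
        ocr = Path(str(gt).replace(".gt.txt", ".ocr.txt"))  # hypothetical naming
        scores.append(cer(ocr.read_text(), gt.read_text()))
    return mean(scores)

# compare the averages over e.g. 10-20 runs per mode
mode1 = [mean_cer(run) for run in Path("runs/norm_mode_1").iterdir()]
mode2 = [mean_cer(run) for run in Path("runs/norm_mode_2").iterdir()]
print(f"norm_mode 1: {mean(mode1):.4f}  norm_mode 2: {mean(mode2):.4f}")
```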

bertsky avatar Aug 12 '22 15:08 bertsky