tesstrain explicate .lstm-unicharset and my.unicharset prereqs for finetuning

(because training fails if a .unicharset was already created earlier, but for a different START_MODEL)

Jun 06 '21 16:06 bertsky

Wait. What if the original code did not target *.lstm-unicharset, but *.unicharset (or both)?

Jun 14 '21 10:06 bertsky

> Wait. What if the original code did not target *.lstm-unicharset, but *.unicharset (or both)?

Not relevant: we can only fine-tune from LSTM models, not (purely) Omnifont models.

Jun 14 '21 10:06 bertsky

But there's an additional issue that went unnoticed so far: the combine_lang_model recipe for PROTO_MODEL expects to find a $(DATA_DIR)/*.unicharset file for every script that get_script_from_script_id yields for the unicharset table, i.e. {Common,Latin,Greek,Cyrillic,Hebrew}.unicharset, plus an obscure Inherited.unicharset. But the current makefile (both master and this PR) fails to provide these!

That leads to warnings like the following:

Failed to load script unicharset from:/home/kmw/nfs/gt-rücklauf/Latin.unicharset
Warning: properties incomplete for index 3 = M
Warning: properties incomplete for index 4 = A
Warning: properties incomplete for index 5 = T
Warning: properties incomplete for index 6 = I
Warning: properties incomplete for index 7 = O
Warning: properties incomplete for index 8 = ,
...

I do not know whether this is harmful, but we should try to explicate all rules necessary to put these files into $(DATA_DIR).

Jun 14 '21 10:06 bertsky

I have no idea how to generate these files (except extracting from their respective script models).

@stweil, your published data directories do contain such files – did you put them there by hand, or could they come from some old tesstrain_utils.sh intermediates?

Jun 14 '21 11:06 bertsky

Perhaps we are missing the original set_unicharset_properties rule, which enriches the generated unicharset for the model?

Jun 14 '21 11:06 bertsky

> I have no idea how to generate these files (except extracting from their respective script models).
>
> @stweil, your published data directories do contain such files – did you put them there by hand, or could they come from some old tesstrain_utils.sh intermediates?

I copied them from https://github.com/tesseract-ocr/langdata_lstm (or used local symbolic links to a local copy of that repository). That fixes most warnings (all but Inherited.unicharset).

Jun 14 '21 11:06 stweil

> I copied them from https://github.com/tesseract-ocr/langdata_lstm (or used local symbolic links to a local copy of that repository). That fixes most warnings (all but Inherited.unicharset).

Oh, I see! But how could that have been forgotten in ocrd-train / tesstrain? Should we simply document this requirement, or fix this automatically by including a subrepo?

Jun 14 '21 11:06 bertsky

langdata_lstm is not a small repository, so I don't like the idea of having it as a subrepository.

Documenting the requirement could be a first step. Parsing the unicharset to find out which scripts are required and fetching the related files from the web if they are missing locally would be the better solution.
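A rough sketch of the parsing idea, in Python rather than make, and only under assumptions: that each unicharset entry line is whitespace-separated, with the script name in the third field (short form) or in the fourth field when a comma-separated metrics field comes first, and that the script files are named `<Script>.unicharset`. The names `required_scripts` and `missing_script_files` are made up for illustration and are not part of tesstrain.

```python
from pathlib import Path

# Tiny hand-written sample in the assumed unicharset layout (not real training data).
SAMPLE = """5
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined
T 5 0,255,0,255,0,0,0,0,0,0 Latin 56 0 3 T
, 10 0,255,0,255,0,0,0,0,0,0 Common 4 6 4 ,
α 3 0,255,0,255,0,0,0,0,0,0 Greek 5 0 5 α
"""


def required_scripts(unicharset_text):
    """Return the sorted script names referenced by a unicharset (assumed layout)."""
    scripts = set()
    # The first line only holds the number of entries, so skip it.
    for line in unicharset_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        # Assumption: a comma-separated metrics field, if present, precedes the script.
        script = fields[3] if "," in fields[2] else fields[2]
        if script != "NULL":
            scripts.add(script)
    return sorted(scripts)


def missing_script_files(scripts, langdata_dir):
    """List the <Script>.unicharset files that are not yet present locally."""
    base = Path(langdata_dir)
    names = [f"{script}.unicharset" for script in scripts]
    return [name for name in names if not (base / name).exists()]


if __name__ == "__main__":
    print(required_scripts(SAMPLE))
```

Only the files reported by `missing_script_files` would then need to be fetched, so a local copy or symlink of langdata_lstm would make any download unnecessary.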

Jun 14 '21 11:06 stweil

> langdata_lstm is not a small repository, so I don't like the idea of having it as a subrepository.
>
> Documenting the requirement could be a first step. Parsing the unicharset to find out which scripts are required and fetching the related files from the web if they are missing locally would be the better solution.

Agreed. But perhaps we could live without the extra effort of parsing the exact requirements, since the unicharset files themselves are quite small.

Since there's already a wget of https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt (and of tessdata_best|fast/eng.traineddata), I opt for a fully automatic solution based on downloads and will add a commit here (or in a new PR?).
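To illustrate what "download only what is missing" could look like outside of make, here is a minimal Python sketch. It assumes that the script unicharsets sit next to radical-stroke.txt under the same raw/master URL prefix, which is inferred from the link above and not verified here. The file list is an illustrative subset, and `plan_downloads` and `fetch` are hypothetical helper names.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL prefix taken from the radical-stroke.txt link; reusing it for the
# *.unicharset files is an assumption.
BASE_URL = "https://github.com/tesseract-ocr/langdata_lstm/raw/master"

# Illustrative subset only, not the full list of scripts.
FILES = [
    "radical-stroke.txt",
    "Common.unicharset",
    "Latin.unicharset",
    "Greek.unicharset",
    "Cyrillic.unicharset",
    "Hebrew.unicharset",
]


def plan_downloads(names, target_dir, base_url=BASE_URL):
    """Return (url, destination) pairs for every file not yet present locally."""
    target = Path(target_dir)
    return [
        (f"{base_url}/{name}", target / name)
        for name in names
        if not (target / name).exists()
    ]


def fetch(plan):
    """Download every planned file; existing files never appear in the plan."""
    for url, dest in plan:
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))


if __name__ == "__main__":
    for url, dest in plan_downloads(FILES, "data"):
        print(url, "->", dest)
```

Separating the plan from the actual download keeps the logic checkable without network access, and files that were copied or symlinked by hand are left alone.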

Jun 14 '21 11:06 bertsky

Done. Please re-review!

Jun 14 '21 12:06 bertsky

> Done. Please re-review!

Or should we place all *.unicharset and radical-stroke.txt into a subdirectory langdata to keep DATA_DIR tidy? (Would only need to change the script_dir argument ...)

Jun 14 '21 21:06 bertsky

> Or should we place all *.unicharset and radical-stroke.txt into a subdirectory langdata to keep DATA_DIR tidy? (Would only need to change the script_dir argument ...)

Let's do this! That way, if someone already had the complete https://github.com/tesseract-ocr/langdata checked out locally, one could simply copy/symlink it here, or point the LANGDATA_DIR to the right spot. And all these *.unicharset do look quite messy lying about in DATA_DIR...

Jun 24 '21 14:06 bertsky

Done. I have also updated from master to manually resolve the conflict, and added two minor improvements to the rules for all-gt / all-lstmf.

Jun 24 '21 21:06 bertsky

There was some additional fallout from the all-lstmf / all-gt speedups (which avoid repeating find): with large directories, the paste recipe would quickly run into E2BIG, because the expanded list of file names exceeds the operating system's limit on command-line length. This is a long-standing, nasty bug in make, for which the only workaround seems to be using make's file function, which I did manage to apply here.

Also added a new target charfreq, showing the character histogram of all .gt.txt files.
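For readers without make at hand, the idea behind such a character histogram can be sketched in a few lines of Python. This is not the charfreq recipe from the Makefile, only an illustration: `char_histogram` is a hypothetical name, and the sketch assumes the ground truth lives in UTF-8 encoded *.gt.txt files somewhere below one directory.

```python
from collections import Counter
from pathlib import Path


def char_histogram(gt_dir):
    """Count every character in all *.gt.txt files below gt_dir."""
    counts = Counter()
    for path in sorted(Path(gt_dir).rglob("*.gt.txt")):
        # Ignore the trailing newline so it does not show up as a character.
        text = path.read_text(encoding="utf-8").rstrip("\n")
        counts.update(text)
    return counts


if __name__ == "__main__":
    # "data/ground-truth" is a placeholder path for this example.
    for char, count in char_histogram("data/ground-truth").most_common():
        print(f"{count}\t{char!r}")
```

Rare characters at the bottom of such a listing are often a hint at transcription errors or at symbols the start model has never seen.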

Jun 27 '21 23:06 bertsky

@bertsky, it would help me a lot if you could make separate pull requests for your commits instead of adding more and more commits to this one. That also increases the chance that the pull requests can be reviewed and merged in time.

Jun 28 '21 06:06 stweil

> it would help me a lot if you could make separate pull requests for your commits instead of adding more and more commits to this one. That also increases the chance that the pull requests can be reviewed and merged in time.

As already explained above, the commits after your first review are all necessary (like the manual merge against conflicts in upstream master) and related (except for the very last commit). They are also all trivial.

I cannot see how splitting this PR up could improve anyone's productivity.

Sep 06 '21 10:09 bertsky

@stweil this needs to be merged – please review

Jun 14 '22 13:06 bertsky

This includes essential fixes and has been hanging here for over a year for no reason. Any objections to merging?

Nov 11 '22 11:11 bertsky
