tesstrain
explicate .lstm-unicharset and my.unicharset prereqs for finetuning
(because training fails if a .unicharset has already been created previously, but for a different START_MODEL)
Wait. What if the original code did not target *.lstm-unicharset, but *.unicharset (or both)?
Not relevant: we can only fine-tune from LSTM models, not (purely) Omnifont models.
But there's an additional, previously unnoticed issue: PROTO_MODEL's combine_lang_model recipe expects to see $(DATA_DIR)/*.unicharset for every get_script_from_script_id in the unicharset table, i.e. {Common,Latin,Greek,Cyrillic,Hebrew}.unicharset, and an obscure Inherited.unicharset. But the current makefile (master and PR version) fails to provide these!
That leads to warnings like the following:
Failed to load script unicharset from:/home/kmw/nfs/gt-rücklauf/Latin.unicharset
Warning: properties incomplete for index 3 = M
Warning: properties incomplete for index 4 = A
Warning: properties incomplete for index 5 = T
Warning: properties incomplete for index 6 = I
Warning: properties incomplete for index 7 = O
Warning: properties incomplete for index 8 = ,
...
I do not know whether this is harmful, but we should try to explicate all rules necessary to put these files into $(DATA_DIR).
I have no idea how to generate these files (except extracting from their respective script models).
@stweil, your published data directories do contain such files: did you put them there by hand, or could they come from some old tesstrain_utils.sh intermediates?
Perhaps we are missing the original set_unicharset_properties rule, which enriches the generated unicharset for the model?
I copied them from https://github.com/tesseract-ocr/langdata_lstm (or used local symbolic links to a local copy of that repository). That fixes most warnings (all but Inherited.unicharset).
Oh, I see! But how could that have been forgotten in ocrd-train / tesstrain? Should we simply document this requirement, or fix this automatically by including a subrepo?
langdata_lstm is not a small repository, so I don't like the idea of having it as a subrepository.
Documenting the requirement could be a first step. Parsing the unicharset to find out which scripts are required and fetching the related files from the web if they are missing locally would be the better solution.
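The parsing step suggested above could be sketched roughly as follows. This is not code from the makefile. It assumes the space-separated layout described in the unicharset(5) man page (unichar, properties, optional comma-separated glyph metrics, script, ...), and the function name `scripts_in_unicharset` is made up for illustration.

```python
def scripts_in_unicharset(path):
    """Return the set of script names referenced by a unicharset file.

    Assumed layout per entry (space-separated):
        unichar properties [glyph_metrics] script other_case ...
    where glyph_metrics, if present, is a comma-separated list.
    """
    scripts = set()
    with open(path, encoding="utf-8") as f:
        next(f, None)  # the first line only holds the number of entries
        for line in f:
            fields = line.rstrip("\n").split(" ")
            if len(fields) < 3:
                continue  # not an entry we can interpret
            if "," in fields[2] and len(fields) > 3:
                script = fields[3]  # newer format: metrics precede the script
            else:
                script = fields[2]  # older format: no metrics column
            if script != "NULL":  # placeholder used by some special entries
                scripts.add(script)
    return scripts
```

Each returned name would then correspond to one `<script>.unicharset` file that has to be present before combine_lang_model runs.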
Agreed. But perhaps we could live without the extra effort of parsing the exact requirements, since the unicharset files themselves are quite small.
Since there's already a wget of https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt (and of tessdata_best|fast/eng.traineddata), I opt for a fully automatic solution based on downloads and will add a commit here (or in a new PR?).
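For illustration only, the download-based approach might look like the sketch below (the actual change is a set of makefile rules using wget, not Python). It assumes the script unicharsets are served from the same raw URL prefix as radical-stroke.txt. The names `plan_downloads`, `fetch_missing` and `target_dir` are hypothetical.

```python
import os
import urllib.request

# Assumption: the *.unicharset files sit next to radical-stroke.txt
# in the langdata_lstm repository.
BASE_URL = "https://github.com/tesseract-ocr/langdata_lstm/raw/master/"


def plan_downloads(target_dir, scripts):
    """Return (url, local_path) pairs for files not yet present locally."""
    names = [s + ".unicharset" for s in sorted(scripts)]
    names.append("radical-stroke.txt")
    plan = []
    for name in names:
        local_path = os.path.join(target_dir, name)
        if not os.path.exists(local_path):
            plan.append((BASE_URL + name, local_path))
    return plan


def fetch_missing(target_dir, scripts):
    """Download every planned file; existing files are left untouched."""
    os.makedirs(target_dir, exist_ok=True)
    for url, local_path in plan_downloads(target_dir, scripts):
        urllib.request.urlretrieve(url, local_path)
```

Calling `fetch_missing("data", ["Common", "Latin", "Greek", "Cyrillic", "Hebrew"])` would then only touch the network for files that are actually missing, so a manually copied or symlinked file takes precedence.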
Done. Please re-review!
Or should we place all *.unicharset and radical-stroke.txt into a subdirectory langdata to keep DATA_DIR tidy? (Would only need to change the script_dir argument ...)
Let's do this! That way, if someone already had the complete https://github.com/tesseract-ocr/langdata checked out locally, one could simply copy/symlink it here, or point the LANGDATA_DIR to the right spot. And all these *.unicharset do look quite messy lying about in DATA_DIR ...
Done. I have also updated from master to manually resolve the conflict, and added two minor improvements to the rules for all-gt / all-lstmf.
There was some additional fallout from the all-lstmf / all-gt speedups (by not repeating find): with large directories, the paste recipe would quickly run into E2BIG (because not all command-line arguments fit one memory page). This is a long-standing, nasty bug in make, for which the only workaround seems to be using make's file function, which I did manage to apply here.
Also added a new target charfreq, showing the character histogram of all .gt.txt files.
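To make clear what such a histogram contains, here is a minimal Python sketch of the idea. It is not the makefile recipe itself. The function name `char_histogram` and the choice to ignore line breaks are assumptions made for this example.

```python
from collections import Counter
from pathlib import Path


def char_histogram(gt_dir):
    """Count how often each character occurs in all *.gt.txt files below gt_dir."""
    counts = Counter()
    for path in Path(gt_dir).rglob("*.gt.txt"):
        text = path.read_text(encoding="utf-8")
        counts.update(text.replace("\n", ""))  # assumption: skip line breaks
    return counts


if __name__ == "__main__":
    import sys

    # Print characters from most to least frequent, one per line.
    for char, n in char_histogram(sys.argv[1] if len(sys.argv) > 1 else ".").most_common():
        print(n, repr(char))
```

Rare characters at the bottom of such a list are often a quick hint at transcription errors or at glyphs with too little training data.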
@bertsky, it would help me a lot if you could make separate pull requests for your commits instead of adding more and more commits to this one. That also increases the chance that the pull requests can be reviewed and merged in time.
As already explained above, the commits after your first review are all necessary (like the manual merge against conflicts in upstream master) and related (except for the very last commit). They are also all trivial.
I cannot see how splitting this PR up could improve anyone's productivity.
@stweil this needs to be merged – please review
This includes essential fixes and has been hanging here for over a year for no reason. Any objections to merging?