Source of scripts/Fraktur etc.
While the files in the top directory seem to come from the sources in the langdata repository, the source for some of the files in scripts/ is unclear:
`scripts/Fraktur.traineddata` has no matching file in langdata, nor does `scripts/Japanese.traineddata`, etc.
The Data-Files wiki article does not mention scripts/Fraktur.
This adds to the confusion around the `frk` language (not actually Frankish, but Fraktur), the Fraktur script, and the legacy model `deu_frak` in the tessdata repository.
See
https://github.com/tesseract-ocr/tessdata_fast/blob/master/README.md
https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Fraktur.langs.txt
Also see https://github.com/tesseract-ocr/tessdata/issues/65
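Incidentally, one way to inspect such a file directly is `combine_tessdata`, which ships with Tesseract. It does not answer the provenance question, but it lists the packed components and the embedded version string, which can hint at the training source. A quick sketch, assuming a checkout of this repository:

```bash
# List the components packed into the script model, together with
# their offsets, sizes, and the embedded version string.
combine_tessdata -d scripts/Fraktur.traineddata

# Unpack the components (Fraktur.lstm, Fraktur.lstm-unicharset, ...)
# into separate files for closer inspection.
combine_tessdata -u scripts/Fraktur.traineddata Fraktur.
```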
Is langdata obsolete now that langdata_lstm exists?
langdata files are appropriate for Tesseract 3 or for the legacy/base models of Tesseract 4. They can also be used for fine-tuning, which requires a smaller input training text.
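For illustration, a minimal fine-tuning sketch using the stock `combine_tessdata` and `lstmtraining` tools. The file names (`frk.traineddata`, `train_list.txt`, the `output/` directory) are placeholders: the list file would point at your own .lstmf line files, and the Wiki's training article describes the full procedure.

```bash
# Extract the LSTM network from an existing traineddata file.
# Fine-tuning must start from a float model, i.e. one from
# tessdata_best, not from the integerized tessdata_fast models.
combine_tessdata -e frk.traineddata frk.lstm

# Continue training from the extracted model; train_list.txt is a
# hypothetical list of .lstmf files made from your own line images.
lstmtraining \
  --model_output output/frk_finetuned \
  --continue_from frk.lstm \
  --traineddata frk.traineddata \
  --train_listfile train_list.txt \
  --max_iterations 400

# Pack the resulting checkpoint back into a usable traineddata file.
lstmtraining \
  --stop_training \
  --continue_from output/frk_finetuned_checkpoint \
  --traineddata frk.traineddata \
  --model_output output/frk_finetuned.traineddata
```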
As @Shreeshrii already said, langdata_lstm is for LSTM models while langdata is for legacy models. Both kinds of models are still used.
The script models are mixtures of different languages; script/Fraktur, for example, combines enm+frm+frk+ita_old+spa_old.
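In case it helps anyone reading along: assuming the file is installed under the script/ subdirectory of your tessdata directory, a combined script model is selected like a language name, with a script/ prefix.

```bash
# OCR a page with the combined Fraktur script model (installed as
# $TESSDATA_PREFIX/script/Fraktur.traineddata).
tesseract page.tif page -l script/Fraktur

# Alternatively, individual language models can be combined explicitly.
tesseract page.tif page -l frk+deu
```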
I fixed the description for 4.00 frk in the Wiki. The other Wiki issues are still open.
Fraktur Tesseract OCR is what I am looking for. I installed VietOCR v5.5.2 and Tesseract 4.1.0 on my Mac, and now I am trying to find help on how to train it better... there are too many OCR errors.
How would I go about training the software? Can anyone help?
Sadly, I am a complete beginner, and I do not even know how I managed to install the two components so far... and this training step is not explained anywhere.
Any help pointing me in the right direction would be greatly appreciated.
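Before any training, it is probably worth verifying the installation and trying the existing Fraktur models on a sample page. A quick check along these lines (scan.png stands in for one of your own scans):

```bash
# Confirm the Tesseract version and which models are installed.
tesseract --version
tesseract --list-langs

# Try the dedicated Fraktur language model and, if installed,
# the combined script model, then compare the OCR results.
tesseract scan.png out_frk -l frk
tesseract scan.png out_script -l script/Fraktur
```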
In the meantime, newer Fraktur models are available. There is a description of the training process for those models in the Wiki.
As soon as the training is finished, I'll add the results to tessdata_contrib.
@mikegerber, can we close this issue?