tessdata_best

Source of scripts/Fraktur etc.

Open mikegerber opened this issue 6 years ago • 8 comments

While the files in the top directory seem to come from the sources in the langdata repository, the source for some of the files in scripts/ is unclear:

  • scripts/Fraktur.traineddata has no matching file in langdata,
  • scripts/Japanese.traineddata has none either, and so on.
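
A check like the one described above could be sketched as follows. This is only an illustration: the function name `models_without_langdata` is hypothetical, and it assumes a local checkout of both repositories in which langdata keeps one directory per language code.

```python
from pathlib import Path


def models_without_langdata(tessdata_dir, langdata_dir):
    """List .traineddata files that have no same-named directory in langdata.

    Assumption: langdata contains one directory per language code
    (for example "deu" or "frk"); adjust if the layout differs.
    """
    tessdata = Path(tessdata_dir)
    langdata = Path(langdata_dir)
    # Names of the directories found in the langdata checkout.
    known = {p.name for p in langdata.iterdir() if p.is_dir()}
    missing = []
    # rglob also descends into subdirectories such as scripts/.
    for model in sorted(tessdata.rglob("*.traineddata")):
        if model.stem not in known:
            missing.append(model.relative_to(tessdata).as_posix())
    return missing
```

Called with the paths of the two checkouts, it returns the relative paths of the models whose origin cannot be traced to langdata this way.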

The Data-Files wiki article does not mention scripts/Fraktur.

This adds to the existing confusion between the frk language (not actually Frankish, but Fraktur), the Fraktur script model, and the legacy model deu_frak in the tessdata repository.

mikegerber avatar Jun 03 '19 11:06 mikegerber

See

https://github.com/tesseract-ocr/tessdata_fast/blob/master/README.md

https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Fraktur.langs.txt

Shreeshrii avatar Jun 03 '19 11:06 Shreeshrii

Also see https://github.com/tesseract-ocr/tessdata/issues/65

Shreeshrii avatar Jun 03 '19 11:06 Shreeshrii

Is langdata obsolete as langdata_lstm exists?

mikegerber avatar Jun 03 '19 13:06 mikegerber

The langdata files are appropriate for Tesseract 3 or for the legacy/base models used with Tesseract 4. They can also be used for fine-tuning, which requires a smaller input training text.

Shreeshrii avatar Jun 03 '19 13:06 Shreeshrii

As @Shreeshrii already said, langdata_lstm is for LSTM models while langdata is for legacy models. Both kinds of models are still used.

The script models are mixtures of different languages. script/Fraktur, for example, combines enm+frm+frk+ita_old+spa_old.
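
To make the difference concrete, here is a minimal sketch that builds a Tesseract command line either for the single script model or for a list of individual language models joined with `+`. The helper name `tesseract_command` and the file names are hypothetical. Note that loading the five language models together is not the same as using script/Fraktur, which is one model trained on the mixture.

```python
def tesseract_command(image, output_base, languages, tessdata_dir=None):
    """Build an argument list for the tesseract CLI (nothing is executed here)."""
    # Several models are passed to -l as one string joined with "+".
    cmd = ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Point Tesseract at a specific directory of .traineddata files.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


# Languages named in the comment above as the components of script/Fraktur.
FRAKTUR_COMPONENTS = ["enm", "frm", "frk", "ita_old", "spa_old"]

# One script model versus five separately loaded language models.
script_cmd = tesseract_command("page.png", "out", ["script/Fraktur"])
combined_cmd = tesseract_command("page.png", "out", FRAKTUR_COMPONENTS)
```

Either list can be passed to `subprocess.run` once the corresponding traineddata files are installed.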

I fixed the description for 4.00 frk in the Wiki. The other Wiki issues are still open.

stweil avatar Jun 13 '19 17:06 stweil

Fraktur OCR with Tesseract is what I am looking for. I installed VietOCR v5.5.2 and Tesseract 4.1.0 on my Mac, and now I am trying to find help on how to train it better, because there are too many OCR errors.

How would I go about training the software? Can anyone help?

I am a complete beginner, sadly, and I do not even know how I managed to install the two components so far. This training step is not explained anywhere.

Any pointer in the right direction would be greatly appreciated.

Akossimon avatar Oct 01 '19 20:10 Akossimon

In the meantime, newer Fraktur models have become available. There is a description of the training process for those models in the Wiki.

As soon as the training is finished, I'll add the results to tessdata_contrib.

stweil avatar Nov 11 '19 15:11 stweil

@mikegerber, can we close this issue?

stweil avatar Jan 24 '20 08:01 stweil