
Added best traineddatas for 4.00 alpha

amitdo opened this issue on Aug 1, 2017 · 22 comments

https://github.com/tesseract-ocr/tessdata/tree/3a94ddd47be0

@theraysmith, how should we present those 'best' files to our users? https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

Do you plan to push more updates to the best directory and/or to the root dir in the next few weeks?

amitdo avatar Aug 01 '17 07:08 amitdo

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is no clear winner between the two new files: in some cases -l Fraktur gives better results, in other cases -l frk is better. Even a 3.05-based Fraktur model is still better for some words, but in general the new LSTM-based models win the challenge.

Ray, it would be interesting to know the training differences of the two new Fraktur traineddata files. Did they use different fonts / training material / dictionaries?

stweil avatar Aug 01 '17 10:08 stweil

Related comment from Ray: https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-314609036

2 parallel sets of tessdata: "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU it is only slightly slower for English. It is way faster for most non-Latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.
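Once both sets are published, a quick comparison might look like this (a sketch; the repo paths and image name are illustrative):

```sh
# time the same page with each model set; 'fast' uses 8-bit integer models
time tesseract page.png out_best --tessdata-dir ./tessdata_best -l eng
time tesseract page.png out_fast --tessdata-dir ./tessdata_fast -l eng
```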

amitdo avatar Aug 01 '17 12:08 amitdo

My guess is that the upper-case traineddata files are for 'one script, multiple languages'.

amitdo avatar Aug 01 '17 13:08 amitdo

I'm currently working on the training documentation before committing more code, so as not to leave training broken for more than maybe an hour or so. Here's a quick bullet list of what's going on:

  • Initial capitals indicate the one model for all languages in that script, so e.g. Latin is all Latin-based languages except vie, which has its own Vietnamese. Most of the script models include English training data as well as the script, but not for Cyrillic, as that would have a major ambiguity problem. Devanagari is hin+san+mar+nep+eng, and Fraktur is basically a combination of all the Latin-based languages that have an 'old' variant, etc. I would be interested to hear more feedback on the script models, as Stefan already provided for Fraktur.
  • The tessdata directory doesn't have to be called tessdata any more, so I was thinking of a structure that allows maybe best, fast and legacy as separate directories or repos.
  • I noticed git complain about the size of Latin.traineddata (~100 MB), but didn't yet follow the pointer to Git LFS (large file storage).
  • The current code can run the 'best' models and the existing models, but incremental and fine-tuning training will be tied to 'best' with a future commit/push (due to a switch to ADAM and the move of the unicharset/recoder).
  • Fine tuning/incremental training will not be possible from the 'fast' models, as they are 8-bit integer. It will be possible to convert a tuned 'best' model to integer to make it faster, but some of the speed in 'fast' will come from the smaller model.
  • It will be possible to add new characters by fine tuning! I got that working yesterday, and just need to finish updating the documentation. (A command sketch of the fine-tuning flow follows this list.)
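A hedged sketch of that fine-tuning flow, pieced together from the training docs; the file names, paths, and iteration count are illustrative, not Ray's exact commands:

```sh
# extract the float (retrainable) LSTM model from a 'best' traineddata
combine_tessdata -e eng.traineddata eng.lstm

# continue training from it (the list file and iteration count are hypothetical)
lstmtraining \
  --continue_from eng.lstm \
  --traineddata eng.traineddata \
  --model_output finetuned/eng \
  --train_listfile train/eng.training_files.txt \
  --max_iterations 400
```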


theraysmith avatar Aug 01 '17 17:08 theraysmith

Ray,

Please see Devanagari feedback at https://github.com/tesseract-ocr/tessdata/issues/66 and https://github.com/tesseract-ocr/tessdata/issues/64.


Shreeshrii avatar Aug 01 '17 17:08 Shreeshrii

New traineddata files: Arabic.traineddata Armenian.traineddata Bengali.traineddata Canadian_Aboriginal.traineddata Cherokee.traineddata Cyrillic.traineddata Devanagari.traineddata Ethiopic.traineddata Fraktur.traineddata Georgian.traineddata Greek.traineddata Gujarati.traineddata Gurmukhi.traineddata HanS.traineddata HanS_vert.traineddata HanT.traineddata HanT_vert.traineddata Hangul.traineddata Hangul_vert.traineddata Hebrew.traineddata Japanese.traineddata Japanese_vert.traineddata Kannada.traineddata Khmer.traineddata Lao.traineddata Latin.traineddata Malayalam.traineddata Myanmar.traineddata Oriya.traineddata Sinhala.traineddata Syriac.traineddata Tamil.traineddata Telugu.traineddata Thaana.traineddata Thai.traineddata Tibetan.traineddata Vietnamese.traineddata bre.traineddata chi_sim_vert.traineddata chi_tra_vert.traineddata cos.traineddata div.traineddata fao.traineddata fil.traineddata fry.traineddata gla.traineddata hye.traineddata jpn_vert.traineddata kor_vert.traineddata kur_ara.traineddata ltz.traineddata mon.traineddata mri.traineddata oci.traineddata que.traineddata snd.traineddata sun.traineddata tat.traineddata ton.traineddata yor.traineddata

amitdo avatar Aug 01 '17 20:08 amitdo

It will be possible to add new characters by fine tuning!

That's great! Then I can add missing characters (like the paragraph sign § for Fraktur) myself. Thank you, Ray.

stweil avatar Aug 02 '17 04:08 stweil

Ray, issue #65 lists two regressions for Fraktur (missing §, ß/B confusion in word list).

stweil avatar Aug 02 '17 16:08 stweil

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

Since I spotted the edits to the deu/frk wordlists before overwriting them, I will put the deleted words in the bad_words lists, so my next training run will not contain them. It looks like I also need to add § to the desired_characters.
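For illustration, queuing such entries for removal might look like this (the two words are hypothetical examples of the B/ß confusion):

```sh
# append known-bad entries to frk's bad_words list, so the next
# training run drops them from the generated wordlist
cat >> langdata/frk/frk.bad_words <<'EOF'
auBer
auBerdem
EOF
```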

I have not yet committed the new wordlists, desired_characters etc, since I discovered a bug. The RTL languages have their wordlists reversed, which doesn't make sense. They should be plain text readable by someone who knows the language, and the reversal should be done before the words are converted to dawgs. I have the required change in the code already, but haven't yet run the synthetic data generation.
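The wordlist-to-DAWG conversion mentioned here is done with the wordlist2dawg tool; a minimal sketch with illustrative file names:

```sh
# compile a plain-text wordlist into a DAWG using the matching unicharset
wordlist2dawg frk.wordlist frk.lstm-word-dawg frk.lstm-unicharset
```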


theraysmith avatar Aug 03 '17 01:08 theraysmith

The new files can be installed locally in tessdata/best and used like this: tesseract ... -l best/eng. That way we can preserve the current directory structure (also once fast is added), and there is no need to rename best/eng.traineddata to best_eng.traineddata in local installations.
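A minimal sketch of that layout (the tessdata path and TESSDATA_PREFIX value vary by installation):

```sh
# put the 'best' models in a subdirectory of the existing tessdata dir
mkdir -p "$TESSDATA_PREFIX"/best   # e.g. /usr/share/tesseract-ocr/tessdata/best
cp eng.traineddata "$TESSDATA_PREFIX"/best/
tesseract image.png out -l best/eng   # the language name includes the subfolder
```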

I assume that older versions of Tesseract work with hierarchies of languages, too. That offers new possibilities: the rather lengthy list of languages could be organized in folders, for example for Latin-based languages, Indic languages, etc.

Of course tesseract --list-langs should be improved to search recursively for language files.

stweil avatar Aug 04 '17 09:08 stweil

used like this: tesseract ... -l best/eng

That is great.

I was using --tessdata-dir ../../../tessdata/best

but this is much easier :-)
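For comparison, the explicit form (with the relative path from the comment above):

```sh
tesseract image.png out --tessdata-dir ../../../tessdata/best -l eng
```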

Shreeshrii avatar Aug 04 '17 10:08 Shreeshrii

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training.

@theraysmith

The training wiki changes say that new traineddata can be built by providing wordlists. Here you mention that they are generated.

Can you explain whether user-provided wordlists override the ones in traineddata, and how that would impact recognition?

I haven't tried training with new code yet.

PS: I hope you have seen the language-specific feedback provided under issues in tessdata.

Shreeshrii avatar Aug 04 '17 10:08 Shreeshrii

https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf (page 8, "T-LSTM Training")

amitdo avatar Aug 04 '17 11:08 amitdo

http://usir.salford.ac.uk/44370/1/PID4978585.pdf (ICDAR 2017 Competition on Recognition of Early Indian Printed Documents – REID2017)

amitdo avatar Dec 15 '17 17:12 amitdo

@theraysmith commented on Aug 3, 2017

I have the required change in the code already, but haven't yet run the synthetic data generation.

I will put the deleted words in the bad_words lists, so my next run of training will not contain them.

@theraysmith @jbreiden Can you confirm that the traineddata files in the GitHub repo are the result of this improved training?

Shreeshrii avatar May 25 '18 08:05 Shreeshrii

They aren't, because they were added in July 2017 – that is, before that comment.

stweil avatar May 25 '18 08:05 stweil

What about tessdata_fast?

Initial import to github (on behalf of Ray)
Jeff Breidenbach committed on Sep 15, 2017

Shreeshrii avatar May 25 '18 09:05 Shreeshrii

tessdata_fast changed the LSTM models, but not the word lists and other components. I just looked for B/ß confusions. While deu.traineddata looks good (no B/ß confusions), frk.traineddata contains lots of them, for example auBer instead of außer. frk.traineddata also contains lots of words which are typically not printed in Fraktur: neither eBay nor PCMCIA are words I would expect in old books or newspapers.
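For anyone wanting to reproduce this check, a sketch of how the shipped wordlist can be inspected (component names may vary by model version):

```sh
# unpack the traineddata into its components, then decode the word DAWG
combine_tessdata -u frk.traineddata frk.
dawg2wordlist frk.lstm-unicharset frk.lstm-word-dawg frk.wordlist
grep 'auB' frk.wordlist   # look for B/ß confusions
```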

stweil avatar May 25 '18 16:05 stweil

@theraysmith, can you update langdata/ara?

ghost avatar Jun 11 '18 13:06 ghost

New traineddata files: Arabic.traineddata Armenian.traineddata … yor.traineddata

From where can we download these traineddata files for better accuracy?

kmprerna avatar Apr 22 '19 09:04 kmprerna

https://github.com/tesseract-ocr/tessdata_best

https://github.com/tesseract-ocr/tessdata_best/tree/master/script
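For example, clone the whole repo or fetch a single script model (assuming GitHub's standard raw-file URL scheme):

```sh
git clone https://github.com/tesseract-ocr/tessdata_best
# or fetch just one model
wget https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/master/script/Devanagari.traineddata
```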

Shreeshrii avatar Apr 22 '19 09:04 Shreeshrii

When I use this traineddata on Hindi text images, it takes a long time to extract the text and does not give 100% accurate results. How can I reduce the response time?

kmprerna avatar Apr 22 '19 11:04 kmprerna