tessdata_best

old russian / church slavonic glyphs?

Open yurytch opened this issue 7 years ago • 13 comments

Is it possible to add support for the Old Russian / Church Slavonic glyphs, at least for 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475)?

yurytch avatar Mar 29 '18 07:03 yurytch

Which language traineddata are you using currently?

Shreeshrii avatar Mar 29 '18 09:03 Shreeshrii

I'm using 'rus' from tessdata_best. I tried adding 'bul' and 'srp', to no avail. It would be great if there were an additional data file just for recognizing those glyphs, including their cursive forms (yat!). Does Tesseract work like this?

yurytch avatar Mar 29 '18 09:03 yurytch

  1. Try with 'rus' from tessdata_fast and see if that is better.

  2. Try the 'pluschar' training using 'rus' from tessdata_best as the continue_from model. Add at least 15 occurrences of each Old Russian / Church Slavonic glyph that you want to add so that they get picked up in the unicharset.

  3. Also try with script/Cyrillic (or another appropriate script model used for Russian).

  4. Please share about 150 lines of training text which has the added glyphs for testing.
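
A minimal sketch of checking point 2 before starting a training run: it counts how often each of the glyphs named in this thread occurs in a training text and reports those below a threshold. The threshold of 15 is taken from the suggestion above; the function name `underrepresented_glyphs` is hypothetical and not part of any Tesseract tool.

```python
from collections import Counter

# Code points named in this thread: yat, fita, izhitsa (capital and small).
TARGET_GLYPHS = "\u0462\u0463\u0472\u0473\u0474\u0475"


def underrepresented_glyphs(text, minimum=15):
    """Return {code point label: count} for target glyphs seen fewer than `minimum` times."""
    counts = Counter(ch for ch in text if ch in TARGET_GLYPHS)
    return {
        f"U+{ord(glyph):04X}": counts[glyph]
        for glyph in TARGET_GLYPHS
        if counts[glyph] < minimum
    }
```

Calling it on the full training text gives an empty dict once every target glyph meets the threshold; anything still listed needs more sample lines.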

Shreeshrii avatar Mar 29 '18 09:03 Shreeshrii

@Shreeshrii While I'm trying to make sense of that plus-training procedure (your point 2): your pt. 1 doesn't work (more OCR errors with 'rus' from *_fast), and I don't understand your pt. 3. 'rus' is Cyrillic anyway, and 'yat' etc. are Cyrillic, too. Regarding pt. 4: do you mean training text, as in text for inclusion in the 'rus' training dataset? But wouldn't you want images with real typeset glyphs for that, too?

yurytch avatar Mar 30 '18 05:03 yurytch

Ray has trained models for individual languages (e.g. eng, rus) and also for the scripts in which various languages are written (e.g. the Latin script for English, French, German, etc.).

My suggestion was for you to use script/Cyrillic and compare the results with rus. If the letters you want to add occur in one of the other languages covered by that script model, they might be recognised.

Re. 4: yes. Along with the training text, you also need a font that renders those glyphs correctly.

Shreeshrii avatar Mar 30 '18 05:03 Shreeshrii

Please review the following files:

https://github.com/tesseract-ocr/langdata/tree/master/rus

https://github.com/tesseract-ocr/langdata/blob/master/rus/desired_characters

https://github.com/tesseract-ocr/langdata/blob/master/Cyrillic.unicharset

Adding these glyphs will require changes in the langdata repo for rus, e.g. adding these glyphs to the desired_characters file.
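
A rough sketch for checking which of the wanted glyphs a unicharset file already lists. It assumes the documented text layout of a unicharset file: the first line holds the number of entries, and each following line starts with the character itself followed by space-separated properties. The helper names are hypothetical.

```python
def unicharset_entries(lines):
    """Collect the first field of every entry line, skipping the count on line 1."""
    rows = [line.rstrip("\n") for line in lines]
    return {row.split(" ", 1)[0] for row in rows[1:] if row}


def missing_from_unicharset(lines, wanted):
    """Return the characters in `wanted` that have no entry in the unicharset."""
    present = unicharset_entries(lines)
    return [ch for ch in wanted if ch not in present]
```

For a real file, pass an open text file handle (opened with `encoding="utf-8"`) as `lines` and the six code points from this issue as `wanted`.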

Shreeshrii avatar Mar 30 '18 11:03 Shreeshrii

Does anybody know of any progress on the subject of this issue, Old Russian support for Tesseract?

maxirmx avatar Oct 08 '20 12:10 maxirmx

@maxirmx, maybe you can contribute by reviewing the files named above?

stweil avatar Oct 08 '20 13:10 stweil

@stweil, thank you.
https://github.com/tesseract-ocr/langdata/tree/master/rus is 'modern Russian'.
I was asking about older Russian, which included three letters that were made obsolete in 1917/1918. They were mentioned at the start of this thread: 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475). I would expect additional complications as well, such as a different paragraph sign and the different fonts used at that time.

It is somewhat clear what to do, but I do not want to repeat work that others might already have done.

maxirmx avatar Oct 27 '20 16:10 maxirmx

Okay, "Ѣ" and maybe the other older glyphs are also missing in https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Cyrillic/Cyrillic.unicharset.

So you will need ground truth data to train a new model based on rus.traineddata or Cyrillic.traineddata, but with the additional glyphs. As soon as you have line images with text transcriptions, this process is supported pretty well with tesstrain.
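
As an illustration of "line images with text transcriptions": tesstrain's README describes ground truth as pairs of a line image (`.tif`, `.png`, `.bin.png` or `.nrm.png`) and a single-line transcription with the same base name and the extension `.gt.txt`. The sketch below assumes that naming convention and lists images that have no transcription yet; the function name is hypothetical.

```python
from pathlib import Path

# Longer suffixes first so "page.bin.png" is not treated as stem "page.bin".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def missing_transcriptions(ground_truth_dir):
    """Return names of line images that lack a matching .gt.txt file."""
    missing = []
    for path in sorted(Path(ground_truth_dir).iterdir()):
        for suffix in IMAGE_SUFFIXES:
            if path.name.endswith(suffix):
                stem = path.name[: -len(suffix)]
                if not (path.parent / (stem + ".gt.txt")).is_file():
                    missing.append(path.name)
                break
    return missing
```

Running it on the ground-truth folder before training gives a quick list of line images that still need to be transcribed.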

stweil avatar Oct 27 '20 16:10 stweil

See also issue https://github.com/tesseract-ocr/langdata_lstm/issues/3 which looks like a duplicate. Maybe you can join efforts.

stweil avatar Oct 27 '20 17:10 stweil

@stweil are there any requirements for the training words/text (apart from the aforementioned 150 lines)? For example, how many times should each new character occur in the training set? Should there be at least one capital and one lowercase form of each letter, or something like that?

Arbitrary texts in Old Russian can be obtained, for example, from ru.wikisource.org. One example: https://ru.wikisource.org/wiki/%D0%91%D0%BE%D0%B6%D0%B5%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BA%D0%BE%D0%BC%D0%B5%D0%B4%D0%B8%D1%8F_(%D0%94%D0%B0%D0%BD%D1%82%D0%B5;_%D0%9C%D0%B8%D0%BD)/%D0%94%D0%9E.

dvrogozh avatar Dec 25 '20 05:12 dvrogozh

I still don't understand the process well enough. But I guess I understand why a glyph can't simply be 'added' to an existing model -- because of how deep learning works, right? But retraining the complete model is rather beyond my resources, in terms of computing power and time.

Here's a thought/question: would it be useful to train a separate (small) model consisting of those missing glyphs and the glyphs that look like them, i.e. consisting of 'YAT's and 'HARD SIGN's? Then one could use it in a combination of languages: rus+yat. Would this work at all?

yurytch avatar Dec 25 '20 06:12 yurytch