German Fraktur
From https://github.com/tesseract-ocr/tesseract/issues/40
@stweil commented
Are there also new data files planned for old German (deu_frak)? I was surprised that the default English model with LSTM could recognize some words.
@theraysmith commented
I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand. With English at least, the language was different in the days of Fraktur (Ye Olde Shoppe). I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Olde Shoppe for English?
@stweil commented
Fraktur was used for an important German newspaper (Reichsanzeiger) until 1945. I'd like to try some pages from that newspaper with Tesseract LSTM. Surprisingly even with the English data Tesseract was able to recognize at least some words written in Fraktur.
There is an Old High German (similar to Old English), but the German translation of the New Testament by Martin Luther (1521) was one of the first major printed books in German, and it essentially started the modern German language (High German), which is still in use today.
@jbaiter commented
I have a decent amount of corpus data for Fraktur from scanned books at hand, about 500k lines in hOCR files (~50GB with TIF images). I've yet to publish it, but if you have somewhere where I could send/upload it, I'd be glad to.
@theraysmith commented
The md file documents the training process in tutorial detail, but line boxes and transcriptions sound perfect!
300k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first. For now it might be best to hang on for the instructions.
@jbaiter commented
The text is CC0 and the images are CC-BY-NC, so that shouldn't be an issue :-) They're going to be public anyway once I've prepped the dataset for publication.
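The "line boxes and transcriptions" Ray asks for can be pulled directly out of hOCR files like those in this corpus. A minimal sketch, assuming standard hOCR markup where each ocr_line element carries its bounding box in the title attribute (paths and file names are illustrative):

```python
import re
from pathlib import Path
from lxml import html

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

def extract_lines(hocr_path):
    """Yield (bbox, transcription) for every ocr_line in an hOCR file."""
    tree = html.parse(str(hocr_path))
    for line in tree.xpath('//*[@class="ocr_line"]'):
        match = BBOX_RE.search(line.get("title", ""))
        if not match:
            continue
        x0, y0, x1, y1 = (int(v) for v in match.groups())
        text = " ".join("".join(line.itertext()).split())
        yield (x0, y0, x1, y1), text

if __name__ == "__main__":
    for page in Path("corpus").glob("*.hocr"):
        for (x0, y0, x1, y1), text in extract_lines(page):
            # One line box plus its transcription, ready for cropping
            # the line image out of the matching TIF.
            print(page.stem, x0, y0, x1, y1, text)
```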
Related: https://github.com/tesseract-ocr/tessdata/issues/49
@jbaiter,
I suggest you upload the textual part to a GitHub repo. Add the CC0 license info and mention the source of the data (which books and/or newspapers were used, and who transcribed them).
Hopefully, Ray will use it to train a new (LSTM) deu_frak traineddata.
I found a problem with the synthetic training pipeline. The Fraktur fonts were only about 1% of the training data, even for the frk language. This will be fixed in my next training, which I hope to start this week (as I have hoped for the past 4 weeks).
I'm also going to fix the single char/single word issue that was raised as an objection to deleting the legacy engine.
There will also be major changes to the Indic training data, but I have no idea whether it will affect the accuracy, as it still doesn't work properly...
I now have a lot more training data for even the languages where before I said I didn't have much.
-- Ray.
Ray, could you please have a look at the questions I sent to the tesseract-dev forum regarding the quality of the training data and the characters used for training, ideally before you start a new training? See also issue #55 (which also applies in similar form to any other European language, even eng: all those languages currently use incomplete character sets).
I have done a legacy training using the existing deu_frak box/tiff pairs, a few box/tiff pairs from eMOP, and some synthetic tiff/box pairs generated from fonts.
The traineddata is attached. @stweil, you can check how it compares to the old deu_frak as well as to your own training experiments.
At first glance there is no clear winner. Your deu_frak.traineddata improves recognition for some characters and words, but it also produces words which exist in the German language but don't match the image. Some of my experiments with legacy training based on frk gave similar results; two of them look better. I'll continue those tests and report more precise data later.
P.S. I was at an OCR workshop in Würzburg for the last two days, where character recognition rates of up to 98 % were reported for OCR of Fraktur. All my current results are far from such precision.
produces words which exist in the German language but don't match the image.
Yes, I had noticed that with the Devanagari script and the legacy traineddata as well.
Could it be related to the dictionary/wordlists/dawgs?
Hope you are able to get improved accuracy with your training for Fraktur.
I was at an OCR workshop in Würzburg for the last two days, where character recognition rates of up to 98 % were reported for OCR of Fraktur. All my current results are far from such precision.
Lies, damned lies, and statistics
Don't believe it unless you can test it yourself on a large and diverse dataset.
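Testing it yourself is not hard: character accuracy is just one minus the character error rate, i.e. the edit distance between the OCR output and the ground truth divided by the length of the ground truth. A self-contained sketch (the sample strings below are made up for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ocr: str, truth: str) -> float:
    return edit_distance(ocr, truth) / max(len(truth), 1)

# Two wrong characters in 38 ground-truth characters: CER of about 5.3%,
# i.e. roughly 94.7% character accuracy.
print(character_error_rate("Die Zeitung erſchien täglich in Bexlin",
                           "Die Zeitung erschien täglich in Berlin"))
```

Reported rates only mean something when the test pages are disjoint from the training material and representative of the target documents.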
German American Newspapers
https://collection1.libraries.psu.edu/cdm/search/collection/frak/searchterm/newspapers
I was at an OCR workshop in Würzburg for the last two days, where character recognition rates of up to 98 % were reported for OCR of Fraktur. All my current results are far from such precision.
https://arxiv.org/pdf/1701.07395.pdf
They report a 97% character accuracy rate after training on 400 lines with ocropy.
This is training specific to one book/font. Tesseract does generalized training with many fonts.
Have you had any success running ocropus?
This is training specific to one book/font. Tesseract does generalized training with many fonts.
So does ocropy. But you can improve the results if you train for a specific book. I would use the generic trained data and build upon it.
Building upon existing trained data is currently not possible because that data does not include all needed characters, and adding characters is unsupported with LSTM.
Still, you can make generic trained data yourself with all the characters you want from a large set of digital fonts, and then fine-tune with 100-400 lines from a book/newspaper. This is relevant to both Tesseract and ocropus.
Amit, the LSTM training process for scanned images has not been defined yet.
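For reference, the workflow that was eventually documented in the TrainingTesseract 4.00 wiki has the shape Amit describes: extract the LSTM model from an existing traineddata, convert line images plus transcriptions into .lstmf files, and continue training from the extracted model. A rough sketch driving the command-line tools from Python; the base language, paths, and iteration count are illustrative, and the exact flags should be checked against the wiki for the Tesseract version in use:

```python
import subprocess
from pathlib import Path

BASE = "eng"                                        # base model to fine-tune (illustrative)
TESSDATA = Path("tessdata")                         # directory containing eng.traineddata
LINE_IMAGES = sorted(Path("train").glob("*.tif"))   # line images with matching .box files

# 1. Extract the LSTM component from the base traineddata.
subprocess.run(["combine_tessdata", "-e",
                str(TESSDATA / f"{BASE}.traineddata"), f"{BASE}.lstm"], check=True)

# 2. Build an .lstmf training file for each tif/box pair.
for tif in LINE_IMAGES:
    subprocess.run(["tesseract", str(tif), str(tif.with_suffix("")),
                    "--psm", "6", "lstm.train"], check=True)
Path("train_listfile.txt").write_text(
    "\n".join(str(t.with_suffix(".lstmf")) for t in LINE_IMAGES) + "\n")

# 3. Continue training from the extracted model on the new lines.
subprocess.run(["lstmtraining",
                "--continue_from", f"{BASE}.lstm",
                "--traineddata", str(TESSDATA / f"{BASE}.traineddata"),
                "--model_output", "finetune/base",
                "--train_listfile", "train_listfile.txt",
                "--max_iterations", "400"], check=True)
```

The resulting checkpoint is then folded back into a usable traineddata with lstmtraining --stop_training.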
https://github.com/jze/ocropus-model_fraktur
This is a character model for recognizing Fraktur font with OCRopus. With test data from a book that has not been used in the training process it yields an excellent error rate of 0.296%. It is slightly better than the 'standard' Fraktur model, which has an error rate of 0.466%.
In the case you mention, the difference is insignificant. I read some reports of a much larger difference.
After a lot of work, and a very long delay, the new training is almost ready to go. Just waiting for rendering to finish...
Fixes in this round:
- Utilizes a new crawl of the web for ~60 languages that had the least training data, and ~15 new languages that we didn't have before. This provides much more training data, with better estimates of what is in the character set and better wordlists. I've just checked over this thread, the thread on tesseract-dev, and issue 55, and all the requested missing characters will be in.
- frk, enm, frm, ita_old, and spa_old will all have much better response to Fraktur, and probably worse response to non-Fraktur. Previously there was a bug, and <1% of training images were Fraktur. Now it will be more like 75%.
- New and improved text filters for languages that use a "virama" character. The training data for all the Indic languages is thus much cleaner, but until it is trained, I have no idea of the effect on accuracy.
- Single character/grapheme and single word entries are added to the training sets, which should improve accuracy on shorter lines.
I've also added an experiment to throw all the Latin languages together into a single engine. (Actually a separate model for each of 36 scripts). If that works it will solve the problem of reading Citroen in German and picking up the e umlaut. The downside is that this model has almost 400 characters in it, despite carefully keeping out the long-tail graphics characters. Even if it does work, it will be slower, but possibly not much slower than running 2 languages. It will have about 56 languages in it. I have some optimism that this may work, ever since I discovered that the vie LSTM model gets the phototest.tif image 100% correct.
I'm also experimenting with a new osd model. It has to be replaced to eliminate the old engine.
-- Ray.
Ray, Thanks for your hard work and thanks for this update!
Thanks for the update and your work on this, Ray.
Just checking whether this new training will also address:
- Devanagari transliterated in Roman script with accents, e.g. http://www.claysanskritlibrary.org/excerpts/CSLFrontMatter.pdf
- Correct handling of superscripts, TM and other signs
- Traineddata for MICR
- Traineddata for Seven Segment (or 14 segment) Display
- Allow for whitelisting/blacklisting to ensure only numeric results.
I look forward to testing with the newer code and Indic traineddata.
ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Wed, Mar 29, 2017 at 9:32 PM, Shreeshrii [email protected] wrote:
Thanks for the update and your work on this, Ray.
Just checking whether this new training will also address:
- Devanagari transliterated in Roman script with accents eg. http://www.claysanskritlibrary.org/excerpts/CSLFrontMatter.pdf
Will probably be handled by the 'Latin' language.
- Correct handling of superscripts, TM and other signs
Beyond the scope of this change. Sub/superscript are much harder to deal with, as they have to be trained, and that means incorporating them correctly into the training path, and working out how to pass the information back out of the line recognizer to the output. At the moment it seems the iterator supports discovery of sub/super, but there is no output renderer that handles it. (Not even hOCR?) TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output. Question: for which languages/scripts is it desirable to support sub/super?
- Traineddata for MICR
Beyond the scope of this change.
- Traineddata for Seven Segment (or 14 segment) Display
Beyond the scope of this change.
- Allow for whitelisting/blacklisting to ensure only numeric results.
A simple code change not related to training.
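The variable in question is tessedit_char_whitelist. A minimal sketch passing it through pytesseract (the binding and the file name are assumptions; any API wrapper works the same way). Note that LSTM builds of this era ignored the whitelist, so the legacy engine is selected explicitly:

```python
from PIL import Image
import pytesseract

# Digits only: --psm 7 treats the image as a single text line, and --oem 0
# selects the legacy engine, since early LSTM builds ignored the whitelist.
config = "--oem 0 --psm 7 -c tessedit_char_whitelist=0123456789"
digits = pytesseract.image_to_string(Image.open("meter.png"), config=config)
print(digits.strip())
```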
-- Ray.
Ray, Thanks for your prompt response.
- I hope you have also noted that the language code frk is for Frankish, which is not the same as German Fraktur. It may be helpful to update the langdata and add deu_frak, dan_frak etc., or at least a generic frak similar to latn and deva.
- I am trying to do fine-tune training for seven segment displays using eng.traineddata as the base and training text in about 10 SSD fonts with numbers and CAPITAL letters. Is that the recommended strategy, or would replacing a layer give better results? Also, should any kind of wordlist/dictionary be included for what may be random combinations of letters and numbers?
- Regarding superscripts/subscripts etc., I can point out three cases based on the languages I know.
a. English - books, theses etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in Latin script. Usually these will be at the end of words.
b. Tamil - Sanskrit texts transliterated in the Tamil script use superscripts/subscripts 2, 3, 4 (sometimes 1 also) to distinguish between different sounds (to represent the Sanskrit alphabet, which does not have a direct mapping in the Tamil script). These can actually be in the middle of Tamil words.
c. Hindi, Sanskrit and other Indian languages - Hindi books, theses etc. use superscripts for referring to footnotes (similar to English above). The difference is that in some cases these will use the Latin digits 0-9 and in some cases Devanagari digits (in the case of Hindi, Sanskrit etc.). Unicode has superscripts 0-9 for the Latin script but not for the Devanagari script. I would suggest support for the Latin script superscript numbers.
Scanned pages with Devanagari superscripts should also be mapped to the Latin script superscript numbers. Similarly for other Indian languages.
- Regarding "TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output": is this controlled via the normalized form in the unicharset? Can different processing be applied based on the normalized form there?
thanks!
language code frk is for Frankish, which is not the same as German Fraktur
As the current data tried to implement German Fraktur, renaming frk to deu_frak might be the simplest fix for the moment.
English - books, thesis etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in Latin script. Usually this will be at end of words.
At least it applies to German. There are also superscripts after punctuation characters at the end of sentences.
Should all superscripts be handled in the same way, or do we need different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³?
See page 3 in http://sanskritdocuments.org/doc_ganesha/gaNanAyak8-ta.pdf for superscripts usage in Tamil.
Sample of subscript numbers usage in Tamil - http://srivaishnavam.com/stotras/sristuti_tamil.pdf
Please see https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
Unicode has subscripted and superscripted versions of a number of characters including a full set of Arabic numerals.
The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.
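When a recognizer does emit these dedicated code points, folding them back to plain digits in post-processing is a small step. A sketch that maps the Latin-1 superscripts, the U+2070 block, and Devanagari digits down to ASCII digits for comparison purposes (whether superscript forms should instead be kept in the output, as suggested above, is a separate decision):

```python
# Superscript digits: ¹ ² ³ live at U+00B9/U+00B2/U+00B3, the rest at U+2070-U+2079.
# Devanagari digits occupy U+0966-U+096F.
SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
DEVANAGARI = "०१२३४५६७८९"
TO_ASCII = str.maketrans(
    {src: str(i % 10) for i, src in enumerate(SUPERSCRIPTS + DEVANAGARI)})

def normalize_digits(text: str) -> str:
    """Replace superscript and Devanagari digits with plain ASCII digits."""
    return text.translate(TO_ASCII)

print(normalize_digits("Siehe Fußnote³"))   # -> Siehe Fußnote3
print(normalize_digits("टिप्पणी२ देखें"))        # -> टिप्पणी2 देखें
```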
Should all superscripts be handled in the same way, or do we need different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³?
All superscripts have a special UTF-8 code, though in different ranges. Not all fonts have support for all superscripts and subscripts.
http://www.alanwood.net/unicode/latin_1_supplement.html
http://www.alanwood.net/unicode/superscripts_and_subscripts.html
I opened a new issue for 'Superscripts & subscripts' at #62
language code frk is for Frankish, which is not the same as German Fraktur. As the current data tried to implement German Fraktur, renaming frk to deu_frak might be the simplest fix for the moment.
I agree.
I opened a new issue for 'Correct handling of TM sign' at #63
Thanks for opening the new issues 62 and 63. I will continue to think about the best approach. I tried to include TM in the current round of training, but it is too infrequent to have made the cut line. I will have to add it to the desired_characters list.

Where is frk documented as Frankish? It does NOT occur in my usual reference: https://www.loc.gov/standards/iso639-2/php/code_list.php. We have inconsistent naming for the old versions of European languages: enm, frm, ita_old, spa_old, frk. How would it suit to have a generic "Fraktur" language that covers all of these, trained with ~50% Fraktur fonts and 50% the other 4500 Latin fonts?
-- Ray.