langdata
Add vulgar fraction for 1/2
@theraysmith
Please see https://github.com/tesseract-ocr/tesseract/issues/841#issuecomment-297400009
Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:
https://cloud.githubusercontent.com/assets/1194896/25436113/477a23b6-2a60-11e7-967f-c4b97b21e3a9.png
I could not find any font which has 1/2 in this vertical format, with a straight line between the 1 and the 2.
This is not limited to English; it also applies to other Latin-based languages.
@stweil
- Do you know of any font which is similar to the image and has 1/2 in vertical format?
- Do other fractions (1/4, 3/4, 1/3, etc.) also need to be supported?
- I saw that you did not find a matching font. Nor did I in a short search, but I have that on my list now.
- I'm afraid, yes, although I assume that 1/4 occurs less often than 1/2, and other fractions are even more rare. Collecting examples of such cases is also on my list of things to be done. Maybe someone has an old book with cooking recipes - I expect we can find more fractions there than in listings of house numbers.
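To see which fraction characters could be candidates for the desired_characters file, here is a minimal sketch that lists the Unicode characters whose names start with VULGAR FRACTION and reports which of them are missing from a given text. The helper names `vulgar_fractions` and `missing_fractions` are made up for this illustration and are not part of langdata or Tesseract, and the scanned code point range is an assumption that covers the fractions known to me.

```python
import unicodedata


def vulgar_fractions():
    """Return (char, name, value) for each character named VULGAR FRACTION ...

    The range below is an assumption: it spans Latin-1 Supplement through
    Number Forms, where these characters are encoded.
    """
    found = []
    for code_point in range(0x00A0, 0x2200):
        char = chr(code_point)
        name = unicodedata.name(char, "")
        if name.startswith("VULGAR FRACTION"):
            found.append((char, name, unicodedata.numeric(char)))
    return found


def missing_fractions(desired_text):
    """Return the vulgar fraction characters that do not occur in desired_text."""
    return [char for char, _, _ in vulgar_fractions() if char not in desired_text]


if __name__ == "__main__":
    for char, name, value in vulgar_fractions():
        print(f"U+{ord(char):04X} {char} {name} = {value}")
```

Calling `missing_fractions()` with the contents of a desired_characters file would then show which fractions are not yet covered.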
Pango, which is what we use to render the images with text2image, supports MathML.
Now we only need a Tesseract which can detect formulae in images and generate hOCR with MathML for those formulae. :-)
> Now we only need a Tesseract which can detect formulae in images
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/equationdetect.h
https://github.com/tesseract-ocr/tesseract/issues/2274#issuecomment-481596368
It is possible to fine-tune Tesseract to recognize fractions. See the comment linked above.
Also, with a tool such as https://www.calligraphr.com/en/ it is possible to create a TTF font with the desired form of the characters and then use it for generating synthetic data. This should work well for Latin-script languages that do not have many ligatures or combining marks.
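Synthetic data also needs training text that actually contains the fractions. As a rough sketch of that step only, the following generates house-number-like lines that could be written to a text file and rendered with text2image using the custom font. The street names, the choice of fractions, and the function name `make_training_lines` are placeholders invented for this example, not anything from langdata.

```python
import random

# Assumed sample data for illustration only.
FRACTIONS = ["\u00bd", "\u00bc", "\u00be"]  # ½, ¼, ¾
STREETS = ["Broadway", "Bowery", "Pearl Street"]


def make_training_lines(count, seed=0):
    """Return `count` lines such as '123½ Broadway', reproducible per seed."""
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        number = rng.randint(1, 999)
        fraction = rng.choice(FRACTIONS)
        street = rng.choice(STREETS)
        lines.append(f"{number}{fraction} {street}")
    return lines


if __name__ == "__main__":
    print("\n".join(make_training_lines(5)))
```

Real training text should of course mirror the material being recognized; the fixed seed is only there so that a run can be repeated.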
A font which has the fractions with the numbers vertically above each other and a horizontal bar in between: https://www.myfonts.com/fonts/russian-fonts/rf-rostin/
https://graphicdesign.stackexchange.com/questions/71097/fractions-in-indesign-typing-not-%C2%BD-alt-0189 has a short list:
A list of some Google Fonts (all free) that you can use (thanks to @RadLexus):
- Coda – by Vernon Adams
- Telex – by Huerta Tipografica
- Arbutus Slab – by Karolina Lach
- Unica One – by Eduardo Tunni
- Concert One
- Cherry Swash – by Nataliya Kasatkina
- Economica – by Vicente Lamonaca
- Special Elite – by Astigmatic