langdata

Add vulgar fraction for 1/2

Open Shreeshrii opened this issue 7 years ago • 10 comments

@theraysmith

Please see https://github.com/tesseract-ocr/tesseract/issues/841#issuecomment-297400009

Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

https://cloud.githubusercontent.com/assets/1194896/25436113/477a23b6-2a60-11e7-967f-c4b97b21e3a9.png

I could not find any font which has 1/2 in this vertical format, with a straight line between the 1 and the 2.

Shreeshrii avatar Apr 27 '17 12:04 Shreeshrii

This is not limited to English, but also applies to other Latin-based languages.

stweil avatar Apr 27 '17 13:04 stweil

@stweil

  1. Do you know of any font which is similar to the image, i.e. one that has 1/2 in vertical format?

  2. Do other fractions (1/4, 3/4, 1/3, etc.) also need to be supported?

Shreeshrii avatar Apr 27 '17 13:04 Shreeshrii

  1. I saw that you did not find a matching font. Nor did I in a short search, but I have that on my list now.
  2. I'm afraid, yes, although I assume that 1/4 occurs less often than 1/2, and other fractions are even more rare. Collecting examples of such cases is also on my list of things to be done. Maybe someone has an old book with cooking recipes - I expect we can find more fractions there than in listings of house numbers.

stweil avatar Apr 27 '17 13:04 stweil
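To see which fractions exist as single precomposed characters and could therefore be added to a character list, the Unicode database can be queried directly. The following is only a minimal sketch using Python's standard `unicodedata` module; the function name `vulgar_fractions` is made up for illustration, and the scan is limited to the Basic Multilingual Plane as a simplifying assumption.

```python
import unicodedata


def vulgar_fractions():
    """Return (character, Unicode name, numeric value) for each precomposed
    fraction whose Unicode name starts with "VULGAR FRACTION".

    Assumption: only the Basic Multilingual Plane (U+0000..U+FFFF) is scanned.
    """
    found = []
    for code_point in range(0x10000):
        ch = chr(code_point)
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        if name.startswith("VULGAR FRACTION"):
            found.append((ch, name, unicodedata.numeric(ch, None)))
    return found


if __name__ == "__main__":
    # Print one candidate character per line, with its code point and name.
    for ch, name, value in vulgar_fractions():
        print(f"{ch}\tU+{ord(ch):04X}\t{name}\t{value}")
```

Which of these characters actually occur in the material to be recognized still has to be decided from real examples, as discussed above.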

Pango, which is what we use to render the images with text2image, supports MathML.

amitdo avatar Apr 27 '17 14:04 amitdo

Now we only need a Tesseract which can detect formulae in images and generate hOCR with MathML for those formulae. :-)

stweil avatar Apr 27 '17 14:04 stweil

> Now we only need a Tesseract which can detect formulae in images

https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/equationdetect.h

amitdo avatar Apr 27 '17 14:04 amitdo

https://github.com/tesseract-ocr/tesseract/issues/2274#issuecomment-481596368

It is possible to fine-tune Tesseract to recognize fractions. See the comment linked above.

Shreeshrii avatar Apr 10 '19 09:04 Shreeshrii

Also, with a tool such as https://www.calligraphr.com/en/ it is possible to create a TTF font with the desired form of the characters and then use it for generating synthetic data. This will work well for Latin-script-based languages that do not have many ligatures or combining marks.

Shreeshrii avatar Apr 10 '19 09:04 Shreeshrii
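As a rough sketch of the synthetic-data idea, the snippet below generates house-number style training lines containing fractions and builds (but does not run) a `text2image` command line for a custom font. Everything specific here is an assumption for illustration: the street names, the helper names `make_training_lines` and `text2image_command`, and the file and font names in the usage part. The flags should be checked against `text2image --help` for the installed version.

```python
import random

# Fractions to cover; extend this list as real examples are collected.
FRACTIONS = ["½", "¼", "¾"]
# Illustrative street names only, not taken from any real data set.
STREETS = ["Broadway", "Bowery", "Canal Street", "Pearl Street"]


def make_training_lines(count, seed=0):
    """Return `count` lines such as '123½ Broadway' (reproducible per seed)."""
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        number = rng.randint(1, 999)
        fraction = rng.choice(FRACTIONS)
        street = rng.choice(STREETS)
        lines.append(f"{number}{fraction} {street}")
    return lines


def text2image_command(text_file, output_base, font_name, fonts_dir):
    """Build an argument list for text2image; the command is not executed here."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
    ]


if __name__ == "__main__":
    # Hypothetical file and font names, chosen only for this example.
    with open("fractions.training_text", "w", encoding="utf-8") as handle:
        handle.write("\n".join(make_training_lines(100)) + "\n")
    print(" ".join(text2image_command(
        "fractions.training_text", "eng.fractions.exp0",
        "My Fractions", "./fonts")))
```

The font passed to `--font` must actually contain the fraction glyphs in the desired stacked form, otherwise the rendered images will not match the scanned originals.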

A font which has fractions with the numbers stacked vertically and a horizontal bar in between: https://www.myfonts.com/fonts/russian-fonts/rf-rostin/

Shreeshrii avatar Apr 10 '19 09:04 Shreeshrii

https://graphicdesign.stackexchange.com/questions/71097/fractions-in-indesign-typing-not-%C2%BD-alt-0189 has a short list:

A list of some Google Fonts (all free) that you can use (thanks to @RadLexus):

- Coda – by Vernon Adams
- Telex – by Huerta Tipografica
- Arbutus Slab – by Karolina Lach
- Unica One – by Eduardo Tunni
- Concert One
- Cherry Swash – by Nataliya Kasatkina
- Economica – by Vicente Lamonaca
- Special Elite – by Astigmatic

Shreeshrii avatar Apr 10 '19 10:04 Shreeshrii