Recognising mathematics
Your Feature Request
Proposals for recognising mathematics are often far too elaborate, influenced by LaTeX. The goal should be to recognise what could be printed in journals in the old days of typesetting, which was quite simple. Here is my description of this:
https://mathoverflow.net/q/178721
which contains references to accounts of mathematical typography from the 1960s.
There are now huge numbers of old journal papers that have been scanned. It would be good if tesseract could read them and generate basic LaTeX.
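To make "basic LaTeX" concrete, here is a rough illustration of the kind of output that might be enough. It is only a guess at the intended scope, not a specification taken from the linked description: it assumes the target is limited to inline symbols, subscripts and superscripts, built-up fractions, and a displayed equation.

```latex
\documentclass{article}
\begin{document}
% Hypothetical OCR output for a short passage of an old paper.
Let $f(x) = a_0 + a_1 x + a_2 x^2$. Then
\[
  \int_0^1 f(x)\,dx = a_0 + \frac{a_1}{2} + \frac{a_2}{3}.
\]
\end{document}
```

Nothing here needs custom macros or packages, which is the point: a recogniser could restrict itself to a small, fixed set of constructs.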
Before WW2, maths journals were far more often published in languages other than English, especially German. A good digital version in the original language would make it possible to obtain translations.
There is one technological leader in math OCR: Mathpix, a commercial service. They use the same technology as Tesseract, CTC/LSTM, but the task is a very specialised one.
Tesseract specialises in text arranged in lines. It can't even classify ROIs (regions of interest) such as drawings, tables, music notation, math, or pictures.
A note about the typography: fitting metal pieces together like a puzzle was an established method, e.g. for typesetting liturgical Hebrew with all the "accents", at least in the early 18th century (Gessner, ~1744, describes it).
In the time of Galileo and Albrecht Dürer, printers used woodcuts and had no special formula notation.
An early (1863) sample of a 2-D formula: https://archive.org/details/bub_gb_fDMAAAAAQAAJ/page/120/mode/1up
Modern notation:
1908: http://resolver.sub.uni-goettingen.de/purl?PPN243919689_0134
1913: http://resolver.sub.uni-goettingen.de/purl?PPN37721857X_0022
Don't expect math recognition in Tesseract to improve in the foreseeable future.
For mathematical expressions, I propose using Nougat (Neural Optical Understanding for Academic Documents), a ViT model that performs an OCR task for processing scientific documents into a markup language.
Paper: https://arxiv.org/abs/2308.13418
Code: https://github.com/facebookresearch/nougat
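For anyone who wants to try this, here is a minimal sketch of calling Nougat from Python. It assumes the `nougat` command-line tool is installed (for example via `pip install nougat-ocr`) and uses the `nougat <pdf> -o <output_dir>` invocation from the project README. The helper names `build_nougat_command` and `run_nougat` and the file name `scanned_paper.pdf` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_nougat_command(pdf_path: str, out_dir: str) -> list[str]:
    """Return the argument list for: nougat <pdf> -o <out_dir>."""
    return ["nougat", str(pdf_path), "-o", str(out_dir)]


def run_nougat(pdf_path: str, out_dir: str) -> Path:
    """Run Nougat on one PDF and return the output directory.

    Raises FileNotFoundError if the `nougat` executable is not on PATH.
    """
    if shutil.which("nougat") is None:
        raise FileNotFoundError("nougat not found; try `pip install nougat-ocr`")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if Nougat exits with an error.
    subprocess.run(build_nougat_command(pdf_path, out_dir), check=True)
    return Path(out_dir)


if __name__ == "__main__":
    # "scanned_paper.pdf" is a placeholder file name (assumption).
    print(build_nougat_command("scanned_paper.pdf", "nougat_out"))
```

According to the README, the result for each PDF is a `.mmd` file in the output directory, a lightweight markup that embeds LaTeX for the formulas. How well it copes with scans of old journals would have to be tested.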
I don't think that we should keep this feature request open.
Any feature request that has close to zero chance to be implemented by the current team in the next 5 years should be closed.