
Clarify language support quality status

Open eyalroz opened this issue 3 years ago • 6 comments

The README.md says Tesseract "supports over 100 languages out of the box". But which languages? And how good is the out-of-the-box support known to be for each of them?

It would be helpful if a separate file (or wiki page) detailed this information to the extent possible.

eyalroz • May 21 '22 07:05

See https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html. All work on Tesseract is currently done by volunteers, so you are invited to find the answers to your questions and document them.

stweil • May 21 '22 07:05

@stweil: Can you linkify the "100 languages" sentence in the README.md so that it points to that page?

eyalroz • May 21 '22 09:05

@eyalroz I went ahead and proposed the change in the tesseract repo: https://github.com/tesseract-ocr/tesseract/pull/4027

I also think it would be very helpful, even though the list itself has no information on languages in v5 yet.

tooomm • Mar 05 '23 14:03

> Even though the list itself has no information on languages in v5 yet.

There was no update for v5. All the v4 data files should work with Tesseract 5.x.

amitdo • Mar 09 '23 10:03

> There was no update for v5. All the v4 data files should work with Tesseract 5.x.

That's at least not obvious from the table.

True, the information can be found in other parts of the docs, but users can easily miss it: "Language model traineddata files same as listed above for version 4.0.0 can be used with Tesseract 5.x.x."

tooomm • Mar 09 '23 19:03

https://github.com/tesseract-ocr/docs/blob/main/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf

https://arxiv.org/pdf/2202.13274.pdf

amitdo • Sep 07 '23 13:09