tessdata_best Tatar file (tat.traineddata) does not appear to be for the standard Cyrillic script

Open CRCulver opened this issue 5 months ago • 0 comments

The tessdata collection includes a file tat.traineddata. Although several Turkic languages spoken across Eurasia are referred to as “Tatar”, the ISO 639-2 code tat is generally used to refer to the Kazan Tatar language, spoken in and around the Republic of Tatarstan in Russia. The Kazan Tatar standard language has used a Cyrillic alphabet since 1939, and this is the script in which the many books written in Tatar since then have been printed.

However, tat.traineddata does not actually work on Tatar in this standard Cyrillic script. Running Tesseract with the language set to tat (-l tat) on a post-1939 book from the Soviet Union or Russia fails to recognize the script as Cyrillic, and instead outputs gibberish in the Latin alphabet. Attached is a page from a collection of Tatar texts (Galieva, Татар теленнән текстлар, Kazan, 2010) as an example.

tatar-sample.pdf
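
To make the "Latin gibberish" observation easier to check on other pages, here is a minimal Python sketch that measures how much of an OCR result is actually Cyrillic. It is not part of Tesseract; the helper names `script_counts` and `cyrillic_share` are made up for illustration, and the approach (reading the script from the first word of each character's Unicode name) is a rough heuristic, not a full script detector.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER A".
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


def cyrillic_share(text: str) -> float:
    """Fraction of alphabetic characters that are Cyrillic (0.0 if none)."""
    counts = script_counts(text)
    total = sum(counts.values())
    return counts["CYRILLIC"] / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical usage: paste the text Tesseract produced for a page.
    ocr_output = "Татар"
    print(script_counts(ocr_output))
    print(cyrillic_share(ocr_output))
```

For a page of Cyrillic Tatar, a working model should give a share close to 1.0; a share near 0.0 with a large `LATIN` count would match the behaviour reported here.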

The provenance of this trained data file, and the exact language and script it was trained for, should be clarified and the file should be renamed to something more specific.

Jun 20 '25 17:06 CRCulver
