Missing encodings: Tatar

Open rohkea opened this issue 4 years ago • 3 comments

Two encodings missing from the list of available encodings are the Tatar encodings specified by the Resolution of the Cabinet of Ministers of the Republic of Tatarstan of 9 December 1996, No. 1026, "On the standards of the encoding symbols of Tatar alphabet for computer applications".

The document specifies an ‘ASCII’ version (for DOS and console applications; compatible with cp866) and ‘ANSI’ version (for Windows applications; compatible with cp1251).

I don't think there are established codes for these encodings. I would suggest something along the lines of cp1251-tatar and cp866-tatar.


‘ANSI’ version (variant of cp1251)

The scans of the actual resolution are hard to read, but the ANSI version is described in Russian Wikipedia's article on cp1251, under the subheading «Татарский вариант» ("Tatar variant"). Here are the positions that differ from cp1251:

0x80 = Ә (U+04D8)
0x8A = Ө (U+04E8)
0x8C = Ү (U+04AE)
0x8D = Җ (U+0496)
0x8E = Ң (U+04A2)
0x8F = Һ (U+04BA)
0x90 = ә (U+04D9)
0x9A = ө (U+04E9)
0x9C = ү (U+04AF)
0x9D = җ (U+0497)
0x9E = ң (U+04A3)
0x9F = һ (U+04BB)
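
The list above is enough to derive a decoder from an existing cp1251 table. Here is a minimal Python sketch, assuming Python's built-in `cp1251` codec as the base; the names `TATAR_ANSI_OVERRIDES`, `build_decoding_table` and `decode_tatar_ansi` are made up for illustration and are not part of any library:

```python
import codecs

# Bytes where the Tatar 'ANSI' variant differs from cp1251 (values from the list above).
TATAR_ANSI_OVERRIDES = {
    0x80: "\u04D8", 0x8A: "\u04E8", 0x8C: "\u04AE",
    0x8D: "\u0496", 0x8E: "\u04A2", 0x8F: "\u04BA",
    0x90: "\u04D9", 0x9A: "\u04E9", 0x9C: "\u04AF",
    0x9D: "\u0497", 0x9E: "\u04A3", 0x9F: "\u04BB",
}


def build_decoding_table(base_codec, overrides):
    """Return a 256-character table: the base codec with some bytes overridden."""
    chars = []
    for byte in range(256):
        if byte in overrides:
            chars.append(overrides[byte])
            continue
        try:
            chars.append(bytes([byte]).decode(base_codec))
        except UnicodeDecodeError:
            # U+FFFE marks a byte as undefined for charmap_decode.
            chars.append("\ufffe")
    return "".join(chars)


TATAR_ANSI_TABLE = build_decoding_table("cp1251", TATAR_ANSI_OVERRIDES)


def decode_tatar_ansi(data, errors="strict"):
    """Decode bytes in the cp1251-based Tatar variant to a str."""
    return codecs.charmap_decode(data, errors, TATAR_ANSI_TABLE)[0]


print(decode_tatar_ansi(b"\xf5\x90\xe7\xe5\xf0"))  # хәзер
```

Every byte outside the twelve overrides decodes exactly as in cp1251, so ordinary Russian text is unaffected.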

This encoding can still be encountered on the internet. For example, search Google for хђзер (for хәзер 'now'), кирђк (for кирәк 'needed') or мљмкин (for мөмкин 'possible') to find some example usage. Some fonts for this encoding can be found here: https://kashapovnail.ucoz.ru/load/1-1-0-1 (they usually replace South Slavic letters with Tatar letters, so they're not really Unicode-compatible; this was necessary because old Windows versions offered no other way to input text in this encoding).
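
Spellings like хђзер appear when Tatar text in this encoding is read as plain cp1251. As a rough sketch (not a robust tool), such text can be repaired by swapping each cp1251 letter at an affected byte for the Tatar letter assigned to that byte. The names `OVERRIDES`, `REPAIR` and `repair_tatar` are hypothetical, and the sketch assumes the input contains no genuine Serbian or Macedonian letters, which it would also rewrite:

```python
# Byte -> Tatar letter in the cp1251-based Tatar variant.
OVERRIDES = {
    0x80: "\u04D8", 0x8A: "\u04E8", 0x8C: "\u04AE",
    0x8D: "\u0496", 0x8E: "\u04A2", 0x8F: "\u04BA",
    0x90: "\u04D9", 0x9A: "\u04E9", 0x9C: "\u04AF",
    0x9D: "\u0497", 0x9E: "\u04A3", 0x9F: "\u04BB",
}

# Map the letter plain cp1251 shows for each byte to the intended Tatar letter.
REPAIR = {ord(bytes([byte]).decode("cp1251")): letter
          for byte, letter in OVERRIDES.items()}


def repair_tatar(text):
    """Replace cp1251 stand-in letters with the Tatar letters they encode."""
    return text.translate(REPAIR)


for word in ("хђзер", "кирђк", "мљмкин"):
    print(word, "->", repair_tatar(word))
```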


‘ASCII’ version (variant of cp866)

I don’t know how widely this encoding is used. But the document provides information about it, so it can be implemented:

[image]

It's basically cp866 with the following changes (0xF0 and 0xF1 might not count as changes, depending on what you consider the 'basic variant' of cp866):

0xF0 = Ё (U+0401)
0xF1 = ё (U+0451)
0xF2 = Ә (U+04D8)
0xF3 = Ө (U+04E8)
0xF4 = Ү (U+04AE)
0xF5 = Җ (U+0496)
0xF6 = Ң (U+04A2)
0xF7 = Һ (U+04BA)
0xF8 = ә (U+04D9)
0xF9 = ө (U+04E9)
0xFA = ү (U+04AF)
0xFB = җ (U+0497)
0xFC = ң (U+04A3)
0xFD = һ (U+04BB)
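
For comparison, here is a small Python sketch of an encoder and decoder for this variant, built on Python's `cp866` codec. The helper names (`TATAR_ASCII_OVERRIDES`, `encode_tatar_ascii`, `decode_tatar_ascii`) are illustrative only. Against Python's cp866 table, 0xF0 and 0xF1 already map to Ё and ё, so only 0xF2 through 0xFD actually change:

```python
import codecs

# Positions listed above for the cp866-based Tatar variant.
TATAR_ASCII_OVERRIDES = {
    0xF0: "\u0401", 0xF1: "\u0451",
    0xF2: "\u04D8", 0xF3: "\u04E8", 0xF4: "\u04AE",
    0xF5: "\u0496", 0xF6: "\u04A2", 0xF7: "\u04BA",
    0xF8: "\u04D9", 0xF9: "\u04E9", 0xFA: "\u04AF",
    0xFB: "\u0497", 0xFC: "\u04A3", 0xFD: "\u04BB",
}

# Start from cp866 (which defines all 256 bytes) and apply the overrides.
TATAR_ASCII_TABLE = "".join(
    TATAR_ASCII_OVERRIDES.get(byte, bytes([byte]).decode("cp866"))
    for byte in range(256)
)
TATAR_ASCII_ENCODING_MAP = codecs.charmap_build(TATAR_ASCII_TABLE)


def decode_tatar_ascii(data, errors="strict"):
    """Decode bytes in the cp866-based Tatar variant to a str."""
    return codecs.charmap_decode(data, errors, TATAR_ASCII_TABLE)[0]


def encode_tatar_ascii(text, errors="strict"):
    """Encode a str to bytes in the cp866-based Tatar variant."""
    return codecs.charmap_encode(text, errors, TATAR_ASCII_ENCODING_MAP)[0]


# Bytes that differ from Python's cp866 table.
changed = [byte for byte in range(256)
           if TATAR_ASCII_TABLE[byte] != bytes([byte]).decode("cp866")]
print([hex(byte) for byte in changed])

encoded = encode_tatar_ascii("мөмкин")
print(encoded.hex(), decode_tatar_ascii(encoded))
```

The Tatar letters take over the slots cp866 uses for the Ukrainian and Belarusian letters and a few symbols such as № and √, so text relying on those cannot be represented in this variant.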

rohkea, Jul 03 '21 13:07

This is pretty cool! I'll add it to the list in the documentation, and maybe someone can attempt to serialize such an encoding!

ThePhD, Jul 03 '21 14:07

One final question: do you know if there are any documents lying around encoded with this data already that I can use? If not I'll just mock something up, but it would be nice to have some known base files to work with so I can ensure I'm getting it right!

ThePhD, Jul 22 '21 16:07

For example, here is such text: http://www.mtss.ru/course/lesson_05.html

rohkea, Jul 22 '21 17:07

WOW this took me forever BUT it is finished and you can now use both the ANSI and ASCII variants in code with ztd::text::tatar_ascii and ztd::text::tatar_ansi!!

Thanks for the issue request!!

ThePhD, May 11 '23 21:05