
Support Latin <-> Cyrillic transliteration and Latin digraphs for Serbian

Open ivankokan opened this issue 3 years ago • 12 comments

We do support multiple scripts (with the same script for input and output) via the script option. We do not have a case yet where we support transliteration, though.

Wikipedia tells me that three scripts are common in different regions: Arabic, Latin, and Cyrillic. Given this, a script option would make sense.

Originally posted by @jspitz in https://github.com/reutenauer/polyglossia/issues/482#issuecomment-803815864

I am not sure whether the comment above means that transliteration is being considered for polyglossia's future...

Here are the bidirectional Unicode mappings for Serbian to start with.

serbian_cyrillic-latin_transliteration.xlsx

Note:

  • To be precise, there are no Cyrillic digraphs in Serbian (Љ, љ, Њ and њ can be considered as digraph-like letter pairs merged into single characters).
  • On top of that, there are no Cyrillic Title case variants.
  • The same mechanism as for Croatian should be used for Latin digraphs (checks, fallbacks to separate characters, options, and shorthands - at least for digraphs). See #216.
  • Cyrillic Serbian "digraphs" are widely used and available within Cyrillic fonts (even within T2A) and keyboard layouts, so no checks or fallbacks to separate characters need to be implemented.
  • Mappings are almost completely bijective, except for the three mappings where the Latin Title case digraphs must be mapped to Cyrillic Upper case characters (there is no Title case for Cyrillic at all).
  • There are no Latin digraphs or Cyrillic "digraphs" present in gloss-serbian.ldf, so there is nothing to take care of there.

Some good examples to eventually test with:

  • аАбБвВгГдДђЂжЖћЋчЧшШ <-> aAbBvVgGdDđĐžŽćĆčČšŠ
  • љЉњЊџЏ -> ǉǇǌǊǆǄ (Latin digraphs if disableligatures is false)
  • љЉњЊџЏ -> ljLJnjNJdžDŽ (separate characters if the font is missing Latin digraphs or disableligatures is true)
  • ljLJnjNJdžDŽ (separate characters) -> лјЛЈнјНЈджДЖ (separate characters)
  • "lj"Lj"LJ"nj"Nj"NJ"dž"Dž"DŽ (shorthands with separate characters) -> љЉЉњЊЊџЏЏ
  • ǉǈǇǌǋǊǆǅǄ (Latin digraphs) -> љЉЉњЊЊџЏЏ.
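
To make the mappings above concrete, here is a minimal Python sketch of both directions. It is only an illustration under my own assumptions: the names `cyr_to_lat`, `lat_to_cyr` and `disable_ligatures` are hypothetical, nothing here is polyglossia code, and shorthands and case context are not handled. Uppercase Љ Њ Џ always go to uppercase Latin, and plain `lj` typed as two separate letters is transliterated letter by letter, as in the examples above.

```python
# Illustrative sketch only; all names are hypothetical, not polyglossia code.

CYR = "абвгдђежзијклмнопрстћуфхцчш"
LAT = "abvgdđežzijklmnoprstćufhcčš"

# One-to-one letters; uppercase pairs are derived from the lowercase ones.
SIMPLE = dict(zip(CYR, LAT))
SIMPLE.update({c.upper(): l.upper() for c, l in zip(CYR, LAT)})

# Cyrillic letter -> (separate Latin characters, single Unicode digraph character)
DIGRAPHS = {
    "љ": ("lj", "\u01C9"), "Љ": ("LJ", "\u01C7"),
    "њ": ("nj", "\u01CC"), "Њ": ("NJ", "\u01CA"),
    "џ": ("dž", "\u01C6"), "Џ": ("DŽ", "\u01C4"),
}


def cyr_to_lat(text, disable_ligatures=False):
    """Cyrillic -> Latin, without any case-context rules."""
    out = []
    for ch in text:
        if ch in DIGRAPHS:
            separate, ligature = DIGRAPHS[ch]
            out.append(separate if disable_ligatures else ligature)
        else:
            out.append(SIMPLE.get(ch, ch))  # unknown characters pass through
    return "".join(out)


# Reverse table: single letters plus the Unicode digraph characters.
TO_CYR = {lat: cyr for cyr, lat in SIMPLE.items()}
for cyr, (_, ligature) in DIGRAPHS.items():
    TO_CYR[ligature] = cyr
# Titlecase digraphs have no Cyrillic counterpart, so they collapse onto
# the uppercase letters; this is where the mapping stops being bijective.
TO_CYR.update({"\u01C8": "Љ", "\u01CB": "Њ", "\u01C5": "Џ"})


def lat_to_cyr(text):
    """Latin -> Cyrillic, character by character."""
    return "".join(TO_CYR.get(ch, ch) for ch in text)
```

Because the reverse table is keyed on single characters, only the Unicode digraph characters end up as љ, њ, џ; a real implementation would still need the shorthands for input typed as separate letters.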

ivankokan avatar Mar 23 '21 00:03 ivankokan

I am not sure whether the comment above means that transliteration is being considered for polyglossia's future...

Many things are possible if someone steps up and does the implementation.

jspitz avatar Mar 23 '21 12:03 jspitz

@yannis1962 has prepared map files based on my contribution here. We'll see what happens next...

ivankokan avatar Mar 23 '21 12:03 ivankokan

I have prepared map files for Latin->Cyrillic and Cyrillic->Latin in the case of Serbian.

The only flaw I see is that when I have Љ Њ Џ as input, I can send them either to LJ NJ DŽ (uppercase) or to Lj Nj Dž (titlecase). I added a context rule so that Љ Њ Џ followed by a lowercase letter is always sent to titlecase, and otherwise to uppercase.

I need confirmation by native speakers that this is a good choice.

For example, what happens when somebody has a given name starting with Љ? When I transliterate the initial “Љ.” into Latin I will get “LJ.” which is obviously bad, but is the correct way to write the initial in that case “Lj.” or rather “L.” ?

Maybe I should implement another rule saying that when Љ is not preceded by a capital letter and followed by a period, it should be titlecase?

I need help from native speakers…

I'm including the MAP and TEC files, as well as two test files with the UDHR in Serbian (converted from Latin to Cyrillic and from Cyrillic to Latin) in TeX and PDF format. You will need to use some other font if you run them (XeTeX only).

Archive.zip

yannis1962 avatar Mar 23 '21 13:03 yannis1962

The only flaw I see is that when I have Љ Њ Џ as input, I can send them either to LJ NJ DŽ (uppercase) or to Lj Nj Dž (titlecase). I added a context rule so that Љ Њ Џ followed by a lowercase letter is always sent to titlecase, and otherwise to uppercase.

I need confirmation by native speakers that this is a good choice.

I am not a native speaker/writer but it looks OK.

For example, what happens when somebody has a given name starting with Љ? When I transliterate the initial “Љ.” into Latin I will get “LJ.” which is obviously bad, but is the correct way to write the initial in that case “Lj.” or rather “L.” ?

Here I can contribute with the explicit rule from Правопис српскога језика, Матица српска, 1994 (second edition): https://sr.wikipedia.org/sr-el/%D0%9F%D1%80%D0%B0%D0%B2%D0%BE%D0%BF%D0%B8%D1%81_%D1%81%D1%80%D0%BF%D1%81%D0%BA%D0%BE%D0%B3%D0%B0_%D1%98%D0%B5%D0%B7%D0%B8%D0%BA%D0%B0 https://gimnazijadg.files.wordpress.com/2012/03/pravopis-srpskoga-jezika.pdf

Free translation: Latin digraphs used as starting letters in a sentence, a given name, or an abbreviation must be written as given in Table 8: Dž, Lj, Nj; but as DŽ, LJ, NJ in fully uppercase context (to emphasize).

I guess there are no changes in newer editions.

Maybe I should implement another rule saying that when Љ is not preceded by a capital letter and followed by a period, it should be titlecase?

I need help from native speakers…

Definitely, let us wait until then...

ivankokan avatar Mar 23 '21 13:03 ivankokan

As I suspected. So that raises the question: how do I force the transcription into titlecase?

How about using a LaTeX macro \titlecase{Љ} to be sure you will get a titlecase, no matter what follows?

yannis1962 avatar Mar 23 '21 13:03 yannis1962

As I suspected. So that raises the question: how do I force the transcription into titlecase? How about using a LaTeX macro \titlecase{Љ} to be sure you will get a titlecase, no matter what follows?

"Smart ways": transliterate to titlecase if it is followed by something lowercase (starting a sentence) or a period (initials/abbreviations). This would obviously fail with a sentence having simply "Љ" as its first word.

I think that macro is inevitable in any case, hence no "smart way" needs to be implemented.

ivankokan avatar Mar 23 '21 13:03 ivankokan

"Smart ways": transliterate to titlecase if it is followed by something lowercase (starting a sentence) or a period (initials/abbreviations). This would obviously fail with a sentence having simply "Љ" as its first word.

What I have done is:

  1. titlecase if followed by lowercase

  2. uppercase if preceded by uppercase

  3. titlecase if not (2) and followed by period

These three rules should cover most of the cases…
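
As a rough illustration of the three rules above, here is a small Python sketch. It is not the TECkit map itself: the function name `digraph_case` is hypothetical, and the order in which the rules are tried (1, then 2, then 3, then uppercase as the fallback) is my reading of the list, not something taken from the map files.

```python
# Hypothetical sketch of the three context rules; not the actual map file.

UPPER = {"Љ": "LJ", "Њ": "NJ", "Џ": "DŽ"}
TITLE = {"Љ": "Lj", "Њ": "Nj", "Џ": "Dž"}


def digraph_case(text, i):
    """Return the Latin form for the uppercase Cyrillic letter at text[i]."""
    letter = text[i]
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""

    if nxt.islower():       # rule 1: followed by lowercase -> titlecase
        return TITLE[letter]
    if prev.isupper():      # rule 2: preceded by uppercase -> uppercase
        return UPPER[letter]
    if nxt == ".":          # rule 3: not (2) and followed by a period
        return TITLE[letter]
    return UPPER[letter]    # assumed fallback: uppercase
```

With these rules a lone "Љ" still falls through to the uppercase fallback, which is the case raised earlier in the thread where an explicit macro would be needed.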

yannis1962 avatar Mar 23 '21 14:03 yannis1962

Here are the files with the three smart rules mentioned in the previous message

Archive.zip

yannis1962 avatar Mar 23 '21 14:03 yannis1962

I have been in contact with Uroš Stefanović (https://ctan.org/author/stefanovic) meanwhile. It seems we are getting somewhere with this implementation.

Let me just summarize what we currently have:

  • map files prepared by Yannis Haralambous (@yannis1962), XeTeX only, including three smart rules on how to transliterate from Cyrillic uppercase to Latin:
    1. titlecase if followed by lowercase
    2. uppercase if preceded by uppercase
    3. titlecase if not 2. and followed by a period
  • enriched set of small test examples (spaces are added so that rules 1.-3. do not produce wrong transliterations):
    • а А б Б в В г Г д Д ђ Ђ ж Ж ћ Ћ ч Ч ш Ш <-> a A b B v V g G d D đ Đ ž Ž ć Ć č Č š Š
    • љ Љ њ Њ џ Џ -> ǉ Ǉ ǌ Ǌ ǆ Ǆ (Latin digraphs if disableligatures is false)
    • љ Љ њ Њ џ Џ -> lj LJ nj NJ dž DŽ (separate characters if the font is missing Latin digraphs or disableligatures is true)
    • lj Lj LJ nj Nj NJ dž Dž DŽ (separate characters) -> лј Лј ЛЈ нј Нј НЈ дж Дж ДЖ (separate characters)
    • "lj "Lj "LJ "nj "Nj "NJ "dž "Dž "DŽ (shorthands with separate characters) -> љ Љ Љ њ Њ Њ џ Џ Џ
    • ǉ ǈ Ǉ ǌ ǋ Ǌ ǆ ǅ Ǆ (Latin digraphs) -> љ Љ Љ њ Њ Њ џ Џ Џ
  • more test examples to test smart rules (each one in two variants depending on disableligatures):
    • ЉУДИ -> LJUDI / LJUDI (no rule applies)
    • Љубљана -> Ljubljana / Ljubljana (rule 1.)
    • КОЊ -> KONJ / KONJ (rule 2.)
    • Џ. Костанза -> Dž. Kostanza / Dž. Kostanza (rule 3.)
    • one wants Џ. КОСТАНЗА -> DŽ. KOSTANZA / DŽ. KOSTANZA (rule 3. would be wrongly applied, producing Dž / Dž; one would need to use something like \uppercase{Џ})
    • one wants Љ -> Lj / Lj (no rule applies, wrongly producing LJ / LJ; one would need to use something like \titlecase{Љ})
    • ADDED: one wants Џ. К О С Т А Н З А -> D Ž. K O S T A N Z A (rule 3. would be wrongly applied, producing Dž / Dž; one would need to use something like \uppercase[separate]{Џ})
    • ADDED: one wants Љ У Б Љ А Н А -> L J U B L J A N A (no rule applies, wrongly producing LJ U B LJ A N A / LJ U B LJ A N A; one would need to use something like \uppercase[separate]{Љ})
    • ADDED: one wants Љ у б љ а н а -> L j u b l j a n a (no rule applies, wrongly producing LJ u b lj a n a / LJ u b lj a n a; one would need to use something like \titlecase[separate]{Љ} and \lowercase[separate]{љ})

TODO:

  • integrate Yannis' map files
  • Yannis Haralambous (@yannis1962) should eventually be acknowledged as a contributor in the manual
  • LuaTeX transliteration support - can someone provide references on how to achieve the same?
  • take over all serbian/serbianc babelshorthands
  • add digraphs ligatures shorthands (like in Croatian, be careful with "D and "d as such babelshorthands already exist for Đ/đ)
  • add support for explicit uppercase -> uppercase / titlecase transliteration in Cyrillic -> Latin direction

I guess that's all.

ivankokan avatar Apr 12 '21 22:04 ivankokan

As for LuaTeX: Look at how ArabLuaTeX does it.

jspitz avatar Apr 13 '21 06:04 jspitz

More specifically: https://tex.stackexchange.com/questions/285610/

jspitz avatar Apr 17 '21 14:04 jspitz

I have found two additional rules in Правопис српскога језика, Матица српска, 2010 (revised and expanded, fourth edition): https://jelenaradomir.files.wordpress.com/2016/08/pravopis-ms_2010.pdf

При размакнутом (спационираном) писању сва слова се једнако раздвајају (L j u b l j a n a а не Lj u b lj a n a). Ако се натписи (нпр. MENJAČNICA) пишу одозго надоле, NJ, LJ односно DŽ не треба да остану састављени, него друго слово долази испод првог.

Google Translate (a bit improved): With an increased letter spacing (separated characters), all glyphs are equally separated (L j u b l j a n a, not Lj u b lj a n a). If the inscriptions (e.g. MENJAČNICA) are written from top to bottom, NJ, LJ or DŽ should not remain composed, but the second letter comes below the first instead.

I would say that the first rule is feasible by providing an optional argument separate to the future macros \uppercase{Љ} and \titlecase{Љ}. (I edited my previous comment that summarizes everything: https://github.com/reutenauer/polyglossia/issues/483#issuecomment-818276717.)

The second rule is way off polyglossia's scope.
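
For the first rule (spaced-out writing), the idea can be sketched in a few lines of Python. This is only a plain-string illustration under my own assumptions: the name `letterspace` is hypothetical, and a real implementation would work with font letter spacing in TeX rather than inserting space characters.

```python
# Hypothetical sketch: spaced-out writing must also split the digraphs.

# Unicode digraph characters and their separate-letter equivalents.
SPLIT = {
    "\u01C4": "DŽ", "\u01C5": "Dž", "\u01C6": "dž",
    "\u01C7": "LJ", "\u01C8": "Lj", "\u01C9": "lj",
    "\u01CA": "NJ", "\u01CB": "Nj", "\u01CC": "nj",
}


def letterspace(word, gap=" "):
    """Separate all letters equally, splitting digraph characters first."""
    separate = "".join(SPLIT.get(ch, ch) for ch in word)
    return gap.join(separate)
```

Splitting first and spacing afterwards is what gives `L j u b l j a n a` rather than `Lj u b lj a n a`, whether the input uses digraph characters or separate letters.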

ivankokan avatar Apr 20 '21 13:04 ivankokan