polyglossia
Support Latin <-> Cyrillic transliteration and Latin digraphs for Serbian
We do support multiple scripts (the same for input and output) via the `script` option. We do not have a case yet where we support transliteration, though.
Wikipedia tells me that three scripts are common in different regions: Arabic, Latin, and Cyrillic. Given this, a `script` option would make sense.
Originally posted by @jspitz in https://github.com/reutenauer/polyglossia/issues/482#issuecomment-803815864
I am not sure whether the comment above means that transliteration is being considered for polyglossia's future...
Here are the bidirectional Unicode mappings for Serbian to start with.
serbian_cyrillic-latin_transliteration.xlsx
Note:
- To be precise, there are no Cyrillic digraphs in Serbian (Љ, љ, Њ and њ can be considered as digraph-like letter pairs merged into single characters).
- On top of that, there are no Cyrillic Title case variants.
- The same mechanism as for Croatian should be used for Latin digraphs (checks, fallbacks to separate characters, options, and shorthands - at least for digraphs). See #216.
- Cyrillic Serbian "digraphs" are widely used and available within Cyrillic fonts (even within `T2A`) and keyboard layouts, i.e. neither checks nor fallbacks to separate characters need to be implemented.
- Mappings are almost completely bijective, except for the 3 mappings where the Latin Title case digraphs must be mapped to Cyrillic Upper case characters (there is no Title case for Cyrillic at all).
- There are neither Latin digraphs nor Cyrillic "digraphs" present in `gloss-serbian.ldf`, which is good: nothing to take care of.
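To make the "almost bijective" note concrete, here is a minimal Python sketch. It assumes that "Latin digraphs" means the single-code-point Unicode characters U+01C4 to U+01CC, and it uses only an illustrative subset of the mapping. The names `CYR_TO_LAT`, `LAT_TO_CYR`, `transliterate` and `non_injective_targets` are made up for this sketch and are not taken from any map file.

```python
# Single-code-point Latin digraphs (Unicode U+01C4..U+01CC), written as
# escapes so that they cannot be silently split into two letters.
DZ_UPPER, DZ_TITLE, DZ_LOWER = "\u01C4", "\u01C5", "\u01C6"
LJ_UPPER, LJ_TITLE, LJ_LOWER = "\u01C7", "\u01C8", "\u01C9"
NJ_UPPER, NJ_TITLE, NJ_LOWER = "\u01CA", "\u01CB", "\u01CC"

# Illustrative subset only (assumption): the real table covers the whole alphabet.
CYR_TO_LAT = {
    "љ": LJ_LOWER, "Љ": LJ_UPPER,
    "њ": NJ_LOWER, "Њ": NJ_UPPER,
    "џ": DZ_LOWER, "Џ": DZ_UPPER,
    "ђ": "đ", "Ђ": "Đ",
}

# Inverting the table gives the Latin -> Cyrillic direction ...
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}
# ... plus the three titlecase digraphs, which have no Cyrillic titlecase
# counterpart and therefore land on the Cyrillic uppercase letters.
LAT_TO_CYR.update({LJ_TITLE: "Љ", NJ_TITLE: "Њ", DZ_TITLE: "Џ"})


def transliterate(text: str, table: dict) -> str:
    """Map each character through the table, leaving unknown characters alone."""
    return "".join(table.get(ch, ch) for ch in text)


def non_injective_targets(table: dict) -> dict:
    """Return the targets that are reached from more than one source."""
    sources = {}
    for src, dst in table.items():
        sources.setdefault(dst, []).append(src)
    return {dst: srcs for dst, srcs in sources.items() if len(srcs) > 1}
```

With this subset, `non_injective_targets(LAT_TO_CYR)` reports exactly Љ, Њ and Џ, each reachable from both the uppercase and the titlecase Latin digraph.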
Some good examples to eventually test with:
- аАбБвВгГдДђЂжЖћЋчЧшШ <-> aAbBvVgGdDđĐžŽćĆčČšŠ
- љЉњЊџЏ -> ljLJnjNJdžDŽ (Latin digraphs if `disableligatures` is `false`)
- љЉњЊџЏ -> ljLJnjNJdžDŽ (separate characters if the font is missing Latin digraphs or `disableligatures` is `true`)
- ljLJnjNJdžDŽ (separate characters) -> лјЛЈнјНЈджДЖ (separate characters)
- "lj"Lj"LJ"nj"Nj"NJ"dž"Dž"DŽ (shorthands with separate characters) -> љЉЉњЊЊџЏЏ
- ljLjLJnjNjNJdžDžDŽ (Latin digraphs) -> љЉЉњЊЊџЏЏ.
I am not sure whether the comment above means that transliteration is being considered for polyglossia's future...
Many things are possible if someone steps up and does the implementation.
@yannis1962 has prepared map files based on my contribution here. We'll see what happens next...
I have prepared map files for Latin->Cyrillic and Cyrillic->Latin in the case of Serbian.
The only flaw I see is that when I have Љ Њ Џ as input, I can send them either to LJ NJ DŽ (uppercase) or to Lj Nj Dž (titlecase). I added a context rule so that Љ Њ Џ followed by a lowercase letter is always sent to titlecase, and otherwise to uppercase.
I need confirmation by native speakers that this is a good choice.
For example, what happens when somebody has a given name starting with Љ? When I transliterate the initial “Љ.” into Latin I will get “LJ.” which is obviously bad, but is the correct way to write the initial in that case “Lj.” or rather “L.” ?
Maybe I should implement another rule saying that when Љ is not preceded by a capital letter and is followed by a period, it should be titlecase?
I need help from native speakers…
I'm including the MAP and TEC files, as well as two test files with the UDHR in Serbian (converted from Latin to Cyrillic and from Cyrillic to Latin) in TeX and PDF format. You will need to use some other font if you run them (XeTeX only).
The only flaw I see is that when I have Љ Њ Џ as input, I can send them either to LJ NJ DŽ (uppercase) or to Lj Nj Dž (titlecase). I added a context rule so that Љ Њ Џ followed by a lowercase letter is always sent to titlecase, and otherwise to uppercase.
I need confirmation by native speakers that this is a good choice.
I am not a native speaker/writer but it looks OK.
For example, what happens when somebody has a given name starting with Љ? When I transliterate the initial “Љ.” into Latin I will get “LJ.” which is obviously bad, but is the correct way to write the initial in that case “Lj.” or rather “L.” ?
Here I can contribute with the explicit rule:
Правопис српскога језика, Матица српска, 1994. (друго издање)
https://sr.wikipedia.org/sr-el/%D0%9F%D1%80%D0%B0%D0%B2%D0%BE%D0%BF%D0%B8%D1%81_%D1%81%D1%80%D0%BF%D1%81%D0%BA%D0%BE%D0%B3%D0%B0_%D1%98%D0%B5%D0%B7%D0%B8%D0%BA%D0%B0
https://gimnazijadg.files.wordpress.com/2012/03/pravopis-srpskoga-jezika.pdf
Free translation: Latin digraphs used as starting letters in a sentence, a given name, or an abbreviation must be written as given in Table 8: Dž, Lj, Nj; but as DŽ, LJ, NJ in fully uppercase context (to emphasize).
I guess there are no changes in newer editions.
Maybe I should implement another rule saying that when Љ is not preceded by a capital letter and is followed by a period, it should be titlecase?
I need help from native speakers…
Definitely, let us wait until then...
As I suspected. So that raises the question: how do I force the transcription into titlecase?
How about using a LaTeX macro \titlecase{Љ} to be sure you will get a titlecase, no matter what follows?
"Smart ways": transliterate to titlecase if it is followed by something lowercase (starting a sentence) or a period (initials/abbreviations). This would obviously fail with a sentence having simply "Љ" as its first word.
I think that macro is unavoidable in any case, so no "smart way" needs to be implemented.
What I have done is:
1. titlecase if followed by lowercase
2. uppercase if preceded by uppercase
3. titlecase if not (2) and followed by period
These three rules should cover most of the cases…
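For readers who want to experiment with these rules outside of the map files, here is a rough Python sketch. It is not the TECkit implementation; it assumes the rules are tried in the order listed, falls back to uppercase when none matches, and always emits separate Latin letters (no single-character digraphs). The names `LOWER`, `digraph_case` and `cyr_to_lat` are invented for the sketch.

```python
# Serbian Cyrillic lowercase letters and their Latin counterparts, in alphabet order.
CYR = "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш".split()
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
LOWER = dict(zip(CYR, LAT))


def digraph_case(text: str, i: int) -> str:
    """Pick LJ/Lj, NJ/Nj or DŽ/Dž for the uppercase Љ, Њ or Џ at text[i]."""
    lat = LOWER[text[i].lower()]
    upper, title = lat.upper(), lat[0].upper() + lat[1:]
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if nxt.islower():    # rule 1: followed by lowercase -> titlecase
        return title
    if prev.isupper():   # rule 2: preceded by uppercase -> uppercase
        return upper
    if nxt == ".":       # rule 3: not (2) and followed by a period -> titlecase
        return title
    return upper         # no rule matched: uppercase


def cyr_to_lat(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin, leaving other characters alone."""
    out = []
    for i, ch in enumerate(text):
        low = ch.lower()
        if low not in LOWER:
            out.append(ch)                   # punctuation, spaces, digits, ...
        elif ch == low:
            out.append(LOWER[low])           # lowercase letter
        elif len(LOWER[low]) == 1:
            out.append(LOWER[low].upper())   # ordinary uppercase letter
        else:
            out.append(digraph_case(text, i))
    return "".join(out)
```

Calling `cyr_to_lat("Љубљана")` exercises rule 1, `cyr_to_lat("КОЊ")` rule 2 and `cyr_to_lat("Џ. Костанза")` rule 3. A lone "Љ" still comes out in uppercase, which is the case that needs an explicit override.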
I have been in contact with Uroš Stefanović (https://ctan.org/author/stefanovic) meanwhile. It seems we are getting somewhere with this implementation.
Let me just summarize what we currently have:
- map files prepared by Yannis Haralambous (@yannis1962), XeTeX only, including three smart rules on how to transliterate from Cyrillic uppercase to Latin:
  1. titlecase if followed by lowercase
  2. uppercase if preceded by uppercase
  3. titlecase if not 2. and followed by a period
- enriched set of small test examples (spaces are added so that the rules 1.-3. do not transliterate wrongly):
  - а А б Б в В г Г д Д ђ Ђ ж Ж ћ Ћ ч Ч ш Ш <-> a A b B v V g G d D đ Đ ž Ž ć Ć č Č š Š
  - љ Љ њ Њ џ Џ -> lj LJ nj NJ dž DŽ (Latin digraphs if `disableligatures` is `false`)
  - љ Љ њ Њ џ Џ -> lj LJ nj NJ dž DŽ (separate characters if the font is missing Latin digraphs or `disableligatures` is `true`)
  - lj Lj LJ nj Nj NJ dž Dž DŽ (separate characters) -> лј Лј ЛЈ нј Нј НЈ дж Дж ДЖ (separate characters)
  - "lj "Lj "LJ "nj "Nj "NJ "dž "Dž "DŽ (shorthands with separate characters) -> љ Љ Љ њ Њ Њ џ Џ Џ
  - lj Lj LJ nj Nj NJ dž Dž DŽ (Latin digraphs) -> љ Љ Љ њ Њ Њ џ Џ Џ
- more test examples to test smart rules (each one in two variants depending on `disableligatures`):
  - ЉУДИ -> LJUDI / LJUDI (no rule would be applied)
  - Љубљана -> Ljubljana / Ljubljana (rule 1.)
  - КОЊ -> KONJ / KONJ (rule 2.)
  - Џ. Костанза -> Dž. Kostanza / Dž. Kostanza (rule 3.)
  - one wants Џ. КОСТАНЗА -> DŽ. KOSTANZA / DŽ. KOSTANZA (rule 3. would be wrongly applied, producing Dž / Dž; one would need to use something like `\uppercase{Џ}`)
  - one wants Љ -> Lj / Lj (no rule would be applied, wrongly producing LJ / LJ; one would need to use something like `\titlecase{Љ}`)
  - ADDED: one wants Џ. К О С Т А Н З А -> D Ž. K O S T A N Z A (rule 3. would be wrongly applied, producing Dž / Dž; one would need to use something like `\uppercase[separate]{Џ}`)
  - ADDED: one wants Љ У Б Љ А Н А -> L J U B L J A N A (no rule would be applied, wrongly producing LJ U B LJ A N A / LJ U B LJ A N A; one would need to use something like `\uppercase[separate]{Љ}`)
  - ADDED: one wants Љ у б љ а н а -> L j u b l j a n a (no rule would be applied, wrongly producing LJ u b lj a n a / LJ u b lj a n a; one would need to use something like `\titlecase[separate]{Љ}` and `\lowercase[separate]{љ}`)
TODO:
- integrate Yannis' map files
- Yannis Haralambous (@yannis1962) should eventually be acknowledged as a contributor in the manual
- LuaTeX transliteration support - can someone provide references on how to achieve the same?
- take over all `serbian`/`serbianc` babelshorthands
- add digraphs ligatures shorthands (like in Croatian, be careful with `"D` and `"d` as such babelshorthands already exist for `Đ`/`đ`)
- add support for explicit uppercase -> uppercase / titlecase transliteration in Cyrillic -> Latin direction
I guess that's all.
As for LuaTeX: Look at how ArabLuaTeX does it.
More specifically: https://tex.stackexchange.com/questions/285610/
I have found two additional rules: Правопис српскога језика, Матица српска, 2010. (измењено и допуњено, четврто издање) https://jelenaradomir.files.wordpress.com/2016/08/pravopis-ms_2010.pdf
При размакнутом (спационираном) писању сва слова се једнако раздвајају (L j u b l j a n a а не Lj u b lj a n a). Ако се натписи (нпр. MENJAČNICA) пишу одозго надоле, NJ, LJ односно DŽ не треба да остану састављени, него друго слово долази испод првог.
Google Translate (a bit improved): With an increased letter spacing (separated characters), all glyphs are equally separated (L j u b l j a n a, not Lj u b lj a n a). If the inscriptions (e.g. MENJAČNICA) are written from top to bottom, NJ, LJ or DŽ should not remain composed, but the second letter comes below the first instead.
I would say that the first rule is feasible by providing an optional argument `separate` to the future macros `\uppercase{Љ}` and `\titlecase{Љ}`. (I edited my previous comment that summarizes everything: https://github.com/reutenauer/polyglossia/issues/483#issuecomment-818276717.)
The second rule is way outside `polyglossia`'s scope.
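To illustrate what such explicit overrides with a `separate` option could produce, here is a small Python sketch. It only models the intended output. The function `render_digraph` and its parameters are hypothetical, and the proposed macros do not exist yet.

```python
# Latin base forms of the three Serbian Cyrillic "digraph" letters.
DIGRAPH_BASE = {"љ": "lj", "њ": "nj", "џ": "dž"}


def render_digraph(letter: str, case: str, separate: bool = False, gap: str = " ") -> str:
    """Return the Latin form of Љ/Њ/Џ (or љ/њ/џ) in an explicitly requested case.

    case is "upper", "title" or "lower"; with separate=True the two Latin
    letters are split by gap, as required for spaced-out writing.
    """
    first, second = DIGRAPH_BASE[letter.lower()]
    if case == "upper":
        first, second = first.upper(), second.upper()
    elif case == "title":
        first = first.upper()
    elif case != "lower":
        raise ValueError(f"unknown case: {case!r}")
    return first + (gap if separate else "") + second


# Spaced-out "Ljubljana": every letter is separated, including both halves
# of each digraph.
spaced = " ".join([
    render_digraph("Љ", "title", separate=True),
    "u", "b",
    render_digraph("љ", "lower", separate=True),
    "a", "n", "a",
])
```

Here `spaced` ends up as `L j u b l j a n a`, matching the 2010 orthography rule quoted above, while `render_digraph("Љ", "title")` gives the plain `Lj` needed for a lone initial.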