feat(lang): Add language-aware text transformation for uppercase and lowercase
Fixes incorrect uppercase/lowercase transformations for Greek and Turkish text when using CSS text-transform (#1749). Python's default .upper() and .lower() don't handle language-specific rules (e.g., Turkish i → İ, Greek accented characters). Modified uppercase() and lowercase() to accept a lang parameter taken from box.style['lang'], added is_greek_lang() and is_turkish_lang() helper functions, and implemented naive character mappings for Turkish (i/İ/ı/I) and Greek accented characters. Tested with Greek (lang="el") and Turkish (lang="tr") HTML files to verify correct transformations and backwards compatibility.
That’s cool, thanks! You got the overall idea and the code is great.
There are some changes I’d like to have before merging. Would you prefer to have a review and do the changes yourself, or should I add commits on top of your pull request so that you can see what I have in mind?
I’d prefer a quick review if possible, though feel free to commit on top if that’s faster.
Just cleaned up the uppercase and lowercase functions, waiting for your review.
Hmm, my review wasn’t clear. The goal is to define a different mapping for each language with exceptions (with Greek letters for Greek, the dotless i for Turkish) and keep it empty in other languages. This way, you can always do the loop after defining the mapping.
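As a rough illustration of that idea (a minimal sketch, not the code in this pull request), the mapping can be chosen from the language and then applied in a single loop before falling back to Python's default. The function signature, the subtag parsing and the Greek mapping below are assumptions made for the example; the Greek part only strips the tonos and ignores the diaeresis and disjunctive eta cases.

```python
def uppercase(text, lang=None):
    """Uppercase text, applying a per-language exception mapping first.

    Hypothetical sketch: the name and signature are illustrative only.
    """
    # Keep only the primary language subtag, e.g. 'tr' for 'tr-TR'.
    primary = (lang or '').split('-')[0].lower()

    # The mapping stays empty for languages without exceptions.
    mapping = {}
    if primary in ('tr', 'az'):
        mapping = {
            'i': '\u0130',  # i -> İ (dotted capital I)
            '\u0131': 'I',  # ı -> I (dotless i)
        }
    elif primary == 'el':
        mapping = {
            '\u03ac': '\u0391',  # ά -> Α
            '\u03ad': '\u0395',  # έ -> Ε
            '\u03ae': '\u0397',  # ή -> Η
            '\u03af': '\u0399',  # ί -> Ι
            '\u03cc': '\u039f',  # ό -> Ο
            '\u03cd': '\u03a5',  # ύ -> Υ
            '\u03ce': '\u03a9',  # ώ -> Ω
        }

    # The loop always runs; it is a no-op when the mapping is empty.
    for source, target in mapping.items():
        text = text.replace(source, target)
    return text.upper()
```

With this shape, adding a language only means adding a mapping; the loop and the final call to upper() stay the same.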
I hope I got your point this time.
I guess the rules for capitalization remain to be handled, as well as the Lithuanian and Azeri exceptions.
I'd like to work on capitalization.
Thanks for your code!
I’ve updated the code to handle more cases and add tests. I’ve also added links so that we don’t forget where these rules come from.
I’ve spent a lot of (too much?) time finding reliable sources, and the best resource I’ve found is these tests (even if they forgot the red color on some of them). On this page, we should only take care of:
- the part of the "special cases" that depends on the document language,
- the "tailoring".
The other tests should be handled by the standard Unicode algorithm, so, in other words, by Python. The current result is not perfect, but often better than some browsers :smile:. We don’t want to fix these cases.
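For illustration, here is what Python's built-in methods do on their own: they apply the language-independent special cases, but they know nothing about the document language. The variable names are only there for the example.

```python
# Language-independent special cases: handled by Python itself.
sharp_s = 'stra\u00dfe'.upper()  # ß expands to two letters: 'STRASSE'
final_sigma = '\u039f\u0394\u039f\u03a3'.lower()  # ΟΔΟΣ ends with a final sigma (ς)

# Language-dependent cases: Python always applies the default rules,
# which is why Turkish and Azeri need their own mapping.
default_i = 'i'.upper()  # 'I', never İ
dotted_capital = '\u0130'.lower()  # İ becomes 'i' + combining dot above (U+0307)

print(sharp_s, final_sigma, default_i, len(dotted_capital))
```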
For uppercase and lowercase, the only test that doesn’t pass is the disjunctive eta. It requires finding word boundaries (as we somehow already do in capitalize) and a reliable source of information that explains the case, so I’ve added a TODO and we can keep that for later.
I'd like to work on capitalization.
No problem!
You can use str.upper() except for:
- Turkish and Azeri, which have the same rules as for uppercase,
- Dutch, which transforms a leading "ij" into "IJ".
Greek keeps the standard rules for capitalisation, so using upper is OK.
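A minimal sketch of these rules for a single word, assuming word splitting has already been done elsewhere. The function name and signature are hypothetical and only serve the example.

```python
def capitalize_word(word, lang=None):
    """Uppercase the first letter of one word, with language exceptions.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not word:
        return word

    # Keep only the primary language subtag, e.g. 'nl' for 'nl-BE'.
    primary = (lang or '').split('-')[0].lower()

    # Dutch: a leading "ij" is capitalized as a whole.
    if primary == 'nl' and word.startswith('ij'):
        return 'IJ' + word[2:]

    # Turkish and Azeri: same exception as for uppercase, i -> İ.
    if primary in ('tr', 'az') and word[0] == 'i':
        return '\u0130' + word[1:]

    # Everything else, Greek included, relies on str.upper().
    return word[0].upper() + word[1:]
```

For example, `capitalize_word('ijsland', 'nl')` gives `'IJsland'`, while the same word with `lang='en'` gives `'Ijsland'`.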
Here’s a "short" test sample that should cover what we want:
<html lang="en">
<head>
<meta charset="utf-8">
<style type="text/css">
@font-face {
font-family: doulos;
src: url(https://w3c.github.io/i18n-tests/fonts/sil/DoulosSIL-regular.woff2);
}
@font-face {
font-family: gentium;
src: url(https://w3c.github.io/i18n-tests/fonts/sil/GentiumPlus-regular.woff2);
}
.test, .ref { font-size: 200%; line-height: 1.5em; }
html { font-family: doulos; }
[lang=el] { font-family: gentium; }
.test span, .ref span { margin-right: 0; white-space: nowrap; }
.ref { color: red; } .ref { z-index: -100; }
span { display: inline-block; min-width: 2em; }
.ref { position:absolute; top:0; }
.upper .test { text-transform: uppercase; }
.lower .test { text-transform: lowercase; }
.capitalize .test { text-transform: capitalize; }
</style>
</head>
<body>
<p class="instructions">Test passes if you see no red characters.</p>
<div class="lower">
<div style="position:relative" lang="tr">
<div class="test"><span>İ</span> <span>İ</span> <span>I</span></div>
<div class="ref"><span>i</span> <span>i</span> <span>ı</span></div>
</div>
<div style="position:relative" lang="az">
<div class="test"><span>İ</span> <span>İ</span> <span>I</span></div>
<div class="ref"><span>i</span> <span>i</span> <span>ı</span></div>
</div>
<div style="position:relative" lang="lt">
<div class="test"><span>Ì</span> <span>Í</span> <span>Ĩ</span></div>
<div class="ref"><span>i̇̀</span> <span>i̇́</span> <span>i̇̃</span></div>
</div>
</div>
<div class="upper">
<div style="position:relative" lang="tr">
<div class="test"><span>i</span> <span>ı</span></div>
<div class="ref"><span>İ</span> <span>I</span></div>
</div>
<div style="position:relative" lang="az">
<div class="test"><span>i</span> <span>ı</span></div>
<div class="ref"><span>İ</span> <span>I</span></div>
</div>
<div style="position:relative" lang="el">
<div class="test">καλημέρα αύριο</div>
<div class="ref">ΚΑΛΗΜΕΡΑ ΑΥΡΙΟ</div>
</div>
<div style="position:relative" lang="el">
<div class="test">θεϊκό</div>
<div class="ref">ΘΕΪΚΟ</div>
</div>
<div style="position:relative" lang="el">
<div class="test">ευφυΐα Νεράιδα</div>
<div class="ref">ΕΥΦΥΪΑ ΝΕΡΑΪΔΑ</div>
</div>
<div style="position:relative" lang="el">
<div class="test">ήσουν ή εγώ ή εσύ</div>
<div class="ref">ΗΣΟΥΝ Ή ΕΓΩ Ή ΕΣΥ</div>
</div>
</div>
<div class="capitalize">
<div style="position:relative" lang="el">
<div class="test">όμηρος</div>
<div class="ref">Όμηρος</div>
</div>
<div style="position:relative" lang="nl">
<div class="test">ijsland</div>
<div class="ref">IJsland</div>
</div>
</div>
</body>
</html>
Hello @liZe,
Are there any remaining updates that need to be completed?