
Solve some problems with RTL Languages

Open AhmedElTabarani opened this issue 3 years ago • 12 comments

Here I will talk about some problems with RTL languages and their solutions. I will explain all the points here, and we can discuss them. Maybe we can then add a section about these problems and solutions to the Guidelines in CONTRIBUTING.

The discussion behind this issue started in these PRs: https://github.com/EbookFoundation/free-programming-books/pull/6706 and https://github.com/EbookFoundation/free-programming-books/pull/6715

What is the issue?

Suppose we have an RTL entry like this:

* [تعلم البرمجة](URL) - Author Name

Note: تعلم البرمجة means "Learn Programming".

It will appear on the website like this: image

In this case, we can just wrap it in a div with dir="rtl":

<div dir="rtl">

* [تعلم البرمجة](URL) - Author Name
</div>

Result: image

Is that it? No! The monster shows up below 😢

The issue with mixing RTL and LTR languages!

The real problem appears when mixing RTL with LTR languages.

Case 1

<div dir="rtl">

* [تعلم HTML](URL) - Author Name
</div>

Note: تعلم means "Learn".

Result: image

Look, the renderer put the words through the mixer!

Case 2

If we need LTR text to align to the right (both the author name and the title are LTR):

<div dir="rtl">

* [Learn HTML](URL) - Author Name
</div>

Result: image

Both words have been swapped!!

Solution?

We can solve both problems with a Unicode mark called RLM (right-to-left mark): https://en.wikipedia.org/wiki/Right-to-left_mark

Add &rlm; after the LTR word that we want laid out as part of the RTL text (the mark is an invisible character that behaves like an RTL letter).

Solve case 1

<div dir="rtl">

* [تعلم HTML&rlm;](URL) - Author Name
</div>

Result: image

We added &rlm; after HTML

Solve case 2

<div dir="rtl">

* [Learn HTML&rlm;](URL) - Author Name
</div>

Result: image

You get the point!
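
The two fixes above could be automated with a small helper. This is only a minimal sketch: the function name `addRlmToTitle` and the "title ends with a Latin letter or digit" heuristic are assumptions for illustration, not something agreed on in this thread.

```javascript
// Hypothetical helper: append &rlm; to a Markdown link title that ends
// with a Latin letter or digit (the situation in Case 1 and Case 2).
const RLM_ENTITY = "&rlm;";

function addRlmToTitle(line) {
  // Group 1: "* [" plus the title, ending in a Latin letter or digit.
  // Group 2: the "](" that closes the title and opens the URL.
  return line.replace(/^(\* \[[^\]]*[A-Za-z0-9])(\]\()/, `$1${RLM_ENTITY}$2`);
}

console.log(addRlmToTitle("* [تعلم HTML](URL) - Author Name"));
console.log(addRlmToTitle("* [Learn HTML](URL) - Author Name"));
// A title that ends with an Arabic letter is left unchanged.
console.log(addRlmToTitle("* [تعلم البرمجة](URL) - Author Name"));
```

Because a title that already ends with `&rlm;` ends with `;`, running the helper twice does not add a second mark.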

Extra Cases!

Case 1

Try to make C# go to the right!

<div dir="rtl">

* C#
* [تعلم لغة C# الرائعة](URL) - إسم المؤلف
</div>

Note: * [تعلم لغة C# الرائعة](URL) - إسم المؤلف means * [Learn the Cool C# Language](URL) - Author Name

Result: image

Symbols have the same problem when we try to display them in RTL, and the solution is similar 😉: the LRM Unicode mark: https://en.wikipedia.org/wiki/Left-to-right_mark

<div dir="rtl">

* C#&lrm;
* [تعلم لغة C#&lrm; الرائعة](URL) - إسم المؤلف
</div>

Why do we use &lrm; and not &rlm;? When C# is placed in an RTL context to move it to the right, the symbol # takes on the RTL direction, so it is reordered to the other side of the C.

Adding &lrm; after C# marks it as an LTR word, so it renders as an LTR word.
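
A minimal sketch of this step follows. The function name `addLrmAfterSymbolTerms` and the pattern are assumptions: it only covers Latin terms that end in `#` or `+` (such as C#, F#, C++), and it is meant for title text, not for URLs, which may contain `#` fragments.

```javascript
// Hypothetical helper: add &lrm; after a Latin term that ends with "#" or "+".
// The lookahead skips terms already followed by an LRM (entity or raw
// U+200E) and prevents matching only part of a symbol run such as "++".
function addLrmAfterSymbolTerms(title) {
  return title.replace(
    /([A-Za-z][A-Za-z0-9]*[#+]+)(?![#+]|&lrm;|\u200E)/g,
    "$1&lrm;"
  );
}

console.log(addLrmAfterSymbolTerms("C#"));
console.log(addLrmAfterSymbolTerms("تعلم لغة C# الرائعة"));
```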

Case 1.1

Both the author name and the title are LTR, and the title ends with a symbol, as in C#

<div dir="rtl">

* [Learn C#](URL) - Author Name
</div>

Result: image

The first step is simple: just put &rlm; at the end of the title

<div dir="rtl">


* [Learn C#&rlm;](URL) - Author Name
</div>

Result: image

But note that the symbol # now takes on the RTL direction, so it is reordered to the other side. So we must also put &lrm; right after the symbol, before the &rlm;.

<div dir="rtl">

* [Learn C#&lrm;&rlm;](URL) - Author Name
</div>

Result: image

Case 2

If the title is in English and the author name is in Arabic

* [Learn HTML](URL) - إسم المؤلف

Result: image

It is enough to set the direction to RTL, without adding any Unicode mark

<div dir="rtl">

* [Learn HTML](URL) - إسم المؤلف
</div>

Result: image

Case 3

Sometimes we add some information like (:construction: *in process*) after the author name

<div dir="rtl">

* [عنوان بالعربي](URL) - Author Name (meta data)
* [Title In LTR&rlm;](URL) - Author Name (meta data)
</div>

Result: image

It looks correct, but since we read from right to left, it would be nicer to have this information on the left, so that the author name is read first and the extra information after it.

To solve this, we just put &rlm; after the name

<div dir="rtl">

* [عنوان بالعربي](URL) - Author Name&rlm; (meta data)
* [Title In LTR&rlm;](URL) - Author Name&rlm; (meta data)
</div>

Result: image
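
The same idea as a minimal sketch. The function name `addRlmBeforeMeta` is hypothetical, and the pattern assumes the entry ends with a single parenthesized note preceded by a space and an LTR author name.

```javascript
// Hypothetical helper: put &rlm; between an LTR author name and a trailing
// "(...)" note at the end of the line.
function addRlmBeforeMeta(line) {
  // Group 1: last Latin letter or digit of the author name.
  // Group 2: the space and the parenthesized note that ends the line.
  return line.replace(/([A-Za-z0-9])( \([^()]*\))$/, "$1&rlm;$2");
}

console.log(addRlmBeforeMeta("* [عنوان بالعربي](URL) - Author Name (meta data)"));
console.log(addRlmBeforeMeta("* [Title In LTR&rlm;](URL) - Author Name (meta data)"));
```

Entries without a trailing note, or whose author name already ends with `&rlm;`, are returned unchanged.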

AhmedElTabarani avatar Feb 10 '22 22:02 AhmedElTabarani

We could add a section about this solution to the Guidelines in CONTRIBUTING (after we finish discussing it here, of course).

Other contributors can then do the same with their own RTL languages.

AhmedElTabarani avatar Feb 10 '22 22:02 AhmedElTabarani

Thanks for adding this. We can leave it open for a while.

eshellman avatar Feb 11 '22 02:02 eshellman

As commented in #6715, these marks, whether written as an HTML entity or as the raw Unicode character, break the alphabetize plugin, and it is even worse when they are placed at the beginning of a sentence (for the reason, see https://github.com/vhf/remark-lint-alphabetize-lists/blob/ee5f968040acf941c9c4d61fefb2bb1e3b1e8a7b/lib/alphabetical-list-items.js#L5-L14)

From Windows11 charmap.exe image

Moreover, the non-printable character itself should be used instead of the HTML entity. Remember that Markdown markup should be HTML-agnostic.
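
A minimal sketch of both points, assuming nothing about the plugin's internals: the function names are hypothetical, and this is not the code of remark-lint-alphabetize-lists. It converts the entities to the raw characters (LRM is U+200E, RLM is U+200F) and shows a comparison that ignores the marks.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK
const RLM = "\u200F"; // RIGHT-TO-LEFT MARK

// Replace the HTML entities with the raw, non-printable characters.
function entitiesToRawMarks(text) {
  return text.replace(/&lrm;/g, LRM).replace(/&rlm;/g, RLM);
}

// Remove both forms of the marks, e.g. before comparing list items.
function stripBidiMarks(text) {
  return text.replace(/&lrm;|&rlm;|[\u200E\u200F]/g, "");
}

// Hypothetical comparator that orders items as if the marks were absent.
function compareIgnoringMarks(a, b) {
  return stripBidiMarks(a).localeCompare(stripBidiMarks(b));
}

console.log(JSON.stringify(entitiesToRawMarks("C#&lrm;")));
console.log(["&rlm;B", "A"].sort(compareIgnoringMarks));
```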

davorpa avatar Feb 11 '22 21:02 davorpa

@davorpa I can write regex patterns for all these cases. Would that help you detect them automatically, or something like that, in the future?

AhmedElTabarani avatar Sep 21 '22 15:09 AhmedElTabarani

@davorpa I can write regex patterns for all these cases. Would that help you detect them automatically, or something like that, in the future?

Go ahead :wink:. It can be helpful to any maintainer :heart:

davorpa avatar Sep 21 '22 15:09 davorpa

@AhmedElTabarani Hello sir, can I work on this?

Mayank7225 avatar Oct 15 '22 04:10 Mayank7225

@AhmedElTabarani Hello sir, can I work on this?

About the regex patterns? OK, no problem at all.

I was working on it, but I have been very busy these past weeks.

I had decided to write a JavaScript script to detect all of these cases, plus some unit tests to keep everything organized.

This is the last thing I ended up with; maybe it will help you.

Case 0 (It is enough to make a div with dir='rtl')
* [تعلم البرمجة](URL) - Author Name
Regex:
^\* \[[^\w\d\?><;,\{\}\[\]\-_\+=!@\#\$%^&\*\|\']+\]\(.+\) - .+(?<!\(.+\))$


Case 1
* [تعلم HTML](URL) - Author Name
Regex:
^\* \[[\u04c7-\u0591\u05D0-\u05EA\u05F0-\u05F4\u0600-\u06FF-\u0621-\u064A\d\?><;,\{\}\[\]\-_\+=!@\#\$%^&\*\|\' ]+[\w\d]+\]\(.+\) - [\w\ ]+$

Case 2
* [Learn HTML](URL) - Author Name
Regex:
^\* \[[^\u04c7-\u0591\u05D0-\u05EA\u05F0-\u05F4\u0600-\u06FF-\u0621-\u064A]+[\w\d]\]\(.+\) - [\w\ ]+$

Extra Case 1
* C#
* [تعلم لغة C# الرائعة](URL) - إسم المؤلف


Extra Case 1.1
* [Learn C#](URL) - Author Name


Extra Case 2 (It is enough to make a div with dir='rtl')
* [Learn HTML](URL) - إسم المؤلف

Extra Case 3
* [عنوان بالعربي](URL) - Author Name (meta data)
* [Title In LTR&rlm;](URL) - Author Name (meta data)
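
A minimal sketch of how such a detection script might start. It is an alternative approach, not the regexes above: it uses Unicode script property escapes to decide which of Case 0, Case 1, or Case 2 a title falls into. The function name, the labels, and the entry pattern are all hypothetical.

```javascript
// Arabic script also covers Persian; Hebrew is listed separately.
const RTL_CHAR = /[\p{Script=Arabic}\p{Script=Hebrew}]/u;
const LTR_CHAR = /\p{Script=Latin}/u;
// Assumed entry shape: "* [title](url) - author"
const ENTRY = /^\* \[([^\]]+)\]\(([^)]+)\) - (.+)$/;

function classifyTitle(line) {
  const match = ENTRY.exec(line);
  if (!match) return "not-an-entry";
  // Drop &lrm; / &rlm; so their Latin letters are not counted as LTR text.
  const title = match[1].replace(/&(?:lrm|rlm);/g, "");
  const hasRtl = RTL_CHAR.test(title);
  const hasLtr = LTR_CHAR.test(title);
  if (hasRtl && hasLtr) return "mixed";   // Case 1
  if (hasRtl) return "rtl-only";          // Case 0
  if (hasLtr) return "ltr-only";          // Case 2
  return "no-letters";
}

console.log(classifyTitle("* [تعلم البرمجة](URL) - Author Name"));
console.log(classifyTitle("* [تعلم HTML](URL) - Author Name"));
console.log(classifyTitle("* [Learn HTML](URL) - Author Name"));
```

Classifying by script keeps the character ranges out of the patterns; the symbol and metadata cases would still need their own checks.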

AhmedElTabarani avatar Oct 15 '22 06:10 AhmedElTabarani

The main RTL languages are Arabic, Persian and Hebrew... which are only 3 out of all the languages translated on this repo... might be better to have a special section for these languages... as it is not relevant for all the LTR ones.

avipars avatar Oct 20 '22 12:10 avipars

Have you tried the following?

  • Update the CONTRIBUTING.md file to include a section for RTL languages, explaining the issues, solutions, and usage of Unicode marks (RLM and LRM) for different cases.

  • Create a separate section or a separate file specifically for Arabic, Persian, and Hebrew languages in the repository, as @avipars suggested. This would help maintain a better organization for RTL languages and make it easier to manage content for these languages separately.

CryptoMitch avatar Apr 09 '23 10:04 CryptoMitch

some good ideas in this issue. Would welcome a PR.

eshellman avatar Apr 09 '23 22:04 eshellman

Does this issue still need to be fixed?

nerdberg792 avatar Oct 01 '23 09:10 nerdberg792

Can I work on this issue? Thank you.

JatinSainiOO7 avatar Oct 07 '23 02:10 JatinSainiOO7