
Hyphenation in Arabic script writing systems

Open behnam opened this issue 8 years ago • 16 comments

I think we need to cover hyphenation in its own section, separate from the justification matters. The most important question for that section to answer is when it's okay to break a line in the middle of a word and, if it is, how.

behnam avatar Apr 12 '17 00:04 behnam

https://www.tug.org/tugboat/tb27-2/tb87benatia.pdf

Benatia, Mohamed Jamal Eddine, Mohamed Elyaakoubi, and Azzeddine Lazrek. "Arabic text justification." TUGboat 27.2 (2006): 137-146.

[Image: screenshot]

behnam avatar Apr 12 '17 00:04 behnam

It looks like Adobe Illustrator tries to provide options for hyphenation, but the UI doesn't make much sense for Arabic text, so I assume they apply the same Latin logic to Arabic text.

https://helpx.adobe.com/illustrator/using/arabic-hebrew.html

[Image: screenshot]

behnam avatar Apr 12 '17 00:04 behnam

From https://www.w3.org/TR/css-text-3/#hyphens-property

When shaping scripts such as Arabic are allowed to break within words due to ‘break-all’, the characters must still be shaped as if the word were not broken.

Also:

[Image: screenshot]

@r12a, @ntounsi, do you know/remember what the source for this decision was? (I'm trying to gather all sources used and existing practices.)

behnam avatar Apr 12 '17 00:04 behnam

From http://unicode.org/reports/tr14/

Unicode® Standard Annex #14 UNICODE LINE BREAKING ALGORITHM

Hyphenation, and therefore the SHY, can be used with the Arabic script. If the rendering system breaks at that point, the display—including shaping—should be what is appropriate for the given language. For example, sometimes a hyphen-like mark is placed on the end of the line. This mark looks like a kashida, but is not connected to the letter preceding it. Instead, the appearance of the mark is as if it had been placed—and the line divided—after the contextual shapes for the line have been determined. For more information on shaping, see [UAX9] and Section 9.2, Arabic, of [Unicode].
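To make the SHY mechanism quoted above a bit more concrete, here is a minimal Python sketch. It assumes the break positions are already known and passed in by the caller; `insert_soft_hyphens`, `strip_soft_hyphens`, and the `positions` argument are names made up for this illustration, not part of any standard API. The example word is "رودررویی" from a Persian sample in this thread, written with escapes so the code points are unambiguous.

```python
import unicodedata

SHY = "\u00ad"  # U+00AD SOFT HYPHEN

def insert_soft_hyphens(word, positions):
    """Return `word` with a SHY inserted before each index in `positions`.

    `positions` is a hypothetical, caller-supplied list of break
    opportunities; index 0 is ignored because a word cannot break
    before its first character.
    """
    wanted = set(positions)
    out = []
    for i, ch in enumerate(word):
        if i in wanted and i > 0:
            out.append(SHY)
        out.append(ch)
    return "".join(out)

def strip_soft_hyphens(text):
    """Remove every SHY, recovering the unmarked text."""
    return text.replace(SHY, "")

# REH WAW DAL REH | REH WAW FARSI-YEH FARSI-YEH, marked as "رودر" + "رویی"
word = "\u0631\u0648\u062f\u0631\u0631\u0648\u06cc\u06cc"
marked = insert_soft_hyphens(word, [4])

# SHY is a format character (general category Cf), so it is invisible
# unless the renderer actually breaks the line at that point.
print(unicodedata.category(SHY))
print(strip_soft_hyphens(marked) == word)
```

Whether and how a visible mark is drawn at the break is left to the rendering system, as the UAX #14 text says; the sketch only covers where the SHY goes.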

I'm guessing this was the source for the css-text-3 decision. What do you think?

behnam avatar Apr 12 '17 00:04 behnam

From https://drafts.csswg.org/css-text/

[Image: screenshot]

behnam avatar Apr 12 '17 00:04 behnam

I looked at a bunch of Persian newspapers from this week and couldn't find a single instance of hyphenation. My guess is that they all turn it off because of the low quality of the existing digital typesetting solutions.

behnam avatar Apr 12 '17 03:04 behnam

From "ketaab-e jom'e", early 1980s:

Sample 1: "دیوان - سالاری" across lines.

[Image: screenshot]

Sample 2: "ویژه - نامه‌ها" and "حتّی - المقدور" across lines, consecutive.

[Image: ketaab-e jom'e, issue 31, sample 5]

Sample 3: "می - کنند" across pages.

[Image: ketaab-e jom'e, issue 31, sample 2]

Sample 4: "رودر - رویی" across columns.

[Image: two screenshots]

Notes:

  • All samples show inter-joining-segment hyphenation.
  • I couldn't find any instance of intra-joining-segment hyphenation in these publications.
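The inter-joining-segment break points noted above can be located mechanically. The following Python sketch is only an illustration under stated assumptions: `RIGHT_JOINING` is a small, hand-picked and incomplete subset of right-joining letters (a real implementation would read Joining_Type from the Unicode Character Database, which Python's `unicodedata` module does not expose), and `segment_break_points` is a hypothetical helper name. It reports the positions where the preceding letter does not join to the following one, either because it is right-joining or because a ZWNJ intervenes.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Assumption: illustrative, incomplete subset of right-joining letters.
RIGHT_JOINING = {
    "\u0622",  # ALEF WITH MADDA ABOVE
    "\u0623",  # ALEF WITH HAMZA ABOVE
    "\u0624",  # WAW WITH HAMZA ABOVE
    "\u0625",  # ALEF WITH HAMZA BELOW
    "\u0627",  # ALEF
    "\u0629",  # TEH MARBUTA
    "\u062f",  # DAL
    "\u0630",  # THAL
    "\u0631",  # REH
    "\u0632",  # ZAIN
    "\u0648",  # WAW
    "\u0698",  # JEH
}

def segment_break_points(word):
    """Return indices i such that breaking before word[i] falls
    between two joining segments rather than inside one."""
    points = []
    prev_base = None  # last character that was not a combining mark
    for i, ch in enumerate(word):
        is_mark = unicodedata.category(ch) == "Mn"
        if (
            not is_mark            # never separate a mark from its base
            and ch != ZWNJ         # break after the ZWNJ, not before it
            and prev_base is not None
            and (prev_base == ZWNJ or prev_base in RIGHT_JOINING)
        ):
            points.append(i)
        if not is_mark:
            prev_base = ch
    return points

# "می‌کنند" (MEEM, FARSI YEH, ZWNJ, KEHEH, NOON, NOON, DAL), as in Sample 3
mikonand = "\u0645\u06cc\u200c\u06a9\u0646\u0646\u062f"
print(segment_break_points(mikonand))  # [3]
```

This finds every segment boundary, which is far more than any publication would actually use; choosing among them (for example, only at ZWNJ or at morpheme boundaries) is a separate, language-specific decision.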

behnam avatar Apr 12 '17 03:04 behnam

My understanding is that the only modern Arabic-script orthography allowing hyphenation is Uyghur Ereb Yëziqi (the modern Arabic-based orthography; the old orthography, also Arabic-based, did not allow hyphenation), whose behavior is what Unicode and CSS are describing. I've been told that at some point (in the 80s?) Persian publications did use hyphenation and, IIRC, it was only allowed at ZWNJ.

The Arabic language, AFAIK, never had hyphenation. Even in the early stages of the orthography, when breaking inside words was allowed, no hyphen was used when breaking words, and the break would only happen between unjoined letters (i.e. only after right-joining letters), never between joined ones.

In the second sura (middle left of page), lines 4/5 السمو / ت, lines 7/8 ا / لحسنى, etc.

lines 1/2 و / احدة, lines 2/3 ر / تلنه, etc.

khaledhosny avatar Apr 12 '17 10:04 khaledhosny

@r12a, @ntounsi, do you know/remember what the source for this decision was? (I'm trying to gather all sources used and existing practices.)

@fantasai is the person to ask.

r12a avatar Apr 12 '17 11:04 r12a

For Persian...

Well, we have plenty of evidence of hyphenation, at least in movable-type sources, starting from the 1970s and possibly earlier.

Also, I remember seeing it in more recent publications, some of them computer typeset, but only in extreme situations, like very narrow columns.

Also, I remember being taught about it in elementary school (4th grade, IIRC), especially as a writing practice. It was not in the books, AFAIR, but the teacher would teach it in class. I don't remember the teacher asking us to break the word at a segment boundary, though; I have a fuzzy memory of being taught to break it at a syllable boundary.

I looked at many of the language and writing 1-12 books today hoping to find some mention of hyphenation. No luck.

Based on these, I think it's better to document it as a last-resort solution for some languages, including Persian, with an explanation of both the inter-segment and inter-syllable methods.

@shervinafshar, @mostafah, what do you think? Do you have better material on this? Maybe in Adib-Soltani book? (I don't have my copy here...)

behnam avatar Apr 13 '17 06:04 behnam

Agree with Khaled about the Arabic language. I've never seen hyphenation.

However, regardless of hyphenation, there is one situation where a word can be cut in two: in poetry, between the two half-lines of the same verse. The break doesn't always happen at a non-joining boundary.

[Image: word break in a poem]

ntounsi avatar Apr 13 '17 13:04 ntounsi

Very good point, @ntounsi! I remember we talked about this case once. I'll include this in the Joining section. (#97)

Now, a question would be, should we categorize this behavior under Hyphenation? Or maybe under Justification? (#57)

And, what would you call this behavior in Arabic? Any specific terms you use?

behnam avatar Apr 14 '17 17:04 behnam

And, what would you call this behavior in Arabic? Any specific terms you use?

The Arabic term is التدوير (al-ttadwīr). Wikipedia.

ntounsi avatar Apr 17 '17 16:04 ntounsi

Uyghur hyphenates. http://fantasai.inkedblade.net/style/scans/LoC025.png This is why it was added.

fantasai avatar Apr 18 '17 06:04 fantasai

From http://fantasai.inkedblade.net/style/scans/LoC025.png

[Image: screenshot]

Features:

  • Intra-segment line break with hyphenation.
  • Inter-segment line break with hyphenation.
  • Plenty of ZWNJ.

behnam avatar Apr 18 '17 06:04 behnam

Hyphenation Character

We also need to note that the character used for hyphenation (CSS' hyphenate-character) is commonly expected to sit on the baseline, similar to TATWEEL, but non-joining by itself.

Possible default characters are:

  • U+002D HYPHEN-MINUS
  • U+2010 HYPHEN

The preferred character probably depends on which of them exist in the font in use. If U+2010 is available, it is more likely to have the right shape. If not, falling back to U+002D is one option, another being TATWEEL.
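The fallback idea could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: `choose_hyphenation_char` is a hypothetical helper, the font's coverage is modelled as a plain set of code points supplied by the caller, and trying U+002D before TATWEEL is just one possible ordering of the two fallbacks mentioned above.

```python
HYPHEN = "\u2010"        # U+2010 HYPHEN
HYPHEN_MINUS = "\u002d"  # U+002D HYPHEN-MINUS
TATWEEL = "\u0640"       # U+0640 ARABIC TATWEEL

def choose_hyphenation_char(font_code_points, allow_tatweel=True):
    """Pick a hyphenation character the font can display.

    `font_code_points` is a set of integer code points the font covers
    (hypothetical input; a real system would query the font's cmap).
    Returns None when no candidate is available.
    """
    # Prefer U+2010, then U+002D (this ordering is an assumption).
    for candidate in (HYPHEN, HYPHEN_MINUS):
        if ord(candidate) in font_code_points:
            return candidate
    # TATWEEL as a last resort, if the caller allows it.
    if allow_tatweel and ord(TATWEEL) in font_code_points:
        return TATWEEL
    return None

# A font that lacks U+2010 but has HYPHEN-MINUS and TATWEEL
coverage = {0x002D, 0x0640}
print(repr(choose_hyphenation_char(coverage)))
```

Note that TATWEEL is a joining character, so a renderer that falls back to it would need extra care to keep it visually unconnected from the preceding letter.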

No matter which character is used, there needs to be some space between the hyphen and the previous letter (something like a narrow space), whether or not that letter is in its join-on-left form.

behnam avatar Apr 27 '17 23:04 behnam