Review of the characters table and the changes needed for CLDR

Open ntounsi opened this issue 7 years ago • 17 comments

Review of the characters table and the changes needed for CLDR. (Other details may be needed for this review)

Letters: U+0671 ARABIC LETTER ALEF WASLA is marked X (not used). It should be marked as auxiliary for Arabic. May be check-marked for Persian.

Diacritics: U+0670 ARABIC LETTER SUPERSCRIPT ALEF is used as an auxiliary character in some Koran publications. It should be marked * instead of X.

Some punctuation and symbols: U+0020 SPACE, U+002A ASTERISK *, U+002F SOLIDUS /, U+003C LESS-THAN SIGN <, U+003D EQUALS SIGN =, and U+003E GREATER-THAN SIGN > are marked X (not used) for Arabic. They should be check-marked.

Control characters: Some of the control characters (at least those related to Bidi), [U+202A..U+202E] and [U+2066..U+2069], as well as ZWJ and ZWNJ, are not language related. I think they should be marked as auxiliary, assuming that they are intended for special rather than normal use.
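
As a rough illustration of the point about control characters, the sketch below lists the code points named above with their Unicode names and general categories. It assumes only Python's standard `unicodedata` module; the names `BIDI_CONTROLS`, `JOIN_CONTROLS`, and `describe` are made up for this example and are not part of the alreq tooling. All of these characters report the general category Cf (format), which fits the observation that they are not tied to a particular language.

```python
import unicodedata

# Code points mentioned in the comment: the Bidi controls in
# U+202A..U+202E and U+2066..U+2069, plus ZWNJ and ZWJ.
BIDI_CONTROLS = list(range(0x202A, 0x202E + 1)) + list(range(0x2066, 0x2069 + 1))
JOIN_CONTROLS = [0x200C, 0x200D]  # ZWNJ, ZWJ


def describe(code_points):
    """Return (U+XXXX, character name, general category) for each code point."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)), unicodedata.category(chr(cp)))
        for cp in code_points
    ]


rows = describe(BIDI_CONTROLS + JOIN_CONTROLS)
for code, name, category in rows:
    print(code, category, name)
```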

ntounsi • Apr 10 '17 22:04

See Also: https://r12a.github.io/scripts/arabic/block#char0671

@ntounsi, @khaledhosny, can you provide more details about the use of U+0670 and U+0671 in the Arabic language, ideally covering both standard and local variants?

behnam • Apr 11 '17 15:04

Re U+0671: https://en.wikipedia.org/wiki/Dagger_alif

SUPERSCRIPT ALEF is one of the characters in the "الله" ligature, which makes it common in modern usage.

behnam • Apr 11 '17 15:04

That is U+0670.

Re U+0671: it can be used in any Arabic word starting with alif wasl. Usually U+0627 is used, but some publications use U+0671 for various reasons. Of course, its most common use is in the Quran.

https://en.wikipedia.org/wiki/Wasla https://en.wiktionary.org/wiki/%D9%B1

khaledhosny • Apr 11 '17 17:04

Also, I think this quote from https://r12a.github.io/scripts/arabic/block#char0671 should be taken with a grain of salt. For one, Modern Standard Arabic does use case endings (there is nothing “old” about them). Some people avoid them for fear of getting the rules wrong, but that is in no way a standard or generally celebrated practice:

The joining hamza is of little practical importance in modern arabic pronounced without the old case endings.

khaledhosny • Apr 11 '17 17:04

Here are some results from a Google Books search (with lots of other garbage results, though; indexing Arabic PDFs is a lost cause 😞):

https://books.google.com.eg/books?id=97l1BwAAQBAJ&pg=PA60&dq=%D9%B1&hl=en&sa=X&ved=0ahUKEwi73_iq85zTAhWDOBQKHaHECHU4FBDoAQhbMAg#v=onepage&q=%D9%B1&f=false https://books.google.com.eg/books?id=WpByAgAAQBAJ&pg=PA162&dq=%D9%B1&hl=en&sa=X&ved=0ahUKEwiPkauQ85zTAhVEPhQKHf6iBHEQ6AEIRDAG#v=onepage&q=%D9%B1&f=false

khaledhosny • Apr 11 '17 17:04

Since Google Books links are not always reliable, here's a snapshot from the second link (I couldn't see the first one):

https://books.google.com.eg/books?id=WpByAgAAQBAJ&pg=PA162&dq=%D9%B1&hl=en&sa=X&ved=0ahUKEwiPkauQ85zTAhVEPhQKHf6iBHEQ6AEIRDAG#v=onepage&q=%D9%B1&f=false

(image: screen shot 2017-04-11 at 5 50 40 pm)

behnam • Apr 11 '17 22:04

BTW, the Wasla sign (image: alefwasla) looks like the letter Sad ص followed by the final form of Heh ﻪ in some old style (image: sah). The resulting word is "Sah", meaning "Shut up!", used to demand silence. A Koran reader who reaches this kind of sign (above any letter) must pause.

ntounsi • Apr 11 '17 22:04

The resulting word is "Sah", meaning "Shut up!", used to demand silence : «don't pronounce this Alef».

Interesting!

Btw, let's keep the issue open until we file CLDR tickets and they are resolved.

behnam • Apr 11 '17 23:04

(image: sahinkoran)

ntounsi • Apr 11 '17 23:04

Comments from conf-call:

  • Arabic Question Mark is not present in any of the tables.
  • ASCII Question Mark should NOT be marked as "used" for either language.
  • Open question: whether we should have an ASCII table, or keep it as "Punctuations and Symbols".

@mostafah is going to work on fixing the script to address some of the problems.

behnam • Apr 18 '17 15:04

@ntounsi, @khaledhosny, would you use either of these two characters in Modern Arabic, other than in their precomposed forms with ALEF?

  • U+0653 ARABIC MADDAH ABOVE
  • U+0655 ARABIC HAMZA BELOW

behnam • Apr 27 '17 18:04

CLDR ticket filed: http://unicode.org/cldr/trac/ticket/10221

behnam • Apr 27 '17 19:04

@mostafah, assigning this to you to work on the script.

Also, would you please look into why these two are marked as main/used in Persian? I believe they shouldn't be marked at all.

  • U+2060 WORD JOINER
  • U+FEFF ZERO WIDTH NO-BREAK SPACE

behnam • Apr 27 '17 19:04

I don’t think U+0653 or U+0655 are likely to be used on their own, since they are almost always combined with alef, and precomposed characters for those combinations exist. However, several characters have canonical decompositions involving them, so NFD text will contain them.
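
A minimal sketch of the NFD point, assuming only Python's standard `unicodedata` module (the helper name `nfd_code_points` is made up for this example): normalizing the precomposed alef forms to NFD exposes U+0653 and U+0655 as separate combining marks, and NFC puts them back together.

```python
import unicodedata


def nfd_code_points(text):
    """Return the code points of `text` after NFD normalization, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


alef_madda = "\u0622"        # ARABIC LETTER ALEF WITH MADDA ABOVE
alef_hamza_below = "\u0625"  # ARABIC LETTER ALEF WITH HAMZA BELOW

print(nfd_code_points(alef_madda))        # alef followed by U+0653
print(nfd_code_points(alef_hamza_below))  # alef followed by U+0655

# NFC recomposes the decomposed sequence into the precomposed letter.
recomposed = unicodedata.normalize("NFC", "\u0627\u0653")
print(recomposed == alef_madda)
```

So a character table that only considers precomposed (NFC) text would never see these two marks, while one that must also cover NFD text would.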

khaledhosny • Apr 27 '17 19:04

@behnam Sure. Thanks for the CLDR ticket.

mostafah • Apr 29 '17 06:04

Created a new issue regarding Section A.5 Control characters: https://github.com/w3c/alreq/issues/127

behnam • Jun 27 '17 10:06

See this comment on #128.

shervinafshar • Feb 06 '18 17:02