big-list-of-naughty-strings
Normal arabic strings should be removed from the naughty list
The two strings below should be removed from the list. They are normal Arabic strings that are used frequently, and there is no risk with them:
"﷽", "ﷺ",
You can check the Arabic Unicode blocks here: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
Where in blns.txt do they appear?
Each section has a header explaining the purpose and most of them contain "ordinary stuff used by people every day which tends to break sites only tested against U.S. English".
In blns.txt: lines 209 & 210. In blns.json: lines 148 & 149.
I don't agree they should be removed. The header is clear on the purpose of those strings: they're for testing whether your app recognizes them as RTL strings.
They're "naughty" by definition because too many applications are coded on the assumption that, if your language isn't written left-to-right, your language itself is inherently "naughty".
I think OP's point is that those two examples are single codepoints, thus there is nothing to be ordered RTL or LTR.
﷽ = U+FDFD ﷺ = U+FDFA
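The single-codepoint claim is easy to check for yourself. A minimal sketch, using only Python's standard `unicodedata` module; the `describe` helper and the `CANDIDATES` list are names made up for this illustration, not anything from the repo:

```python
import unicodedata

# The two strings under discussion, written as escapes to avoid copy/paste damage.
CANDIDATES = ["\uFDFD", "\uFDFA"]

def describe(text):
    """Return (codepoint, Unicode name, bidi class) for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.bidirectional(ch))
        for ch in text
    ]

for candidate in CANDIDATES:
    # A length of 1 means there is only one codepoint, so nothing to reorder.
    print(len(candidate), describe(candidate))
```

Each string has a length of 1, so there is no sequence of characters whose order a layout engine could get wrong.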
Therefore, I agree with the OP that these two examples should be removed. They are the QA equivalent of "security theater": while they may look impressive to casual observers, they do not actually test anything.
...
OTOH, if your tests are also intended to check for problems in naughty GUI layout engines displaying words and phrases in the wrong order, not just the usual Bobby Tables, then you also need to construct some tests to see what happens when RTL and LTR text is interwoven, e.g. several Arabic phrases embedded within a larger English sentence with English text at the start, end, and in between; and vice versa. (I can vouch for getting the layout of mixed paragraphs right being a giant PITA, especially when one is not able to read both languages, making incorrect ordering of the RTL and LTR chunks VERY easy to miss.)
For example, if a layout engine by default reads LTR and encounters an English word at the start of a paragraph, it will infer that the rest of that paragraph is English too, and any Arabic chunks will be ordered LTR within the larger English text:
E1 1A E2 2A E3.
Likewise if the layout strictly observes the user's default LTR/RTL preferences:
1A E1 2A E2 3A E3 4A.
Conversely, if the layout engine can infer (or your own preferences dictate) that the overall paragraph should be laid out as RTL text, the English chunks will be laid out from RTL within the larger Arabic text:
.4A E3 3A E2 2A E1 1A.
.E3 2A E2 1A E1
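To make the "chunks" in the diagrams above concrete, here is a rough sketch that splits a string into directional runs and guesses a paragraph direction from the first strongly-directional character. It is a deliberate simplification for building and inspecting test strings, not an implementation of the Unicode Bidirectional Algorithm; the function names are hypothetical, and the only library call is the standard `unicodedata.bidirectional`.

```python
import unicodedata

def strong_direction(ch):
    """Return 'LTR' or 'RTL' for strongly-directional characters, else None."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "LTR"
    if bidi in ("R", "AL"):
        return "RTL"
    return None  # digits, spaces, punctuation, etc.

def first_strong_direction(text, default="LTR"):
    """Guess the paragraph direction from the first strong character."""
    for ch in text:
        direction = strong_direction(ch)
        if direction:
            return direction
    return default

def directional_runs(text):
    """Split text into (direction, chunk) runs.

    Simplification: characters without a strong direction are attached
    to the run that precedes them (or to the first run if they lead).
    """
    runs = []  # list of [direction, chunk]
    for ch in text:
        direction = strong_direction(ch)
        if not runs:
            runs.append([direction, ch])
        elif direction is None or direction == runs[-1][0]:
            runs[-1][1] += ch
        elif runs[-1][0] is None:
            runs[-1][0] = direction
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(direction, chunk) for direction, chunk in runs]

sample = "Hello ابو سعيد world"
print(first_strong_direction(sample))
for direction, chunk in directional_runs(sample):
    print(direction, repr(chunk))
```

Listing the runs in logical order gives you something to compare against a screenshot when you cannot read one of the scripts yourself.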
...
Needless to say, if the layout engine or user guess wrong, and lay out primarily Arabic text for the benefit of an English reader, that is going to cause its intended Arabic readers much upset. (Or lay out a paragraph for English readers with the start of the sentence at the top-right and end of sentence at top-left, though given the current bias towards LTR languages that is the less common case.)
Also be prepared for the final end-of-sentence punctuation characters to appear at the wrong end of the line, so that even in a completely Arabic paragraph (RTL) the final period (.) displays at the RHS of the paragraph's final line instead of its LHS:
cibarA ,cibarA cibarA cibarA
cibarA cibarA.
instead of:
cibarA ,cibarA cibarA cibarA
.cibarA cibarA
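If you want to find test strings that are candidates for this misplaced-punctuation problem, one rough approach is to look for strings whose last strongly-directional character is RTL but which end in punctuation or other non-strong characters. The sketch below does that, and also builds a variant with U+200F RIGHT-TO-LEFT MARK appended so both cases can be put in front of a renderer. The function name is hypothetical, and whether a given engine places the period correctly in either variant is something only a visual check can confirm.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK, an invisible strongly-RTL character

def trailing_non_strong_after_rtl(text):
    """Return the trailing non-strong characters if they follow RTL text, else ''."""
    trailing = []
    for ch in reversed(text):
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "".join(reversed(trailing))
        if bidi == "L":
            return ""  # the text ends in LTR content, not our case
        trailing.append(ch)
    return ""  # no strong characters at all

arabic_sentence = "ابو سعيد."
with_mark = arabic_sentence + RLM

print(repr(trailing_non_strong_after_rtl(arabic_sentence)))
print(repr(trailing_non_strong_after_rtl(with_mark)))
```

The first string is flagged because its final period has no strong character after it; the second is not, because the mark itself is strongly RTL.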
I don't have any actual example text here, unfortunately (at least, not that I'm able to share). However, perhaps OP would like to create a couple of variations on, say, the following paragraph:
We would like to welcome Sir Isaac Newton, our good friend Dr. Grace D. N. Smith, M.D., and the very special Ms. Lisa Z. Johnson-Smythe to our latest gathering.
In one example, replace all the English names with equally complex Arabic/Persian names (e.g. "ابو سعيد الضرير الجرجاني" instead of Sir Isaac Newton), and in another replacing all the other words. That'll give you one English paragraph containing Arabic chunks, and an Arabic paragraph containing English chunks. Then make screenshots of how the text should correctly be laid out in each case for reference.
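The first of those two variants (English paragraph, Arabic names) can be generated mechanically. A minimal sketch, assuming the same Arabic name is reused for all three people purely as a placeholder; the constant and function names are made up for this example. The second variant needs a real Arabic translation of the surrounding words, which is not attempted here.

```python
ENGLISH_PARAGRAPH = (
    "We would like to welcome Sir Isaac Newton, our good friend "
    "Dr. Grace D. N. Smith, M.D., and the very special "
    "Ms. Lisa Z. Johnson-Smythe to our latest gathering."
)

ENGLISH_NAMES = [
    "Sir Isaac Newton",
    "Dr. Grace D. N. Smith, M.D.",
    "Ms. Lisa Z. Johnson-Smythe",
]

# Placeholder: the one Arabic name quoted in this thread, reused three times.
ARABIC_NAME = "ابو سعيد الضرير الجرجاني"

def embed_rtl_names(paragraph, names, replacement):
    """Replace each LTR name with the RTL replacement, leaving other text alone."""
    for name in names:
        paragraph = paragraph.replace(name, replacement)
    return paragraph

mixed_paragraph = embed_rtl_names(ENGLISH_PARAGRAPH, ENGLISH_NAMES, ARABIC_NAME)
print(mixed_paragraph)
```

How the printed result looks in your terminal or editor is itself a small demonstration of the problem, so pair the string with a reference screenshot from someone who reads both scripts.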
...
This is partly a technology problem: even in an "intelligent", "fully Unicode-aware" layout engine that knows something about human language rules and the larger context and meaning of a given text, accurately inferring whether the overall layout of a mixed paragraph should be RTL or LTR is tricky and ambiguous, and quickly devolves into a hard AI problem, since the paragraph's rendering hinges on whether it is an English text with Arabic phrases to be read by English readers, or an Arabic text containing English phrases to be read by Arabic readers.
However, it is also very much a developer (and user) lack-of-awareness problem. Monolingual Western software developers with no experience of RTL scripts naturally assume that a "Unicode Aware" text layout engine will not only put all the Arabic words RTL on screen, but will also automatically resolve all the other ordering issues too: automatically right-aligning the paragraph, displaying the end-of-paragraph punctuation at the left end of the last line, and arranging a mixture of English-phrases-in-Arabic-paragraph so that the entire text reads correctly to Arabic readers. But just because all the bytes in a Unicode string are in the right order doesn't mean the glyphs on screen will appear in the correct positions too. And if you can't read both scripts yourself then you won't be able to tell whether mixed content is displaying correctly or not, until users, on seeing their text displayed on screen or in print, scream that you've completely mangled their work.
As I say, you will also have to provide screenshots of right (and ideally also wrong) text layout examples, since even the standard text rendering engines on your PC do all sorts of weird and wonderful things, so even they can't be trusted to display correctly. (e.g. I note with amusement that TextEdit displays "مُنَاقَشَةُ سُبُلِ اِسْتِخْدَامِ اللُّغَةِ فِي النُّظُمِ الْقَائِمَةِ وَفِيم يَخُصَّ التَّطْبِيقَاتُ الْحاسُوبِيَّةُ، " with the punctuation character at the left, while Firefox and Safari put it at the right.)
Perhaps OP (hazemsq), who I'm presuming is fluent in both English and Arabic, might be able to help you formulate a comprehensive set of test strings and screenshots, or else direct you to others who can. (There are also commercial translation agencies that are experts both in converting text and in knowing all the exciting and obvious/arcane ways in which different apps and OSes manage to mangle its content and/or display, but those'll cost you.) ...
OTOH, if your test strings are ONLY intended to check for Bobby Tables cockups, and on-screen display is Not Your Problem, then you need to make this absolutely clear too, so that users understand that your sample strings are only a partial test designed to check for string handling bugs, not text display problems.
p.s. You should also try strings containing mixtures of ligatures and non-ligature characters, since both Western scripts, e.g. "ffi"/"ﬃ", and Arabic scripts support ligatures (in the latter case, they're required for proper as-written display). Unicode rules for Arabic even have to dictate how each character must use one of several different glyphs depending on whether it stands alone or is at the start, middle, or end of a word.
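For anyone building such ligature test strings, it helps to know which precomposed characters collapse back to plain letters under compatibility normalization. A small sketch using the standard `unicodedata.normalize`; the choice of the Latin "ffi" ligature (U+FB03) and the Arabic letter BEH presentation forms (U+FE8F to U+FE92) is just for illustration, and the variable names are made up.

```python
import unicodedata

LATIN_LIGATURE = "\ufb03"   # single-codepoint "ffi" ligature
PLAIN_LETTERS = "ffi"       # three separate codepoints

# Precomposed presentation forms of the Arabic letter BEH (U+0628).
BEH_PRESENTATION_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

def nfkc(text):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

# The ligature and the plain letters differ as stored, but match after NFKC.
print(LATIN_LIGATURE == PLAIN_LETTERS, nfkc(LATIN_LIGATURE) == PLAIN_LETTERS)

for form in BEH_PRESENTATION_FORMS:
    # Print each form's official name alongside what it normalizes to.
    print(unicodedata.name(form), "->", f"U+{ord(nfkc(form)):04X}")
```

An app that normalizes input will treat these as the same text, while one that compares raw codepoints will not, so both kinds of string are worth having in a test set.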
p.p.s. Oh, and don't even start me on fonts that support contextual alternates, especially when the alternate glyphs for existing characters do not have a codepoint associated at all, which means some APIs will substitute the dread Unicode Placeholder character instead of the original standard character when you try to get a string back out. (I recently had "fun" with GelatoScript, for instance, which supplied alternate "swooshier" (and codepoint-less!) glyphs for A-Z characters in addition to its standard glyphs so that graphic designers can select whichever one looks best on their artwork.)
p.p.p.s. Good work and best of luck. We all live in a global, not ASCII, world now (not to mention a shamelessly insecure one too), and desperately need more robust education and tooling like this.
TL;DR of the above post - single-character strings aren't really LTR or RTL, and lots of useful, interesting content including suggestions, real-world usage notes, the solution to hard AI, suggestion for clarification of repo description and many other things.