character based section I18N considerations [I18N]

Open aphillips opened this issue 5 years ago • 6 comments

2.1.1 Traditional Character-Based CAPTCHA https://w3c.github.io/apa/captcha/#traditional-character-based-captcha

While some sites have begun providing CAPTCHAs utilizing languages other than English, an assumption that all web users can understand and reproduce English predominates. Clearly, this is not the case. Arabic or Thai speakers, for example, should not be assumed to possess a proficiency with the ISO 8859-1 character set [iso-8859-1], let alone have a keyboard that can easily produce those characters in the CAPTCHA's form field. Research has demonstrated how CAPTCHAs based on written English impose a significant barrier to many on the web; see Effects of Text Rotation, String Length, and Letter Format on Text-based CAPTCHA Robustness [captcha-robustness].

The above text has several potential issues:

  1. ISO 8859-1 ("Latin-1") is possibly not the best reference here, since what is probably meant is ASCII letters and digits. The difference between Latin-1 and ASCII is the various accented letters, which are not widely used in CAPTCHA.
  2. Virtually all computing systems have a means of inputting ASCII, so saying that users might not have a "keyboard that can easily produce those characters" is probably false.
  3. The reverse is not true. CAPTCHA images containing non-ASCII text may prove difficult to use if the user does not have the appropriate keyboard available. It is difficult to determine on the server side what the input capabilities of a given user agent include.
  4. Many characters or writing systems are difficult to discern when distorted. This includes accented Latin-script letters, cursive scripts such as Arabic, and of course Han ideographs.
  5. It has been observed that using actual words for CAPTCHA improves accuracy, but of course this depends on being fluent in the language in question.
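
As a rough illustration of points 1 and 3 (not part of the original review comment), the Python sketch below sorts a candidate challenge string by the smallest repertoire that covers it: ASCII, ISO 8859-1, or anything beyond. The function names `repertoire` and `latin1_only_letters` and the sample strings are hypothetical, chosen only for this example.

```python
def repertoire(challenge: str) -> str:
    """Return the smallest of 'ascii', 'latin-1', or 'other' that covers the string."""
    if challenge.isascii():
        return "ascii"
    try:
        # ISO 8859-1 covers exactly the code points U+0000..U+00FF.
        challenge.encode("latin-1")
    except UnicodeEncodeError:
        return "other"
    return "latin-1"


def latin1_only_letters() -> str:
    """Letters that exist in ISO 8859-1 but not in ASCII (mostly accented letters)."""
    return "".join(chr(cp) for cp in range(0x80, 0x100) if chr(cp).isalpha())


if __name__ == "__main__":
    # Hypothetical challenge strings, for illustration only.
    for sample in ("W3C29", "café", "สวัสดี"):
        print(sample, "->", repertoire(sample))
    print(latin1_only_letters())
```

A check like this only describes the challenge text. It says nothing about what the user's keyboard or input method can actually produce, which, as point 3 notes, is hard to determine on the server side.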

*This comment is part of the I18N horizontal review.*

aphillips avatar Jun 03 '19 19:06 aphillips

Thanks, Addison, and thanks to your I18N colleagues.

We've redrafted the relevant paragraph and hope we've captured your comment. The rewrite is the last paragraph at this section in the document:

https://w3c.github.io/apa/captcha/#traditional-character-based-captcha

Best,

Janina

--

Janina Sajka

Linux Foundation Fellow Executive Chair, Accessibility Workgroup: http://a11y.org

The World Wide Web Consortium (W3C), Web Accessibility Initiative (WAI) Chair, Accessible Platform Architectures http://www.w3.org/wai/apa

JaninaSajka avatar Jun 04 '19 20:06 JaninaSajka

The new text is:

Clearly, this is not the case. Users of Arabic or Thai character sets, for example, may not be familiar with the English alphabet or may not have enough knowledge to identify a distorted version of such characters. Furthermore, the default keyboard is likely to be localized, potentially making it difficult to enter an alternative character set unless specifically set up to do so. Research has demonstrated how CAPTCHAs based on written English impose a significant barrier to many on the web; see Effects of Text Rotation, String Length, and Letter Format on Text-based CAPTCHA Robustness [captcha-robustness].

It actually seems to be a little less clear to me. Here's an attempt to provide text:

Clearly, this is not the case. Research has demonstrated how CAPTCHAs based on written English impose a significant barrier to many on the web; see Effects of Text Rotation, String Length, and Letter Format on Text-based CAPTCHA Robustness [captcha-robustness]. This problem is likely to increase when using Latin-script characters beyond the ASCII range, with accents and diacritics, or shapes not included in the set used for English. For example, speakers of Arabic or Thai may not have enough knowledge to identify a distorted version of such characters. Furthermore, users may not have the necessary keys available on their local keyboard.

r12a avatar Jun 24 '19 15:06 r12a

I agree with @r12a's suggestions.

I also sense that this text (omitted in the quote above):

While some sites have begun providing CAPTCHAs utilizing languages other than English, an assumption that all web users can understand and reproduce English predominates.

... is trying to address the use of English words in the CAPTCHA (since humans can better figure out highly distorted text if it "spells something"). This requires a knowledge of English spelling if the words are in English (English is famous for the irregularity of its spelling). The larger problem is: any CAPTCHA based on words requires familiarity with the lexical norms of the target language.

aphillips avatar Jun 24 '19 16:06 aphillips

Thank you. This language is now in the Editor's Draft. Sorry it didn't make it for the wide review publication.

JaninaSajka avatar Jun 26 '19 18:06 JaninaSajka

Hi,

Addison Phillips writes:

I agree with @r12a's suggestions. They're now in the Editor's Draft. My bad for not catching them in time for the wide review publication. Sorry!

I also sense that this text (omitted in the quote above):

While some sites have begun providing CAPTCHAs utilizing languages other than English, an assumption that all web users can understand and reproduce English predominates.

... is trying to address the use of English words in the CAPTCHA (since humans can better figure out highly distorted text if it "spells something"). This requires a knowledge of English spelling if the words are in English (English is famous for the irregularity of its spelling). The larger problem is: any CAPTCHA based on words requires familiarity with the lexical norms of the target language.

Not sure our thinking was that sophisticated, but should we try to capture this somehow? I think you're saying the problem is exacerbated when the task involves recognizing words, not just chars. I can understand how that would make recognizing the chars harder, but is it a distinction with a functional difference, since one is still reproducing a string of chars?

JaninaSajka avatar Jun 26 '19 18:06 JaninaSajka

Under https://w3c.github.io/apa/captcha/#the-accessibility-challenge 1.2 The Accessibility Challenge

it already says that

traditional CAPTCHAs have generally presumed that all web users can read and transcribe English-based words and characters, thus making the test inaccessible to a large number of non-English speaking web users worldwide.

A good place to mention this again would be https://w3c.github.io/apa/captcha/#version-2-are-you-a-robot

One reCAPTCHA V. 2 innovation seems most promising. Rather than reproduce characters, users are asked to type the words they see (or hear). It even appears unnecessary to spell these correctly or to enter all the words presented in order to be adjudged human.

If you really want to cover all the bases, it should probably also be mentioned in the section that talks about sound output, see https://w3c.github.io/apa/captcha/#sound-output

r12a avatar Jun 27 '19 09:06 r12a