CheatSheetSeries

Update: Input Validation

Open mattt opened this issue 3 years ago • 4 comments

What is missing or needs to be updated?

The section "Validating free-form Unicode text" describes the following as one of the primary means of validating free-form text input:

Character category whitelisting: Unicode allows whitelisting categories such as "decimal digits" or "letters" which not only covers the Latin alphabet but also various other scripts used globally (e.g. Arabic, Cyrillic, CJK ideographs etc).
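A minimal sketch of what category-based allow-listing could look like, using Python's standard `unicodedata` module. The function names, the length limit, the extra literal characters, and the decision to accept combining marks alongside letters and decimal digits are all assumptions made for illustration; they are not taken from the cheat sheet.

```python
import unicodedata

# Assumption: accept every letter category (Lu, Ll, Lt, Lm, Lo) and every
# combining-mark category (Mn, Mc, Me), plus decimal digits (Nd).
ALLOWED_PREFIXES = ("L", "M")
ALLOWED_CATEGORIES = {"Nd"}
# Hypothetical extra characters a name-like field might need.
ALLOWED_LITERALS = {" ", "-", "'"}


def is_allowed_char(ch: str) -> bool:
    """Return True if ch falls into an allow-listed Unicode general category."""
    category = unicodedata.category(ch)
    return (
        category.startswith(ALLOWED_PREFIXES)
        or category in ALLOWED_CATEGORIES
        or ch in ALLOWED_LITERALS
    )


def validate_free_text(text: str, max_length: int = 100) -> bool:
    """Normalize to NFC, then require a bounded length and only allowed characters."""
    normalized = unicodedata.normalize("NFC", text)
    if not 0 < len(normalized) <= max_length:
        return False
    return all(is_allowed_char(ch) for ch in normalized)
```

With these assumptions, `validate_free_text("Анна")` and `validate_free_text("山田太郎")` pass, while `validate_free_text("<script>")` is rejected because `<` and `>` are math symbols (category Sm).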

How should this be resolved?

It would be helpful to include links to the following Unicode technical reports as external references:

For example, UAX 31 provides the following guidance for implementing #hashtag functionality in an application, which I think does a good job illustrating the inherent complexity of working with Unicode text while also offering useful advice:

UAX31-R8. Extended Hashtag Identifiers: To meet this requirement, to determine whether a string is a hashtag identifier an implementation shall use definition UAX31-D2, setting:

  1. Start := [#﹟＃]
    • U+0023 NUMBER SIGN
    • U+FE5F SMALL NUMBER SIGN
    • U+FF03 FULLWIDTH NUMBER SIGN
    • (These are # and its compatibility equivalents.)
  2. Medial is currently empty, but can be used for customization.
  3. Continue := XID_Continue, plus Extended_Pictographic, Emoji_Component, and “_”, “-”, “+”, minus Start characters.
    • Note the subtraction of # characters.
    • This is expressed in set notation as:
      [\p{XID_Continue}\p{Extended_Pictographic}\p{Emoji_Component}[-+]-[#﹟＃]]
  • Alternatively, it shall declare that it uses a profile as in UAX31-R1.

The Emoji properties are from the corresponding version of [UTS51]. The version of the emoji properties is tied to the version of the Unicode Standard, starting with Version 11.0.

The grandfathering techniques mentioned in Section 2.5 Backward Compatibility may be used where stability between successive versions is required.
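The Start and Continue sets quoted above can be roughly approximated with the Python standard library alone. This is a sketch under stated assumptions, not a conforming UAX31-R8 implementation: the standard library does not expose Extended_Pictographic or Emoji_Component, so emoji are left out; XID_Continue is probed indirectly through `str.isidentifier()`, which in CPython is understood to follow XID_Start and XID_Continue; and requiring at least one Continue character is an extra restriction added here. All names are hypothetical.

```python
# Start := U+0023 NUMBER SIGN, U+FE5F SMALL NUMBER SIGN, U+FF03 FULLWIDTH NUMBER SIGN
START_CHARS = {"\u0023", "\uFE5F", "\uFF03"}
# The literal additions to Continue named in the quoted requirement.
EXTRA_CONTINUE = {"_", "-", "+"}


def is_xid_continue(ch: str) -> bool:
    """Approximate XID_Continue: "a" is a valid identifier start, so
    ("a" + ch) is an identifier only if ch is a valid continuation."""
    return ("a" + ch).isidentifier()


def is_hashtag_identifier(s: str) -> bool:
    """Check <Start> <Continue>+ (Medial is empty in the quoted definition).

    Assumption: emoji properties are omitted, so hashtags containing emoji
    are rejected by this sketch even though UAX31-R8 allows them.
    """
    if len(s) < 2 or s[0] not in START_CHARS:
        return False
    for ch in s[1:]:
        # Mirror the set subtraction: Start characters are never Continue.
        if ch in START_CHARS:
            return False
        if not (is_xid_continue(ch) or ch in EXTRA_CONTINUE):
            return False
    return True
```

Under these assumptions `is_hashtag_identifier("#MötleyCrüe")` is accepted, while `is_hashtag_identifier("#two words")` and `is_hashtag_identifier("#foo#bar")` are rejected.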

Comparison and matching should be done after converting to NFKC_CF format. Thus #MötleyCrüe should match #MÖTLEYCRÜE and other variants.
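As an illustration of that comparison step, here is a hedged Python sketch. Python has no built-in NFKC_Casefold operation, so this approximates it by combining `unicodedata.normalize("NFKC", ...)` with `str.casefold()`; unlike the real NFKC_CF mapping it does not remove default ignorable code points. The function names are hypothetical.

```python
import unicodedata


def comparison_key(s: str) -> str:
    """Approximate NFKC_Casefold: normalize, case-fold, then normalize again
    because case folding can produce text that is no longer normalized."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


def hashtags_match(a: str, b: str) -> bool:
    """Compare two hashtags by their approximate NFKC_CF keys."""
    return comparison_key(a) == comparison_key(b)
```

With this key, `hashtags_match("#MötleyCrüe", "#MÖTLEYCRÜE")` holds, and so does a comparison against a variant that spells "ö" as "o" followed by U+0308 COMBINING DIAERESIS or that starts with the fullwidth number sign.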

This information could also be incorporated into the next section, discussing regular expressions.

mattt, Feb 22 '21 13:02

@mattt good idea. Do you want to make a PR with these changes?

mackowski, Feb 22 '21 14:02

@mackowski Sure thing! I'll have that ready for y'all to review soon.

mattt, Feb 22 '21 17:02

We're ready for a PR, or just let us know via text here (with a little more detail) and I'll take care of it for you. Thanks @mattt!

jmanico, Mar 20 '21 01:03

Hey @mattt this is an old issue, do you want to make a PR for it? :)

mackowski, Jun 20 '22 17:06

@mattt are you still planning to work on this? Otherwise I might have time to handle it.

szh, Nov 06 '22 22:11

Sorry, I don't have the bandwidth to work on this right now. Anyone else is welcome to pick this up.

mattt, Nov 07 '22 14:11