Canonical versions?

Open cbetta opened this issue 8 years ago • 2 comments

It would be cool to be able to use this to prevent similar-looking usernames in a DB. What would it take to make that happen? I imagine one thing that would help is a way to convert a string into some identifier or canonical form.

Unicode::Confusable.clarify("ℜ𝘂ᖯʏ") #=> "Ruby"

# or

Unicode::Confusable.id_for("ℜ𝘂ᖯʏ") #=> "123abc"
Unicode::Confusable.id_for("Ruby") #=> "123abc"

cbetta avatar Mar 13 '16 17:03 cbetta

The comparison process creates a "skeleton" of each string, and the two skeletons are then compared directly (= same bytes). I haven't mentioned in the README how to generate one, but it is available as a public method:

Unicode::Confusable.skeleton("ℜ𝘂ᖯʏ") #=> "Ruby"

However, it is not a unique representation, quoting the standard:

Note: The strings skeleton(X) and skeleton(Y) are not intended for display, storage or
transmission. They should be thought of as an intermediate processing form, similar to a
hashcode. The characters in skeleton(X) and skeleton(Y) are not guaranteed to be identifier
characters. 

So it cannot act as an identifier on its own, but personally, I cannot see what would be wrong with storing it in an additional database column, which you can then use for skeleton lookups.
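To make the extra-column idea concrete, here is a minimal sketch that uses an in-memory hash in place of a real table with a unique index on the skeleton column. `UsernameRegistry`, `register`, and `taken?` are hypothetical names for illustration only and are not part of the gem; the skeleton function is passed in as a callable so the class does not depend on how the skeleton is produced.

```ruby
# Hypothetical sketch: an in-memory stand-in for a users table that has an
# additional, uniquely indexed "skeleton" column. Not part of the gem.
class UsernameRegistry
  # skeletonizer: any callable that maps a string to its skeleton.
  # With the gem loaded this could be Unicode::Confusable.method(:skeleton).
  def initialize(skeletonizer)
    @skeletonizer = skeletonizer
    @by_skeleton = {} # skeleton => original username, as entered by the user
  end

  # Stores the username and returns true, or returns false when an
  # already registered username has the same skeleton.
  def register(username)
    key = @skeletonizer.call(username)
    return false if @by_skeleton.key?(key)

    @by_skeleton[key] = username
    true
  end

  # True when some registered username shares this username's skeleton.
  def taken?(username)
    @by_skeleton.key?(@skeletonizer.call(username))
  end
end
```

The original username stays the value that is displayed and stored as the identifier; the skeleton is only used as a lookup key, which seems in line with the standard's "intermediate processing form" wording.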

Related specifications (unicode identifiers/security):

  • http://www.unicode.org/reports/tr31/
  • http://www.unicode.org/reports/tr39/
  • http://www.unicode.org/reports/tr36/

janlelis avatar Mar 14 '16 09:03 janlelis

Awesome. Any idea why the spec says you can't use them for identifying purposes? Seems pretty accurate to me from some of the tests I ran.

cbetta avatar Mar 14 '16 10:03 cbetta